rocksdb

Author	SHA1	Message	Date
Zhichao Cao	4246888101	Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487 ) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. The user is informed of the Status and needs to take action to deal with it (e.g., call db->Resume()). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487 Test Plan: Added a test case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0	2020-03-27 16:04:43 -07:00
Levi Tamasi	6f62322fe4	Add blob files to VersionStorageInfo/VersionBuilder (#6597 ) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blobs, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80	2020-03-26 18:51:53 -07:00
Levi Tamasi	6301dbe7a7	Use function objects as deleters in the block cache (#6545 ) Summary: As the first step of reintroducing eviction statistics for the block cache, the patch switches from using simple function pointers as deleters to function objects implementing an interface. This will enable using deleters that have state, like a smart pointer to the statistics object that is to be updated when an entry is removed from the cache. For now, the patch adds a deleter template class `SimpleDeleter`, which simply casts the `value` pointer to its original type and calls `delete` or `delete[]` on it as appropriate. Note: to prevent object lifecycle issues, deleters must outlive the cache entries referring to them; `SimpleDeleter` ensures this by using the ("leaky") Meyers singleton pattern. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545 Test Plan: `make asan_check` Reviewed By: siying Differential Revision: D20475823 Pulled By: ltamasi fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a	2020-03-26 16:19:58 -07:00
Mike Kolupaev	963af52f15	Fix iterator reading filter block despite read_tier == kBlockCacheTier (#6562 ) Summary: We're seeing iterators with `ReadOptions::read_tier == kBlockCacheTier` sometimes doing file reads. Stack trace: ``` rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice, char, bool) const rocksdb::BlockFetcher::ReadBlockContents() rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>, rocksdb::BlockType, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::BlockContents) const rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>, rocksdb::BlockType, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, bool, bool) const rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const, rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>) rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>) const rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext) const rocksdb::FullFilterBlockReader::RangeMayExist(rocksdb::Slice const, rocksdb::Slice const&, rocksdb::SliceTransform const, rocksdb::Comparator const, rocksdb::Slice const, bool, bool, rocksdb::BlockCacheLookupContext) rocksdb::BlockBasedTable::PrefixMayMatch(rocksdb::Slice const&, rocksdb::ReadOptions const&, rocksdb::SliceTransform const, bool, rocksdb::BlockCacheLookupContext) const rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::SeekImpl(rocksdb::Slice const) rocksdb::ForwardIterator::SeekInternal(rocksdb::Slice const&, bool) rocksdb::DBIter::Seek(rocksdb::Slice const&) ``` `BlockBasedTableIterator::CheckPrefixMayMatch` was missing a check for `kBlockCacheTier`. This PR adds it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6562 Test Plan: deployed it to a logdevice test cluster and looked at logdevice's IO tracing. Reviewed By: siying Differential Revision: D20529368 Pulled By: al13n321 fbshipit-source-id: 65bf33964b1951464415c900336635fb20919611	2020-03-26 15:21:26 -07:00
sdong	6fd0ed4993	CompactRange() to use bottom pool when goes to bottommost level (#6593 ) Summary: In automatic compaction, if a compaction is bottommost, it goes to bottom thread pool. We should do the same for manual compaction too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6593 Test Plan: Add a unit test. See all existing tests pass. Reviewed By: ajkr Differential Revision: D20637408 fbshipit-source-id: cb03031e8f895085f7acf6d2d65e69e84c9ddef3	2020-03-24 20:24:32 -07:00
Huisheng Liu	a6ce5c823b	multiget support for timestamps (#6483 ) Summary: Add timestamp support for MultiGet(). The timestamp from ReadOptions is honored, and timestamps can be returned along with values. MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within margin of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `17bef7d3a`): multireadrandom : 104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found) This PR: multireadrandom : 104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found) .\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483 Reviewed By: anand1976 Differential Revision: D20498373 Pulled By: riversand963 fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627	2020-03-24 11:24:09 -07:00
sdong	921cdd37e2	Fix bug that number of table loading threads is set as a boolean (#6576 ) Summary: When applying a new version in the non-DB-open case, optimize_filters_for_hits is used for max_threads, which is clearly a bug. It is not clear what the intended value was in the first place, but the value 1 makes sense here, as it would create no extra threads. This bug is not expected to cause user-visible problems, since C++ implicitly converts bool to 0 or 1. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6576 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D20602467 fbshipit-source-id: 40b2cd8619aba09ae9242b36c415464db3c9b737	2020-03-24 10:17:40 -07:00
anand76	a9d168cfd7	Simplify migration to FileSystem API (#6552 ) Summary: The current Env/FileSystem API separation has a couple of issues - 1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation. 2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes. This PR solves the above issues and simplifies the migration in the following ways - 1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```. 1a. - This also makes it more robust by ensuring that even if RocksDB has some stray calls to Env APIs rather than FileSystem, they will go through the same object and thus there is no risk of getting out of sync. 2. Provide a ```NewCompositeEnv()``` API that can be used to construct a PosixEnv with a custom FileSystem implementation. This eliminates an indirection to call Env APIs, and relieves the FileSystem developer of the burden of having to implement wrappers for the Env APIs. 3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and ```NewLogger()``` Tests: 1. New unit tests 2. make check and make asan_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552 Reviewed By: riversand963 Differential Revision: D20592038 Pulled By: anand1976 fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7	2020-03-23 21:54:21 -07:00
Zhichao Cao	d300d10962	Fix the MultiGet testing failure in Circleci (#6578 ) Summary: The MultiGet test in db_basic_test fails in CircleCI vs2019. The reason is that even when Snappy compression is enabled, the first compression type is still kNoCompression. This PR checks the list and ensures that compression is used only when compression is enabled and the compression type is valid, so that the combined read test in MultiGet does not fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6578 Test Plan: make check, db_basic_test. Reviewed By: anand1976 Differential Revision: D20607529 Pulled By: zhichao-cao fbshipit-source-id: dcead264d5c2da105912c18caad34b8510bb04b0	2020-03-23 18:51:09 -07:00
Yanqin Jin	617f479266	Fix LITE build (#6575 ) Summary: Fix LITE build by excluding some unit tests that use features not supported in LITE. ``` db/db_basic_test.cc:1778:8: error: ‘void rocksdb::{anonymous}::TableFileListener::OnTableFileCreated(const rocksdb::TableFileCreationInfo&)’ marked ‘override’, but does not override void OnTableFileCreated(const TableFileCreationInfo& info) override { ^~~~~~~~~~~~~~~~~~ make: *** [db/db_basic_test.o] Error 1 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6575 Reviewed By: ltamasi Differential Revision: D20598598 Pulled By: riversand963 fbshipit-source-id: 367f7cb2500360ad57030b138a94c0f731a04339	2020-03-23 13:05:36 -07:00
Zhichao Cao	5c6346c420	Revert "Added the safe-to-ignore tag to version_edit (#6530 )" (#6569 ) Summary: This reverts commit `e10553f2a6`. Pass make asan_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6569 Reviewed By: riversand963 Differential Revision: D20574319 Pulled By: zhichao-cao fbshipit-source-id: ce36981a21596f5f2e14da6a59a2bb3619509a8b	2020-03-23 10:27:47 -07:00
Yanqin Jin	fb09ef05dc	Attempt to recover from db with missing table files (#6334 ) Summary: There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, RocksDB previously just failed the recovery and returned a non-ok status. This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and tries to recover to the most recent state without missing table files. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version. `DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will not be performed. To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush. Test plan (on devserver): ``` $make check $COMPILE_WITH_ASAN=1 make all && make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334 Reviewed By: anand1976 Differential Revision: D19778960 Pulled By: riversand963 fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8	2020-03-20 19:30:48 -07:00
Cheng Chang	5fd152b7ad	Get block size only in direct IO mode (#6522 ) Summary: When `use_direct_reads` and `use_direct_writes` are `false`, `logical_sector_size_` inside various `*File` implementations is not actually used, so `GetLogicalBlockSize` does not need to be called for `logical_sector_size_`; a default page size is set instead. This is a follow up PR for https://github.com/facebook/rocksdb/pull/6457. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6522 Test Plan: make check Reviewed By: siying Differential Revision: D20408885 Pulled By: cheng-chang fbshipit-source-id: f2d3808f41265237e7fa2c0be9f084f8fa97fe3d	2020-03-20 15:26:10 -07:00
Zhichao Cao	e10553f2a6	Added the safe-to-ignore tag to version_edit (#6530 ) Summary: Each time RocksDB switches to a new MANIFEST file from old one, it calls WriteCurrentStateToManifest() which writes a 'snapshot' of the current in-memory state of versions to the beginning of the new manifest as a bunch of version edits. We can distinguish these version edits from other version edits written during normal operations with a custom, safe-to-ignore tag. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6530 Test Plan: added test to version_edit_test, pass make asan_check Reviewed By: riversand963 Differential Revision: D20524516 Pulled By: zhichao-cao fbshipit-source-id: f1de102f5499bfa88dae3caa2f32c7f42cf904db	2020-03-19 11:30:26 -07:00
Levi Tamasi	442404558a	Clean up VersionBuilder a bit (#6556 ) Summary: The whole point of the pimpl idiom is to hide implementation details. Internal helper methods like `CheckConsistency`, `CheckConsistencyForDeletes`, and `MaybeAddFile` do not belong in the public interface of the class. In addition, the patch switches to `unique_ptr` for the implementation object instead of using a raw `delete`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6556 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20523568 Pulled By: ltamasi fbshipit-source-id: 5bbb0ccebd0c47a33b815398c7f9cfe13bd775ac	2020-03-19 10:44:16 -07:00
Yanqin Jin	66ed58083a	Reduce runtime of db_with_timestamp_basic_test (#6546 ) Summary: Reduce runtime by reducing test scale to avoid test time-outs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6546 Test Plan: time ./db_with_timestamp_basic_test and watch internal tests. Reviewed By: zhichao-cao Differential Revision: D20479292 Pulled By: riversand963 fbshipit-source-id: c9e4a155be7699dd4de60fa531de86d442a3ba0a	2020-03-17 10:50:48 -07:00
Cheng Chang	402da454cb	Migrate AppVeyor to CircleCI (#6518 ) Summary: CircleCI is the new recommended CI system internally. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6518 Test Plan: Watch https://app.circleci.com/pipelines/github/facebook/rocksdb Differential Revision: D20454743 Pulled By: cheng-chang fbshipit-source-id: 39031568d6c1d3d25b7fbd78fa9a0e6067ddc47c	2020-03-13 21:58:51 -07:00
Cheng Chang	23eae14d24	Destroy DB at the end of each test in db_logical_block_size_cache_test (#6532 ) Summary: If DB is not deleted, in concurrent test, the tests might fail because of the previously existing DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6532 Test Plan: make clean && make -j24 LITE=1 db_logical_block_size_cache_test && ./db_logical_block_size_cache_test make clean && make -j24 db_logical_block_size_cache_test && ./db_logical_block_size_cache_test Differential Revision: D20454734 Pulled By: cheng-chang fbshipit-source-id: 8abede2ec1d79c1a4fe1bc95fbda489f8f7ee052	2020-03-13 21:53:38 -07:00
Zhichao Cao	5c30e6c088	Separate timestamp related test from db_basic_test (#6516 ) Summary: In some test runs, db_basic_test may time out because of its long running time. Separate the timestamp related tests from db_basic_test to avoid the potential issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6516 Test Plan: pass make asan_check Differential Revision: D20423922 Pulled By: zhichao-cao fbshipit-source-id: d6306f89a8de55b07bf57233e4554c09ef1fe23a	2020-03-13 11:37:15 -07:00
Cheng Chang	0d2c8e47e8	OpenForReadOnly is not supported in LITE mode (#6523 ) Summary: In DBLogicalBlockSizeCacheTest, do not test OpenForReadOnly in LITE mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6523 Test Plan: watch test for LITE mode Differential Revision: D20420321 Pulled By: cheng-chang fbshipit-source-id: e45bf6f2800206d6f8ce9af7308e76a08de80643	2020-03-12 14:13:59 -07:00
Levi Tamasi	c15e85bdcb	Move BlobDB related files under db/ to db/blob/ (#6519 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6519 Test Plan: ``` make all make check ``` Differential Revision: D20400691 Pulled By: ltamasi fbshipit-source-id: 20ef911cf1c2c92c7f71ef0b493f9be64f2eef94	2020-03-12 11:00:56 -07:00
Cheng Chang	2d9efc9ab2	Cache result of GetLogicalBufferSize in Linux (#6457 ) Summary: In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down. This PR introduces two new APIs: 1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes. 2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes. Other modifications: 1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms. 2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`. 3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457 Test Plan: 1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`. 2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`. `make check` Differential Revision: D20131243 Pulled By: cheng-chang fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04	2020-03-11 18:40:05 -07:00
sdong	331e6199df	Include more information in file lock failure (#6507 ) Summary: When users fail to open a DB with file lock failure, it is sometimes hard for users to debug. We now include the time the lock is acquired and the thread ID that acquired the lock, to help users debug problems like this. Default Env's thread ID is used. Since type of lockedFiles is changed, rename it to follow naming convention too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6507 Test Plan: Add a unit test and improve an existing test to validate the case. Differential Revision: D20378333 fbshipit-source-id: 312fe0e9733fd1d1e9969c321b90ce523cf4708a	2020-03-11 16:23:08 -07:00
Levi Tamasi	37a635cfe6	Disambiguate CustomFieldTags for the unity build (#6513 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6513 Test Plan: `make unity_test` Differential Revision: D20388919 Pulled By: ltamasi fbshipit-source-id: 88dbceab0723a54ee3939e1644e13dc9a4c70420	2020-03-11 14:45:12 -07:00
Adam Retter	8fc20ac468	Add ppc64le builds to Travis (#6144 ) Summary: Let's see how this goes... Pull Request resolved: https://github.com/facebook/rocksdb/pull/6144 Differential Revision: D20387515 Pulled By: pdillinger fbshipit-source-id: ba2669348c267141dfddff910b4c2224a22cbb38	2020-03-11 12:33:45 -07:00
Levi Tamasi	f5bc3b99d5	Split BlobFileState into an immutable and a mutable part (#6502 ) Summary: It's never too soon to refactor something. The patch splits the recently introduced (`VersionEdit` related) `BlobFileState` into two classes `BlobFileAddition` and `BlobFileGarbage`. The idea is that once blob files are closed, they are immutable, and the only thing that changes is the amount of garbage in them. In the new design, `BlobFileAddition` contains the immutable attributes (currently, the count and total size of all blobs, checksum method, and checksum value), while `BlobFileGarbage` contains the mutable GC-related information elements (count and total size of garbage blobs). This is a better fit for the GC logic and is more consistent with how SST files are handled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6502 Test Plan: `make check` Differential Revision: D20348352 Pulled By: ltamasi fbshipit-source-id: ff93f0121e80ab15e0e0a6525ba0d6af16a0e008	2020-03-10 17:27:26 -07:00
Yanqin Jin	fd1da22111	Support options.max_open_files != -1 with FIFO compaction (#6503 ) Summary: Allow user to specify options.max_open_files != -1 with FIFO compaction. If max_open_files != -1, not all table files are kept open. In the past, FIFO style compaction required all table files to be open in order to read file creation time from table properties. Later, we added file creation time to MANIFEST, making it possible to read file creation time without opening the file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503 Test Plan: make check Differential Revision: D20353758 Pulled By: riversand963 fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296	2020-03-09 18:45:06 -07:00
Yanqin Jin	d93812c9ae	Iterator with timestamp (#6255 ) Summary: Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests. Create an iterator with timestamp. ``` ... read_opts.timestamp = &ts; auto* iter = db->NewIterator(read_opts); // target is key without timestamp. for (iter->Seek(target); iter->Valid(); iter->Next()) {} for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {} delete iter; read_opts.timestamp = &ts1; // lower_bound and upper_bound are without timestamp. read_opts.iterate_lower_bound = &lower_bound; read_opts.iterate_upper_bound = &upper_bound; auto* iter1 = db->NewIterator(read_opts); // Do Seek or SeekToFirst() delete iter1; ``` Test plan (dev server) ``` $make check ``` Simple benchmarking (dev server) 1. The overhead introduced by this PR even when timestamp is disabled. key size: 16 bytes value size: 100 bytes Entries: 1000000 Data reside in main memory, and try to stress iterator. Repeated three times on master and this PR. - Seek without next ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 ``` master: 159047.0 ops/sec this PR: 158922.3 ops/sec (about 0.1% drop in throughput) - Seek and next 10 times ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10 ``` master: 109539.3 ops/sec this PR: 107519.7 ops/sec (2% drop in throughput) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255 Differential Revision: D19438227 Pulled By: riversand963 fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746	2020-03-06 16:24:27 -08:00
Yuqi Gu	e171a219d5	Fix db_wal_test::TruncateLastLogAfterRecoverWithoutFlush failure (#6437 ) Summary: The `TruncateLastLogAfterRecoverWithoutFlush` case depends on fallocate support in the underlying file system. On a file system that lacks this feature, such as zfs, it fails to allocate the predefined file size as the test case intends, so a check is added to detect fallocate support and skip the test if it is absent. The related work is done by JunHe77. Thanks! Signed-off-by: Yuqi Gu <yuqi.gu@arm.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6437 Differential Revision: D20145032 Pulled By: pdillinger fbshipit-source-id: c8b691dc508e95acfa2a004ddbc07e2faa76680d	2020-03-05 17:18:16 -08:00
Cheng Chang	afb97094ae	Skip high levels with no key falling in the range in CompactRange (#6482 ) Summary: In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped. This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with keys falling in the specified range, instead of always starting from L0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482 Test Plan: A new test ManualCompactionTest::SkipLevel is added. Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index. Also updated db_compaction_test and db_test to use correct range for full compaction. Differential Revision: D20251149 Pulled By: cheng-chang fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b	2020-03-04 20:15:25 -08:00
Zhichao Cao	e62fe50634	Introduce FaultInjectionTestFS to test fault File system instead of Env (#6414 ) Summary: In the current code base, we can use FaultInjectionTestEnv to simulate env issues such as file write/read errors, which is used in most of the tests. The PR https://github.com/facebook/rocksdb/issues/5761 introduces the File System as a new Env API. This PR implements FaultInjectionTestFS, which can be used to simulate File System issues such as IO errors. The user can specify any IOStatus error as input, so that the corresponding FS actions return that error to the caller. A set of ErrorHandlerFSTests is introduced for testing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414 Test Plan: pass make asan_check, pass error_handler_fs_test. Differential Revision: D20252421 Pulled By: zhichao-cao fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21	2020-03-04 12:35:05 -08:00
Kefu Chai	03dbd11ead	s/const auto/const auto&/ when doing loop (#6477 ) Summary: this silences following warning from clang-11 ``` rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:21: warning: loop variable 'newf' of type 'const std::pair<int, rocksdb::FileMetaData>' creates a copy from type 'const std::pair<int\ , rocksdb::FileMetaData>' [-Wrange-loop-analysis] for (const auto newf : c->edit()->GetNewFiles()) { ^ rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:10: note: use reference type 'const std::pair<int, rocksdb::FileMetaData> &' to prevent copying for (const auto newf : c->edit()->GetNewFiles()) { ^~~~~~~~~~~~~~~~~ & ``` Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6477 Differential Revision: D20211850 Pulled By: ltamasi fbshipit-source-id: 3e89e13a12bba79f1b934d46b7c4c0576cdafb01	2020-03-03 08:41:57 -08:00
sdong	17bef7d3a8	Fix data race of GetCreationTimeOfOldestFile() (#6473 ) Summary: When DBImpl::GetCreationTimeOfOldestFile() calls Version::GetCreationTimeOfOldestFile(), the version is not directly or indirectly referenced, so an event like compaction can race with the operation and cause DBImpl::GetCreationTimeOfOldestFile() to access deallocated data. This was caught by an ASAN run: ==268==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000b7d198 at pc 0x000018332913 bp 0x7f391510d310 sp 0x7f391510d308 READ of size 8 at 0x612000b7d198 thread T845 (store_load-33) SCARINESS: 51 (8-byte-read-heap-use-after-free) #0 0x18332912 in rocksdb::Version::GetCreationTimeOfOldestFile(unsigned long) rocksdb/src/db/version_set.cc:1488 #1 0x1803ddaa in rocksdb::DBImpl::GetCreationTimeOfOldestFile(unsigned long) rocksdb/src/db/db_impl/db_impl.cc:4499 #2 0xe24ca09 in rocksdb::StackableDB::GetCreationTimeOfOldestFile(unsigned long) rocksdb/utilities/stackable_db.h:392 ...... 0x612000b7d198 is located 216 bytes inside of 296-byte region [0x612000b7d0c0,0x612000b7d1e8) freed by thread T28 here: ...... #5 0x1832c73f in std::vector<rocksdb::FileMetaData, std::allocator<rocksdb::FileMetaData> >::~vector() third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/stl_vector.h:435 #6 0x1832c73f in rocksdb::VersionStorageInfo::~VersionStorageInfo() rocksdb/src/db/version_set.cc:734 #7 0x1832cf42 in rocksdb::Version::~Version() rocksdb/src/db/version_set.cc:758 #8 0x9d1bb5 in rocksdb::Version::Unref() rocksdb/src/db/version_set.cc:2869 #9 0x183e7631 in rocksdb::Compaction::~Compaction() rocksdb/src/db/compaction/compaction.cc:275 #10 0x9e6de6 in std::default_delete<rocksdb::Compaction>::operator()(rocksdb::Compaction) const third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:78 #11 0x9e6de6 in std::unique_ptr<rocksdb::Compaction, std::default_delete<rocksdb::Compaction> >::reset(rocksdb::Compaction) third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:376 #12 0x9e6de6 in rocksdb::DBImpl::BackgroundCompaction(bool, rocksdb::JobContext, rocksdb::LogBuffer, rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2826 #13 0x9ac3b8 in rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2320 #14 0x9abff7 in rocksdb::DBImpl::BGWorkCompaction(void*) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2096 ...... Fix the issue by referencing the super version and using the referenced version from it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6473 Test Plan: Run ASAN for all existing tests. Differential Revision: D20196416 fbshipit-source-id: 5f4a7918110fc7b8dd7841932d376bc9d1e59d6f	2020-03-02 16:37:01 -08:00
Zhichao Cao	8d73137ae8	Replace Directory with FSDirectory in DB (#6468 ) Summary: In the current code base, we can use Directory from Env to manage directories (e.g., Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduced the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from the IOStatus returned by FSDirectory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468 Test Plan: pass make asan_check Differential Revision: D20195261 Pulled By: zhichao-cao fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1	2020-03-02 16:16:26 -08:00
Huisheng Liu	904a60ff63	return timestamp from get (#6409 ) Summary: Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included. ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within margin of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. baseline (commit `72ee067b9`): 101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found) This PR: 100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found) ./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409 Differential Revision: D20200086 Pulled By: riversand963 fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50	2020-03-02 16:01:00 -08:00
Michael R. Crusoe	051696bf98	fix some spelling typos (#6464 ) Summary: Found from Debian's "Lintian" program Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464 Differential Revision: D20162862 Pulled By: zhichao-cao fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12	2020-02-28 14:14:03 -08:00
Andrew Kryczka	69679e7375	Fix range deletion tombstone ingestion with global seqno (#6429 ) Summary: Original author: jeffrey-xiao If we are writing a global seqno for an ingested file, the range tombstone metablock gets accessed and put into the cache during ingestion preparation. At the time, the global seqno of the ingested file has not yet been determined, so the cached block will not have a global seqno. When the file is ingested and we read its range tombstone metablock, it will be returned from the cache with no global seqno. In that case, we use the actual seqnos stored in the range tombstones, which are all zero, so the tombstones cover nothing. This commit removes global_seqno_ variable from Block. When iterating over a block, the global seqno for the block is determined by the iterator instead of storing this mutable attribute in Block. Additionally, this commit adds a regression test to check that keys are deleted when ingesting a file with a global seqno and range deletion tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429 Differential Revision: D19961563 Pulled By: ajkr fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7	2020-02-25 15:31:48 -08:00
Levi Tamasi	d87c10c6ab	Add blob file state to VersionEdit (#6416 ) Summary: BlobDB currently does not keep track of blob files: no records are written to the manifest when a blob file is added or removed, and upon opening a database, the list of blob files is populated simply based on the contents of the blob directory. This means that lost blob files cannot be detected at the moment. We plan to solve this issue by making blob files a part of `Version`; as a first step, this patch makes it possible to store information about blob files in `VersionEdit`. Currently, this information includes blob file number, total number and size of all blobs, and total number and size of garbage blobs. However, the format is extensible: new fields can be added in both a forward compatible and a forward incompatible manner if needed (similarly to `kNewFile4`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6416 Test Plan: `make check` Differential Revision: D19894234 Pulled By: ltamasi fbshipit-source-id: f9753e1f2aedf6dadb70c09b345207cb9c58c329	2020-02-24 18:39:53 -08:00
Yanqin Jin	890d87fadc	Some minor fix-ups (#6440 ) Summary: Clean up some code without any real change in functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440 Differential Revision: D20015891 Pulled By: riversand963 fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a	2020-02-21 15:09:56 -08:00
Zaiyang Li	fcec56e86c	Add function to set row cache on rocksdb_options_t (#6442 ) Summary: Adding a C API function to set `row_cache` on `rocksdb_options_t` as this functionality is missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6442 Differential Revision: D20036813 Pulled By: riversand963 fbshipit-source-id: c1fa95ea343345fbc1e57961d0d048e0e79be373	2020-02-21 11:12:42 -08:00
Yanqin Jin	362b8d4393	Fix MANIFEST name assignment (#6426 ) Summary: Currently, a new MANIFEST file is assigned a new file number when 1) no MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is not sufficient. There are cases when the caller explicitly specifies that a new MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true, and there are WAL files, then RocksDB will run into an issue during recovery. `DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs, `DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call `LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST. However, the manifest_file_number is wrong before this fix. Consequently, RocksDB opens an existing, non-empty file for append, effectively truncating the file to zero. If a crash occurs, then there will be data loss. Test Plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426 Test Plan: make check Differential Revision: D19951866 Pulled By: riversand963 fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da	2020-02-20 14:30:58 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for users to solve the problem, the RocksDB namespace is changed to a flag which can be overridden at build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
Andrew Kryczka	c6abe30ee3	Fix concurrent full purge and WAL recycling (#5900 ) Summary: We were removing the file from `log_recycle_files_` before renaming it with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside the DB mutex, it was possible for a concurrent full purge to sneak in and delete the file before it could be renamed. Consequently, `SwitchMemtable()` would fail and the DB would enter read-only mode. The fix is to hold the old file number in `log_recycle_files_` until after the file has been renamed. Full purge uses that list to decide which files to keep, so it can no longer delete a file pending recycling. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900 Test Plan: new unit test Differential Revision: D19771719 Pulled By: ajkr fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8	2020-02-18 13:54:13 -08:00
Andrew Kryczka	0f9dcb88b2	Return NotSupported from WriteBatchWithIndex::DeleteRange (#5393 ) Summary: As discovered in https://github.com/facebook/rocksdb/issues/5260 and https://github.com/facebook/rocksdb/issues/5392, reads on the indexed batch do not account for range tombstones. So, return `Status::NotSupported` from `WriteBatchWithIndex::DeleteRange` until we properly support it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5393 Test Plan: added unit test Differential Revision: D19912360 Pulled By: ajkr fbshipit-source-id: 0bbfc978ea015d64516ca708fce2429abba524cb	2020-02-18 11:18:25 -08:00
Cheng Chang	4034e289ad	Fail fast in paranoid mode when LoadTableHandlers fail during recovering (#6368 ) Summary: Previously, when recovering version set, LoadTableHandlers failures are ignored. If paranoid_checks is true, this failure should not be ignored, otherwise, the opened db might be in an inconsistent state. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6368 Test Plan: make check Differential Revision: D19713459 Pulled By: cheng-chang fbshipit-source-id: 68cb94f4f2cc43f8b024b14755193cd45cfcad55	2020-02-14 08:17:10 -08:00
Manuel Ung	908b1ee64e	WriteUnPrepared: Fix assertion during recovery (#6419 ) Summary: During recovery, multiple (un)prepared batches could exist in the same WAL record due to group commit. This breaks an assertion in `MemTableInserter::MarkBeginPrepare`. To fix, reset unprepared_batch_ to false after `MarkEndPrepare`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6419 Differential Revision: D19896148 Pulled By: lth fbshipit-source-id: b1a32ef88f775a0881264a18bd1a4a5b8c85eee3	2020-02-13 18:52:05 -08:00
sdong	ac8e89a443	Should flush and sync WAL when writing it in DB::Open() (#6417 ) Summary: A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417 Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix. Differential Revision: D19894537 fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e	2020-02-13 18:41:04 -08:00
Huisheng Liu	5138764eb5	Fix destroydb (#6308 ) Summary: On Windows, DestroyDB was observed to fail to remove the log file because the logger was still alive in the sst file manager and holding a handle to the log file. This fix makes sure the logger is released before attempting to clear the database directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6308 Differential Revision: D19818829 Pulled By: riversand963 fbshipit-source-id: 54c3e6859aadaaba4a49b3e851b73dc35ec7dc6a	2020-02-13 11:21:27 -08:00
Cheng Chang	a676001f95	Revert usage of Defer. (#6410 ) Summary: Seems like this caused the following test failure on AppVeyor: DBTest2.CrashInRecoveryMultipleCF c:\projects\rocksdb\db\db_test_util.cc(107): error: DestroyDB(dbname_, options) IO error: Failed to delete: C:\projects\rocksdb\db_tests\\testrocksdb-3112//db_test2_10791409581227174103/000013.sst: Access is denied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6410 Test Plan: Wait to see whether the AppVeyor test passes. Differential Revision: D19879872 Pulled By: cheng-chang fbshipit-source-id: 59a9c55ca88566e9210c0b715ecc45a4fd9afe26	2020-02-13 10:31:32 -08:00
anand76	3e49249d30	Ensure all MultiGet IO errors are propagated to user (#6403 ) Summary: Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403 Test Plan: Unit tests Tried running stress tests but couldn't repro the stress failure Differential Revision: D19846721 Pulled By: anand1976 fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4	2020-02-11 17:27:22 -08:00
anand76	35ed530d2c	Revert "Check KeyContext status in MultiGet (#6387 )" (#6401 ) Summary: This reverts commit `d70011bccc`. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401 Differential Revision: D19826623 Pulled By: anand1976 fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1	2020-02-10 22:23:36 -08:00
Cheng Chang	dafb568052	Add utility class Defer (#6382 ) Summary: Add a utility class `Defer` to defer the execution of a function until the Defer object goes out of scope. Used in VersionSet::ProcessManifestWrites as an example. The inline comments for class `Defer` have more details. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6382 Test Plan: `make defer_test version_set_test && ./defer_test && ./version_set_test` Differential Revision: D19797538 Pulled By: cheng-chang fbshipit-source-id: b1a9b7306e4fd4f48ec2ab55783caa561a315f0f	2020-02-10 17:59:47 -08:00
Levi Tamasi	cbf5f3be43	Do not move VersionEdit into AtomicGroupReadBuffer (#6400 ) Summary: https://github.com/facebook/rocksdb/pull/6383 surfaced an issue with `VersionSet`/`ReactiveVersionSet` and `AtomicGroupReadBuffer::AddEdit` (which was added in https://github.com/facebook/rocksdb/pull/5411): `AddEdit` moves the `VersionEdit` passed to it into `replay_buffer_`, however, the client `VersionSet` classes keep using it afterwards. This seemed to work before the refactoring but it really did not: since `VersionEdit` used to have a user-declared destructor, no move constructor/move assignment operator was generated, and the `move` in `AddEdit` was really a copy. The patch makes the copy explicit. Note: it should be possible to rework this logic so that we can get away with the move but for now, this should fix the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6400 Test Plan: `make check` `make analyze` Differential Revision: D19824466 Pulled By: ltamasi fbshipit-source-id: f38033967daf2a39c78dcd6e12978bafe37632b4	2020-02-10 17:15:42 -08:00
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generates the checksum for each block and verifies the checksum at usage. This PR enables SST file checksum. After an SST file is generated by Flush or Compaction, RocksDB generates the SST file checksum and stores the checksum value and checksum method name in the vs_info and MANIFEST as part of the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that users can plug in their own SST file checksum calculation method by overriding the SstFileChecksum class. The checksum information includes a uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that users can dump out a list of file checksum information from MANIFEST. If the user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	2020-02-10 15:52:52 -08:00
Yutian Li	2e0159ec9e	Add error status for no_slowdown & low priority write (#6396 ) Summary: When `no_slowdown` is enabled, it returns `Status::Incomplete("Write stall")` if a stall would occur. This patch adds descriptive text for when `no_slowdown` and `low_pri` are enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6396 Differential Revision: D19808978 Pulled By: cheng-chang fbshipit-source-id: a53b0d25ed414c821a086531e0222027f925e627	2020-02-10 12:33:16 -08:00
anand76	d70011bccc	Check KeyContext status in MultiGet (#6387 ) Summary: Currently, any IO errors and checksum mismatches while reading data blocks are being ignored by the batched MultiGet. It's only looking at the GetContext state. Fix that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387 Test Plan: Add unit tests Differential Revision: D19799819 Pulled By: anand1976 fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969	2020-02-07 16:48:16 -08:00
Robert Yang	4e457278fa	db/write_thread.cc: Initialize state (#6275 ) Summary: Fixed an error when compiled with -Og: db/write_thread.cc:183:14: error: 'state' may be used uninitialized in this function [-Werror=maybe-uninitialized] Signed-off-by: Robert Yang <liezhi.yang@windriver.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6275 Differential Revision: D19381755 fbshipit-source-id: a90bf3cd4a7248d9d71219e918fc6253deb97e3c	2020-02-07 15:20:38 -08:00
Levi Tamasi	752c87af78	Clean up VersionEdit a bit (#6383 ) Summary: This is a bunch of small improvements to `VersionEdit`. Namely, the patch * Makes the names and order of variables, methods, and code chunks related to the various information elements more consistent, and adds missing getters for the sake of completeness. * Initializes previously uninitialized stack variables. * Marks all getters const to improve const correctness. * Adds in-class initializers and removes the default ctor that would create an object with uninitialized built-in fields and call `Clear` afterwards. * Adds a new type alias for new files and changes the existing `typedef` for deleted files into a type alias as well. * Makes the helper method `DecodeNewFile4From` private. * Switches from long-winded iterator syntax to range based loops in a couple of places. * Fixes a couple of assignments where an integer 0 was assigned to boolean members. * Fixes a getter which used to return a `const std::string` instead of the intended `const std::string&`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6383 Test Plan: make check Differential Revision: D19780537 Pulled By: ltamasi fbshipit-source-id: b0b4f09fee0ec0e7c7b7a6d76bfe5346e91824d0	2020-02-07 13:27:06 -08:00
Cheng Chang	5f478b9f75	Remove outdated comment (#6379 ) Summary: Since the logic for handling IDENTITY file is now inside `NewDB`, the comment above `NewDB` is no longer relevant. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6379 Test Plan: not needed Differential Revision: D19795440 Pulled By: cheng-chang fbshipit-source-id: 0b1cca87ac6d92474701c46aa4c8d4d708bfa19b	2020-02-07 13:18:43 -08:00
Cheng Chang	0a74e1b958	Add status checks during DB::Open (#6380 ) Summary: Several statuses were not checked during DB::Open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6380 Test Plan: make check Differential Revision: D19780237 Pulled By: cheng-chang fbshipit-source-id: c8d189d20344bd1607890dd1449345bda2ef96b9	2020-02-07 12:32:09 -08:00
Yanqin Jin	f361cedf06	Atomic flush rollback once on failure (#6385 ) Summary: Before this fix, atomic flush codepath may hit an assertion failure on a specific failure case. If all flush jobs within an atomic flush succeed (they do not write to MANIFEST), but batch writing version edits to MANIFEST fails, then `cfd->imm()->RollbackMemTableFlush()` will be called twice, and the second invocation hits assertion failure `assert(m->flush_in_progress_)` since the first invocation resets the variable `flush_in_progress_` to false already. Test plan (dev server): ``` ./db_flush_test --gtest_filter=DBAtomicFlushTest/DBAtomicFlushTest.RollbackAfterFailToInstallResults make check ``` Both must succeed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6385 Differential Revision: D19782943 Pulled By: riversand963 fbshipit-source-id: 84e1592625e729d1b70fdc8479959387a74cb121	2020-02-07 10:52:10 -08:00
Cheng Chang	f5f79f01a2	Be able to read compatible leveldb sst files (#6370 ) Summary: In `DBSSTTest.SSTsWithLdbSuffixHandling`, some sst files are renamed to ldb files, the original intention of the test is to test that the ldb files can be loaded along with the sst files. The original test checks this by `ASSERT_NE("NOT_FOUND", Get(Key(k)))`, but the problem is `Get(Key(k))` returns IO error due to path not found instead of NOT_FOUND, so the success of ASSERT_NE does not mean the key can be retrieved. This PR updates the test to make sure Get(Key(k)) returns the original value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6370 Test Plan: make db_sst_test && ./db_sst_test Differential Revision: D19726278 Pulled By: cheng-chang fbshipit-source-id: 993127f56457b315e669af4eeb92d6f956b7a4b7	2020-02-06 10:15:44 -08:00
Mike Kolupaev	1ed7d9b1b5	Avoid lots of calls to Env::GetFileSize() in SstFileManagerImpl when opening DB (#6363 ) Summary: Before this PR it calls GetFileSize() once for each sst file in the DB. This can take a long time if there are tens of thousands of sst files (e.g. in thousands of column families), and even longer if Env is talking to some remote service rather than local filesystem. This PR makes DB::Open() use sst file sizes that are already known from manifest (typically almost all files in the DB) and only call GetFileSize() for non-sst or obsolete files. Note that GetFileSize() is also called and checked against manifest in CheckConsistency(), so the calls in SstFileManagerImpl were completely redundant. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6363 Test Plan: deployed to a test cluster, looked at a dump of Env calls (from a custom instrumented Env) - no more thousands of GetFileSize()s. Differential Revision: D19702509 Pulled By: al13n321 fbshipit-source-id: 99f8110620cb2e9d0c092dfcdbb11f3af4ff8b73	2020-02-04 13:41:53 -08:00
sdong	69c8614815	Avoid to get manifest file size when recovering from it. (#6369 ) Summary: Right now RocksDB gets manifest file size before recovering from it. The information is available in LogReader. Use it instead to prevent one file system call. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6369 Test Plan: Run all existing tests Differential Revision: D19714872 fbshipit-source-id: 0144be324d403c99e3da875ea2feccc8f64e883d	2020-02-04 11:39:23 -08:00
Mike Kolupaev	637e64b9ac	Add an option to prevent DB::Open() from querying sizes of all sst files (#6353 ) Summary: When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open. If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be really slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes. We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353 Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble. Differential Revision: D19656425 Pulled By: al13n321 fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82	2020-02-04 01:27:26 -08:00
anand76	7330ec0ff1	Fix a test failure in error_handler_test (#6367 ) Summary: Fix an intermittent failure in DBErrorHandlingTest.CompactionManifestWriteError due to a race between background error recovery and the main test thread calling TEST_WaitForCompact(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6367 Test Plan: Run the test using gtest_parallel Differential Revision: D19713802 Pulled By: anand1976 fbshipit-source-id: 29e35dc26e0984fe8334c083e059f4fa1f335d68	2020-02-03 18:16:52 -08:00
sdong	f195d8d523	Use ReadFileToString() to get content from IDENTITY file (#6365 ) Summary: Right now when reading IDENTITY file, we use a very similar logic as ReadFileToString() while it does an extra file size check, which may be expensive in some file systems. There is no reason to duplicate the logic. Use ReadFileToString() instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6365 Test Plan: Run all existing tests. Differential Revision: D19709399 fbshipit-source-id: 3bac31f3b2471f98a0d2694278b41e9cd34040fe	2020-02-03 17:40:49 -08:00
sdong	36c504be17	Avoid create directory for every column families (#6358 ) Summary: A relatively recent regression causes the DB directory to be created and opened once for every CF, unless the CF has a private directory. This doesn't scale well with a large number of column families. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6358 Test Plan: Run all existing tests and see them pass. strace with db_bench --num_column_families and observe it doesn't open directory for number of column families. Differential Revision: D19675141 fbshipit-source-id: da01d9216f1dae3f03d4064fbd88ce71245bd9be	2020-02-03 14:13:39 -08:00
Huisheng Liu	eb4d6af5ae	Error handler test fix (#6266 ) Summary: MultiDBCompactionError fails when it verifies the number of files on level 0 and level 1 without waiting for compaction to finish. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6266 Differential Revision: D19701639 Pulled By: riversand963 fbshipit-source-id: e96d511bcde705075f073e0b550cebcd2ecfccdc	2020-02-03 13:32:53 -08:00
sdong	800d24ddc5	Fix DBTest2.ChangePrefixExtractor LITE build (#6356 ) Summary: DBTest2.ChangePrefixExtractor fails in LITE build because LITE build doesn't support adaptive build. Fix it by removing the stats check but only check correctness. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6356 Test Plan: Run the test with both of LITE and non-LITE build. Differential Revision: D19669537 fbshipit-source-id: 6d7dd6c8a79f18e80ca1636864b9c71922030d8e	2020-01-31 15:44:14 -08:00
sdong	ec496347bc	Add a unit test for prefix extractor changes (#6323 ) Summary: Add a unit test for prefix extractor change, including a check that fails due to a bug. Also comment out the partitioned filter case which will fail the test too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6323 Test Plan: Run the test and it passes (and fails if the SeekForPrev() part is uncommented) Differential Revision: D19509744 fbshipit-source-id: 678202ca97b5503e9de73b54b90de9e5ba822b72	2020-01-31 11:02:03 -08:00
Maysam Yabandeh	3316d29221	Disable recycle_log_file_num when it is incompatible with recovery mode (#6351 ) Summary: Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351 Differential Revision: D19648931 Pulled By: maysamyabandeh fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7	2020-01-31 07:28:30 -08:00
Yanqin Jin	f2fbc5d668	Shorten certain test names to avoid infra failure (#6352 ) Summary: Unit test names, together with other components, are used to create log files during some internal testing. Overly long names cause infra failure due to file names being too long. Look for internal tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6352 Differential Revision: D19649307 Pulled By: riversand963 fbshipit-source-id: 6f29de096e33c0eaa87d9c8702f810eda50059e7	2020-01-30 23:10:24 -08:00
anand76	fb05b5a652	Force a new manifest file if append to current one fails (#6331 ) Summary: Fix for issue https://github.com/facebook/rocksdb/issues/6316 When an append/sync of the manifest file fails due to an IO error such as NoSpace, we don't always put the DB in read-only mode. This is true for flush and compactions, as well as foreground operations such as column family add/drop, CompactFiles etc. Subsequent changes to the DB will be recorded in the same manifest file, which would have a corrupted record in the middle due to the previous failure. On next DB::Open(), it will fail to process the full manifest and data will be lost. To fix this, we reset VersionSet::descriptor_log_ on append/sync failure, which will force a new manifest file to be written on the next append. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6331 Test Plan: Add new unit tests in error_handler_test.cc Differential Revision: D19632951 Pulled By: anand1976 fbshipit-source-id: 68d527cb6e59a94cbbbf9f5a17a7f464381d51e3	2020-01-30 10:56:29 -08:00
sdong	71874c5aaf	Fix LITE build with DBTest2.AutoPrefixMode1 (#6346 ) Summary: DBTest2.AutoPrefixMode1 doesn't pass because auto prefix mode is not supported there. Fix it by disabling the test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6346 Test Plan: Run DBTest2.AutoPrefixMode1 in lite mode Differential Revision: D19627486 fbshipit-source-id: fbde75260aeecb7e6fc406e09c19a71a95aa5f08	2020-01-29 16:43:42 -08:00
sdong	02ac6c9a3c	Fix db_bloom_filter_test clang LITE build (#6340 ) Summary: db_bloom_filter_test breaks with clang LITE build with the following message: db/db_bloom_filter_test.cc:23:29: error: unused variable 'kPlainTable' [-Werror,-Wunused-const-variable] static constexpr PseudoMode kPlainTable = -1; ^ Fix it by moving the declaration out of LITE build Pull Request resolved: https://github.com/facebook/rocksdb/pull/6340 Test Plan: USE_CLANG=1 LITE=1 make db_bloom_filter_test and without LITE=1 Differential Revision: D19609834 fbshipit-source-id: 0e88f5c6759238a94f9880d84c785ac18e7cdd7e	2020-01-29 12:57:48 -08:00
Maysam Yabandeh	2f973ca96e	Double Crash in kPointInTimeRecovery with TransactionDB (#6313 ) Summary: In WritePrepared there could be a gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery, which assumes the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery. Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313 Differential Revision: D19458291 Pulled By: maysamyabandeh fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d	2020-01-29 11:40:55 -08:00
sdong	8f2bee6747	Add ReadOptions.auto_prefix_mode (#6314 ) Summary: Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extractor changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it by skipping the prefix check. (2) block-based filter is not handled properly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314 Test Plan: (1) add a unit test; (2) add the check to stress test and run it to see whether it can pass at least one run. Differential Revision: D19458717 fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce	2020-01-28 14:44:05 -08:00
Sagar Vemuri	4f6c86226c	Use the same oldest ancestor time in table properties and manifest Summary: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions passed 96 / 100 times. With the fix: all runs (tried 100, 1000, 10000) succeed. ``` $ TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=1000 [1000/1000] DBCompactionTest.LevelTtlCascadingCompactions (1895 ms) ``` Test Plan: Build: ``` COMPILE_WITH_TSAN=1 make db_compaction_test -j100 ``` Without the fix: a few runs out of 100 fail: ``` $ TEST_TMPDIR=/dev/shm KEEP_DB=1 ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=100 ... ... Note: Google Test filter = DBCompactionTest.LevelTtlCascadingCompactions [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBCompactionTest [ RUN ] DBCompactionTest.LevelTtlCascadingCompactions db/db_compaction_test.cc:3687: Failure Expected equality of these values: oldest_time Which is: 1580155869 level_to_files[6][0].oldest_ancester_time Which is: 1580155870 DB is still at /dev/shm//db_compaction_test_6337001442947696266 [ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions (1432 ms) [----------] 1 test from DBCompactionTest (1432 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (1433 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions 1 FAILED TEST [80/100] DBCompactionTest.LevelTtlCascadingCompactions returned/aborted with exit code 1 (1489 ms) [100/100] DBCompactionTest.LevelTtlCascadingCompactions (1522 ms) FAILED TESTS (4/100): 1419 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try #90) 1434 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try #84) 1457 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try #82) 1489 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try #74) ``` Differential Revision: D19587040 Pulled By: sagar0 fbshipit-source-id: 11191ae9940837643bff47ebe18b299b4be3d950	2020-01-27 19:58:53 -08:00
Andrew Kryczka	5b33cfa1e3	fix `WriteBufferManager` flush log message (#6335 ) Summary: It chooses the oldest memtable, not the largest one. This is an important difference for users whose CFs receive non-uniform write rates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6335 Differential Revision: D19588865 Pulled By: maysamyabandeh fbshipit-source-id: 62ad4325b0182f5f27858584cd73fd5978fb2cec	2020-01-27 15:49:22 -08:00
sdong	f10f135938	Fix regression bug of hash index with iterator total order seek (#6328 ) Summary: https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash. Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked Also, a null pointer check is added so that a bug like this won't cause segfault in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328 Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass. Differential Revision: D19586751 fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42	2020-01-27 15:44:54 -08:00
Levi Tamasi	f34782a67d	Fix the "records dropped" statistics (#6325 ) Summary: The earlier code used two conflicting definitions for the number of input records going into a compaction, one based on the `rocksdb.num.entries` table property and one based on `CompactionIterationStats`. The first one is correct and in line with how output records are counted, while the second one incorrectly ignores input records in various cases when the `CompactionIterator` advances or reseeks the input iterator (this can happen, amongst other cases, when dealing with `SingleDelete`s, regular `Delete`s, `Merge`s, and compaction filters). This can result in the code undercounting the input records and computing an incorrect value for "records dropped" during the compaction. The patch fixes this by switching over to the correct (table property based) input record count for "records dropped". Pull Request resolved: https://github.com/facebook/rocksdb/pull/6325 Test Plan: Tested using `make check` and `db_bench`. Differential Revision: D19525491 Pulled By: ltamasi fbshipit-source-id: 4340b0b2f41546db8e356db70ca02199e48fa636	2020-01-23 15:27:22 -08:00
anand76	0672a6db64	Fix queue manipulation in WriteThread::BeginWriteStall() (#6322 ) Summary: When there is a write stall, the active write group leader calls ```BeginWriteStall()``` to walk the queue of writers and remove any with the ```no_slowdown``` option set. There was a bug in the code which updated the back pointer but not the forward pointer (```link_newer```), corrupting the list and causing some threads to wait forever. This PR fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6322 Test Plan: Add a unit test in db_write_test Differential Revision: D19538313 Pulled By: anand1976 fbshipit-source-id: 6fbed819e594913f435886606f5d36f74f235c3a	2020-01-23 14:01:28 -08:00
matthewvon	e6e8b9e871	Correct pragma once problem with Bazel on Windows (#6321 ) Summary: This is a simple edit to have two #include file paths be consistent within range_del_aggregator.{h,cc} with everywhere else. The impact of this inconsistency is that it actually breaks a Bazel based build on the Windows platform. The same pragma once failure occurs with both Windows Visual C++ 2019 and clang for Windows 9.0. Bazel's "sandboxing" of the builds causes both compilers to not properly recognize "rocksdb/types.h" and "include/rocksdb/types.h" to be the same file (also comparator.h). My guess is that the backslash versus forward slash mixing within path names is the underlying issue. But, everything builds fine once the include paths in these two source files are consistent with the rest of the repository. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6321 Differential Revision: D19506585 Pulled By: ltamasi fbshipit-source-id: 294c346607edc433ab99eaabc9c880ee7426817a	2020-01-21 16:12:43 -08:00
Levi Tamasi	d305f13e21	Make DBCompactionTest.SkipStatsUpdateTest more robust (#6306 ) Summary: Currently, this test case tries to infer whether `VersionStorageInfo::UpdateAccumulatedStats` was called during open by checking the number of files opened against an arbitrary threshold (10). This makes the test brittle and results in sporadic failures. The patch changes the test case to use sync points to directly test whether `UpdateAccumulatedStats` was called. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6306 Test Plan: `make check` Differential Revision: D19439544 Pulled By: ltamasi fbshipit-source-id: ceb7adf578222636a0f51740872d0278cd1a914f	2020-01-21 12:55:55 -08:00
chenyou-fdu	931876e86e	Separate enable-WAL and disable-WAL writer to avoid unwanted data in log files (#6290 ) Summary: When writes are issued concurrently, some write operations may have WAL enabled and others disabled. However, the data from write operations with WAL disabled is still logged into log files, which leads to extra disk writes/syncs even though we do not want any guarantee for this part of the data. Details can be found in https://github.com/facebook/rocksdb/issues/6280. This PR avoids mixing the two types in a write group. The advantage is simpler reasoning about the write group content. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6290 Differential Revision: D19448598 Pulled By: maysamyabandeh fbshipit-source-id: 3d990a0f79a78ea1bfc90773f6ebafc1884c20de	2020-01-17 15:54:55 -08:00
Matt Bell	7e5b04d04f	Expose atomic flush option in C API (#6307 ) Summary: This PR adds a `rocksdb_options_set_atomic_flush` function to the C API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6307 Differential Revision: D19451313 Pulled By: ltamasi fbshipit-source-id: 750495642ef55b1ea7e13477f85c38cd6574849c	2020-01-17 12:57:48 -08:00
sdong	d87cffaea4	Fix another bug caused by recent hash index fix (#6305 ) Summary: Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305 Test Plan: A problematic DB used to return NotFound incorrectly, and is now able to return the correct result. Will try to construct a unit test too. Differential Revision: D19438925 fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217	2020-01-17 01:41:04 -08:00
Levi Tamasi	73f65b457e	Adjust thread pool sizes when setting max_background_jobs dynamically (#6300 ) Summary: https://github.com/facebook/rocksdb/pull/2205 introduced a new configuration option called `max_background_jobs`, superseding the earlier options `max_background_flushes` and `max_background_compactions`. However, unlike `max_background_compactions`, setting `max_background_jobs` dynamically through the `SetDBOptions` interface does not adjust the size of the thread pools (see https://github.com/facebook/rocksdb/issues/6298). The patch fixes this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6300 Test Plan: Extended unit test. Differential Revision: D19430899 Pulled By: ltamasi fbshipit-source-id: 704006605b3c13c3d1b997ccc0831ee369721074	2020-01-16 14:35:10 -08:00
sdong	f8b5ef85ec	Fix a bug caused by recent fix of Prefix Hash (#6302 ) Summary: Recent fix to Prefix Hash https://github.com/facebook/rocksdb/pull/6292 caused a bug that the newly created NotFound status in hash index is never reset. This causes reseek or implicit reseek to return wrong results sometimes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302 Test Plan: Add a unit test that would fail. Without the fix, crash test with hash test would fail in several seconds. With the fix, it will run about several minutes before failing with another failure. Differential Revision: D19424572 fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566	2020-01-16 10:47:20 -08:00
sdong	d2b4d42d4b	Fix kHashSearch bug with SeekForPrev (#6297 ) Summary: When prefix is enabled, the expected behavior when the prefix of the target does not exist is for Seek to seek to any key larger than the target and for SeekForPrev to seek to any key less than the target. Currently, the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: i) a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in the SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status in cases where the prefix does not exist. SeekForPrev in BlockBasedTable accordingly calls SeekToFirst instead of SeekToLast on the index iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297 Test Plan: SeekForPrev of a non-existing prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix. Differential Revision: D19404695 fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3	2020-01-15 14:28:39 -08:00
sdong	76c117b24b	Fix LITE test build broken by recent commit (#6295 ) Summary: A recent commit adds a unit test that uses a function not available in LITE build. Fix it by avoiding the call Pull Request resolved: https://github.com/facebook/rocksdb/pull/6295 Test Plan: Run the test in LITE build and see it passes. Differential Revision: D19395678 fbshipit-source-id: 37b42835bae02511630d80f7cafb1179401bc033	2020-01-14 13:17:04 -08:00
sdong	894c6d21af	Bug when multiple files at one level contain the same smallest key (#6285 ) Summary: The fractional cascading index is not correctly generated when two files at the same level contain the same smallest or largest user key. The result would be that it would hit an assertion in debug mode and lower level files might be skipped. This might cause wrong results when the same user keys are of merge operands and Get() is called using the exact user key. In that case, the lower files would need to be further checked. The fix is to fix the fractional cascading index. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6285 Test Plan: Add a unit test which would cause the assertion which would be fixed. Differential Revision: D19358426 fbshipit-source-id: 39b2b1558075fd95e99491d462a67f9f2298c48e	2020-01-13 16:27:42 -08:00
Qinfan Wu	6733be033e	More const pointers in C API (#6283 ) Summary: This makes it easier to call the functions from Rust as otherwise they require mutable types. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6283 Differential Revision: D19349991 Pulled By: wqfish fbshipit-source-id: e8da7a75efe8cd97757baef8ca844a054f2519b4	2020-01-10 19:27:09 -08:00
Sagar Vemuri	cfa585611d	Consider all compaction input files to compute the oldest ancestor time (#6279 ) Summary: Look at all compaction input files to compute the oldest ancestor time. In https://github.com/facebook/rocksdb/issues/5992 we changed how creation_time (aka oldest-ancestor-time) table property of compaction output files is computed from max(creation-time-of-all-compaction-inputs) to min(creation-time-of-all-inputs). This exposed a bug where, during compaction, the creation_time:s of only the L0 compaction inputs were being looked at, and all other input levels were being ignored. This PR fixes the issue. Some TTL compactions when using Level-Style compactions might not have run due to this bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6279 Test Plan: Enhanced the unit tests to validate that the correct time is propagated to the compaction outputs. Differential Revision: D19337812 Pulled By: sagar0 fbshipit-source-id: edf8a72f11e405e93032ff5f45590816debe0bb4	2020-01-10 19:02:42 -08:00
Maysam Yabandeh	eff5e076f5	unordered_write incompatible with max_successive_merges (#6284 ) Summary: unordered_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led stress tests to fail when this combination is tried in ::SetOptions. The patch fixes that and also reverts the changes performed by https://github.com/facebook/rocksdb/pull/6254, in which max_successive_merges was mistakenly declared incompatible with unordered_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6284 Differential Revision: D19356115 Pulled By: maysamyabandeh fbshipit-source-id: f06dadec777622bd75f267361c022735cf8cecb6	2020-01-10 16:53:19 -08:00
Yanqin Jin	6a9989381f	Fix compilation under LITE (#6277 ) Summary: Fix compilation under LITE by putting `#ifndef ROCKSDB_LITE` around a code block. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6277 Differential Revision: D19334157 Pulled By: riversand963 fbshipit-source-id: 947111ed68aa550f5ea424b216c1442a8af9e32b	2020-01-09 15:57:39 -08:00
Yanqin Jin	cfd9732f65	Remove inaccurate code comment (#6274 ) Summary: Remove a comment. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6274 Differential Revision: D19323151 Pulled By: riversand963 fbshipit-source-id: d0d804d6882edcd94e35544ef45578b32ff1caae	2020-01-08 17:51:42 -08:00
Huisheng Liu	e5b476f551	Update file indexer to take timestamp into consideration (#6205 ) Summary: Exclude timestamp in key comparison during boundary calculation to avoid key versions being excluded. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6205 Differential Revision: D19166765 Pulled By: riversand963 fbshipit-source-id: bbe08816fef8de349a83ebd59a595ad844021f24	2020-01-08 16:31:23 -08:00
Yanqin Jin	a8b1085ae2	Fix test in LITE mode (#6267 ) Summary: Currently, the recently-added test DBTest2.SwitchMemtableRaceWithNewManifest fails in LITE mode since SetOptions() returns "Not supported". I do not want to put `#ifndef ROCKSDB_LITE` because it reduces test coverage. Instead, just trigger compaction on a different column family. The bg compaction thread calling LogAndApply() may race with thread calling SwitchMemtable(). Test Plan (dev server): make check OPT=-DROCKSDB_LITE make check or run DBTest2.SwitchMemtableRaceWithNewManifest 100 times. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6267 Differential Revision: D19301309 Pulled By: riversand963 fbshipit-source-id: 88cedcca2f985968ed3bb234d324ffa2aa04ca50	2020-01-07 13:47:03 -08:00
Yanqin Jin	bce5189f4d	Fix error message (#6264 ) Summary: Fix an error message when CURRENT is not found. Test plan (dev server) ``` make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6264 Differential Revision: D19300699 Pulled By: riversand963 fbshipit-source-id: 303fa206386a125960ecca1dbdeff07422690caf	2020-01-07 12:32:20 -08:00
Connor1996	3e26a94ba1	Add oldest snapshot sequence property (#6228 ) Summary: Add oldest snapshot sequence property, so we can use `db.GetProperty("rocksdb.oldest-snapshot-sequence")` to get the sequence number of the oldest snapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6228 Differential Revision: D19264145 Pulled By: maysamyabandeh fbshipit-source-id: 67fbe5304d89cbc475bd404e30d1299f7b11c010	2020-01-07 08:36:44 -08:00
Yanqin Jin	1aaa145877	Fix a data race for cfd->log_number_ (#6249 ) Summary: A thread calling LogAndApply may release db mutex when calling WriteCurrentStateToManifest() which reads cfd->log_number_. Another thread can call SwitchMemtable() and writes to cfd->log_number_. Solution is to cache the cfd->log_number_ before releasing mutex in LogAndApply. Test Plan (on devserver): ``` $COMPILE_WITH_TSAN=1 make db_stress $./db_stress --acquire_snapshot_one_in=10000 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=zstd --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_pipelined_write=0 --flush_one_in=1000000 --format_version=5 --get_live_files_and_wal_files_one_in=1000000 --index_block_restart_interval=5 --index_type=0 --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=0 --nooverwritepercent=1 --open_files=500000 --ops_per_thread=100000000 --partition_filters=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --set_options_one_in=10000 --snapshot_hold_ops=100000 --subcompactions=2 --sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 
--use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35 ``` Then repeat the following multiple times, e.g. 100 after compiling with tsan. ``` $./db_test2 --gtest_filter=DBTest2.SwitchMemtableRaceWithNewManifest ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6249 Differential Revision: D19235077 Pulled By: riversand963 fbshipit-source-id: 79467b52f48739ce7c27e440caa2447a40653173	2020-01-06 20:09:51 -08:00
Qinfan Wu	edaaa1fff2	Add range delete function to C-API (#6259 ) Summary: It seems that the C-API doesn't expose the range delete functionality at the moment, so add the API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6259 Differential Revision: D19290320 Pulled By: pdillinger fbshipit-source-id: 3f403a4c3446d2042d55f1ece7cdc9c040f40c27	2020-01-06 10:46:21 -08:00
Maysam Yabandeh	28e5a9a9fb	Increase max_log_size in FlushJob to 1024 bytes (#6258 ) Summary: When measure_io_stats_ is enabled, the volume of logging is beyond the default limit of 512 size. The patch allows the EventLoggerStream to change the limit, and also sets it to 1024 for FlushJob. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6258 Differential Revision: D19279269 Pulled By: maysamyabandeh fbshipit-source-id: 3fb5d468dad488f289ac99d713378177eb7504d6	2020-01-06 10:16:52 -08:00
Maysam Yabandeh	48a678b7c9	Prevent an incompatible combination of options (#6254 ) Summary: allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led stress tests to fail when this combination is tried in ::SetOptions. The patch fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254 Differential Revision: D19265819 Pulled By: maysamyabandeh fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179	2020-01-02 16:15:06 -08:00
sdong	ef91894798	Fix potential overflow in CalculateSSTWriteHint() (#6212 ) Summary: level passed into ColumnFamilyData::CalculateSSTWriteHint() can be smaller than base_level in current version, which would cause overflow. We see ubsan complains: db/compaction/compaction_job.cc:1511:39: runtime error: load of value 4294967295, which is not a valid value for type 'Env::WriteLifeTimeHint' and I hope this commit fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6212 Test Plan: Run existing tests and see them to pass. Differential Revision: D19168442 fbshipit-source-id: bf8fd86f85478ecfa7556db46dc3242de8c83dc9	2019-12-18 17:04:15 -08:00
Jermy Li	f453bcb40d	Add unit tests for concurrent CF iteration and drop (#6180 ) Summary: improve https://github.com/facebook/rocksdb/issues/6147 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6180 Differential Revision: D19148936 fbshipit-source-id: f691c9879fd51d54e96c1a99670cf85ca4485a89	2019-12-18 11:54:35 -08:00
Mike Kolupaev	ce63eda6f0	Fix use-after-free and double-deleting files in BackgroundCallPurge() (#6193 ) Summary: The bad code was: ``` mutex.Lock(); // `mutex` protects `container` for (auto& x : container) { mutex.Unlock(); // do stuff to x mutex.Lock(); } ``` It's incorrect because both `x` and the iterator may become invalid if another thread modifies the container while this thread is not holding the mutex. Broken by https://github.com/facebook/rocksdb/pull/5796 - it replaced a `while (!container.empty())` loop with a `for (auto x : container)`. (RocksDB code does a lot of such unlocking+re-locking of mutexes, and this type of bugs comes up a lot :/ ) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6193 Test Plan: Ran some logdevice integration tests that were crashing without this fix. Differential Revision: D19116874 Pulled By: al13n321 fbshipit-source-id: 9672bc4227c1b68f46f7436db2b96811adb8c703	2019-12-17 20:08:56 -08:00
Adam Retter	2d16709487	Small tidy and speed up of the travis build (#6181 ) Summary: Cuts about 30-60 seconds from each Travis Linux build, and about 15 minutes from each macOS build Pull Request resolved: https://github.com/facebook/rocksdb/pull/6181 Differential Revision: D19098357 Pulled By: pdillinger fbshipit-source-id: 863dd1ab09076ad9b03c2b7914908359628315ae	2019-12-17 13:56:45 -08:00
解轶伦	39fcaf8246	delete superversions in BackgroundCallPurge (#6146 ) Summary: I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB). Then I found "delete sv" in ~SuperVersion() takes the time. The backtrace looks like this: DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() -> DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion() I think it's better to delete in a background thread, please review it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146 Differential Revision: D18972066 fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c	2019-12-17 13:22:57 -08:00
Levi Tamasi	02aa22957a	Set CompactionIterator::valid_ to false when PrepareBlobOutput indicates error Summary: With https://github.com/facebook/rocksdb/pull/6121, errors returned by `PrepareBlobOutput` result in `CompactionIterator::status_` being set to `Corruption` or `IOError` as appropriate, however, `valid_` is not set to `false`. The error is eventually propagated in `CompactionJob::ProcessKeyValueCompaction` but only after the main loop completes. Setting `valid_` to `false` upon errors enables us to terminate the loop early and fail the compaction sooner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6170 Test Plan: Ran `make check` and used `db_bench` in BlobDB mode. fbshipit-source-id: a2ca88a3ca71115e2605bd34a4c795d8a28bef27	2019-12-17 10:20:16 -08:00
Yanqin Jin	7678cf2df7	Use Env::LoadEnv to create custom Env objects (#6196 ) Summary: As title. Previous assumption was that the underlying lib can always return a shared_ptr<Env>. This is too strong. Therefore, we use Env::LoadEnv to relax it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6196 Test Plan: make check Differential Revision: D19133199 Pulled By: riversand963 fbshipit-source-id: c83a0c02a42610d077054f2de1acfc45126b3a75	2019-12-16 20:03:14 -08:00
Zhichao Cao	cddd637997	Merge adjacent file block reads in RocksDB MultiGet() and Add uncompressed block to cache (#6089 ) Summary: In the current MultiGet, if the KV-pairs do not belong to the data blocks in the block cache, multiple blocks are read from a SST. It will trigger one block read for each block request and read them in parallel. In some cases, if some data blocks are adjacent in the SST, the reads for these blocks can be combined into a single large read, which can reduce the system calls and reduce the read latency if possible. Considering the need to fill the block cache, if multiple data blocks are in the same memory buffer, we need to copy them to the heap separately. Therefore, only in the case that 1) data block compression is enabled, and 2) compressed block cache is null, we can do combined read. Otherwise, extra memory copy is needed, which may cause extra overhead. In the current case, data blocks will be uncompressed to a new memory space. Also, in the case that 1) data block compression is enabled, and 2) compressed block cache is null, it is possible the data block is actually not compressed. In the current logic, these data blocks will not be added to the uncompressed_cache. So if the memory buffer is shared and the data block is not compressed, the data blocks are copied to the heap and fill the cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6089 Test Plan: Added test case to ParallelIO.MultiGet. Pass make asan_check Differential Revision: D18734668 Pulled By: zhichao-cao fbshipit-source-id: 67c5615ed373e51e42635fd74b36f8f3a66d5da4	2019-12-16 16:26:03 -08:00
Levi Tamasi	db7c687523	Fix a data race related to memtable trimming (#6187 ) Summary: https://github.com/facebook/rocksdb/pull/6177 introduced a data race involving `MemTableList::InstallNewVersion` and `MemTableList::NumFlushed`. The patch fixes this by caching whether the current version has any memtable history (i.e. flushed memtables that are kept around for transaction conflict checking) in an `std::atomic<bool>` member called `current_has_history_`, similarly to how `current_memory_usage_excluding_last_` is handled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6187 Test Plan: ``` make clean COMPILE_WITH_TSAN=1 make db_test -j24 ./db_test ``` Differential Revision: D19084059 Pulled By: ltamasi fbshipit-source-id: 327a5af9700fb7102baea2cc8903c085f69543b9	2019-12-16 13:16:31 -08:00
Levi Tamasi	bd8404feff	Do not schedule memtable trimming if there is no history (#6177 ) Summary: We have observed an increase in CPU load caused by frequent calls to `ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory` when using `max_write_buffer_size_to_maintain` to limit the amount of memtable history maintained for transaction conflict checking. Part of the issue is that trimming can potentially be scheduled even if there is no memtable history. The patch adds a check that fixes this. See also https://github.com/facebook/rocksdb/pull/6169. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6177 Test Plan: Compared `perf` output for ``` ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32 ``` before and after the change. There is a significant reduction for the call chain `rocksdb::DBImpl::TrimMemtableHistory` -> `rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape` even without https://github.com/facebook/rocksdb/pull/6169. Differential Revision: D19057445 Pulled By: ltamasi fbshipit-source-id: dff81882d7b280e17eda7d9b072a2d4882c50f79	2019-12-13 19:11:19 -08:00
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether it's retry-able or not, scope (i.e. fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	2019-12-13 14:48:41 -08:00
Peter Dillinger	58d46d1915	Add useful idioms to Random API (OneInOpt, PercentTrue) (#6154 ) Summary: And clean up related code, especially in stress test. (More clean up of db_stress_test_base.cc coming after this.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6154 Test Plan: make check, make blackbox_crash_test for a bit Differential Revision: D18938180 Pulled By: pdillinger fbshipit-source-id: 524d27621b8dbb25f6dff40f1081e7c00630357e	2019-12-13 14:30:14 -08:00
Levi Tamasi	6d54eb3dc2	Do not create/install new SuperVersion if nothing was deleted during memtable trim (#6169 ) Summary: We have observed an increase in CPU load caused by frequent calls to `ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory` when using `max_write_buffer_size_to_maintain` to limit the amount of memtable history maintained for transaction conflict checking. As it turns out, this is caused by the code creating and installing a new `SuperVersion` even if no memtables were actually trimmed. The patch adds a check to avoid this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6169 Test Plan: Compared `perf` output for ``` ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32 ``` before and after the change. With the fix, the call chain `rocksdb::DBImpl::TrimMemtableHistory` -> `rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape` no longer registers in the `perf` report. Differential Revision: D19031509 Pulled By: ltamasi fbshipit-source-id: 02686fce594e5b50eba0710e4b28a9b808c8aa20	2019-12-13 13:29:29 -08:00
Levi Tamasi	583c6953d8	Move out valid blobs from the oldest blob files during compaction (#6121 ) Summary: The patch adds logic that relocates live blobs from the oldest N non-TTL blob files as they are encountered during compaction (assuming the BlobDB configuration option `enable_garbage_collection` is `true`), where N is defined as the number of immutable non-TTL blob files multiplied by the value of a new BlobDB configuration option called `garbage_collection_cutoff`. (The default value of this parameter is 0.25, that is, by default the valid blobs residing in the oldest 25% of immutable non-TTL blob files are relocated.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6121 Test Plan: Added unit test and tested using the BlobDB mode of `db_bench`. Differential Revision: D18785357 Pulled By: ltamasi fbshipit-source-id: 8c21c512a18fba777ec28765c88682bb1a5e694e	2019-12-13 10:13:05 -08:00
Jermy Li	c2029f9716	Support concurrent CF iteration and drop (#6147 ) Summary: It's easy to cause a core dump when closing a ColumnFamilyHandle with unreleased iterators, especially when iterator release is controlled by Java GC when using JNI. This patch fixes concurrent CF iteration and drop: iterators (actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early. Fixes https://github.com/facebook/rocksdb/issues/5982 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147 Differential Revision: D18926378 fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15	2019-12-12 19:04:48 -08:00
奏之章	c4ce8e637f	Fix RangeDeletion bug (#6062 ) Summary: When reading keys from a snapshot, if a range deletion was added after the snapshot was created and this range deletion is inside an immutable memtable, we get a wrong key set. More details are in the code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6062 Differential Revision: D18966785 Pulled By: pdillinger fbshipit-source-id: 38a60bb1e2d0a1dbfc8ec641617200b6a02b86c3	2019-12-12 15:18:02 -08:00
Connor	a844591201	wait pending memtable writes on file ingestion or compact range (#6113 ) Summary: This PR fixes two unordered_write related issues: - ingestion job may skip the necessary memtable flush https://github.com/facebook/rocksdb/issues/6026 - compact range may cause the memtable to be flushed before pending unordered writes have finished: 1. `CompactRange` triggers memtable flush but doesn't wait for pending writes 2. there are some pending writes but the memtable is already flushed 3. the memtable-related WAL is removed (note that the pending writes were recorded in that WAL) 4. pending writes are written to the newly created memtable 5. there is a restart 6. the previous pending writes are missing because the WAL is removed but they aren't included in an SST. How to solve: - Wait for pending memtable writes before the ingestion job checks the memtable key range - Wait for pending memtable writes before flushing the memtable. Note that `CompactRange` calls `RangesOverlapWithMemtables` too without waiting for pending writes, but I'm not sure whether it affects correctness. Test Plan: make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6113 Differential Revision: D18895674 Pulled By: maysamyabandeh fbshipit-source-id: da22b4476fc7e06c176020e7cc171eb78189ecaf	2019-12-12 14:08:02 -08:00
Levi Tamasi	e1dfe80fe0	Mark BlobIndex::DebugString const Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6157 Test Plan: make check Differential Revision: D18944259 Pulled By: ltamasi fbshipit-source-id: 7fb29447b52d801215bd6ab811e229a7fa2c763d	2019-12-11 17:19:43 -08:00
Peter Dillinger	d0ad3c59d8	Fix c_test:filter for various CACHE_LINE_SIZEs (#6153 ) Summary: This test was recently updated but failed to account for Bloom schema variance by CACHE_LINE_SIZE. (Since CACHE_LINE_SIZE is not defined in our C code, the test now simply allows a valid result for any CACHE_LINE_SIZE, not just the current one.) Unblock https://github.com/facebook/rocksdb/issues/5932 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6153 Test Plan: ran unit test with builds TEST_CACHE_LINE_SIZE=128, =256, and unset (64 on Intel) Differential Revision: D18936015 Pulled By: pdillinger fbshipit-source-id: e5e3852f95283d34d624632c1ae8d3adb2f2662c	2019-12-11 15:17:08 -08:00
奏之章	3717a88289	Fix UniversalCompaction trivial move bug (#6067 ) Summary: `curr.level` is `c->inputs_` index, not real level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6067 Differential Revision: D18935726 fbshipit-source-id: 4354e6e9cd900ca56c96e9d770f0ab6634e45daf	2019-12-11 11:27:53 -08:00
Yi Wu	05a86318a7	Remove unused low_pri_write_rate_limiter_ (#6068 ) Summary: `low_pri_write_rate_limiter_` is not being used. Removing. `WriteController` has an internal low_pri rate limiter which is the real rate limiter for low-pri writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6068 Test Plan: make Differential Revision: D18664120 fbshipit-source-id: dfe3e4de033cf3522b67781b383aad7d0936034c	2019-12-11 10:28:33 -08:00
sdong	a68dff5c35	Apply formatter to some recent commits (#6138 ) Summary: Formatter somehow complains some recent lines changed. Apply them to make the formatter happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6138 Test Plan: See CI passes. Differential Revision: D18895950 fbshipit-source-id: 7d1696cf3e3a682bc10a30cdca748a23c6565255	2019-12-09 15:49:49 -08:00
Peter Dillinger	e43d2c4424	Fix & test rocksdb_filterpolicy_create_bloom_full (#6132 ) Summary: Add overrides needed in FilterPolicy wrapper to fix rocksdb_filterpolicy_create_bloom_full (see issue https://github.com/facebook/rocksdb/issues/6129). Re-enabled assertion in BloomFilterPolicy::CreateFilter that was being violated. Expanded c_test to identify Bloom filter implementations by FP counts. (Without the fix, updated test will trigger assertion and fail otherwise without the assertion.) Fixes https://github.com/facebook/rocksdb/issues/6129 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6132 Test Plan: updated c_test, also run under valgrind. Differential Revision: D18864911 Pulled By: pdillinger fbshipit-source-id: 08e81d7b5368b08e501cd402ef5583f2650c19fa	2019-12-09 12:21:14 -08:00
Ziyue Yang	7e2f831924	Fix wrong ExtractUserKey usage in BlockBasedTableBuilder::EnterUnbuff… (#6100 ) Summary: BlockBasedTableBuilder uses ExtractUserKey in EnterUnbuffered. This would cause index filter building error, since user-provided timestamp is supported by ExtractUserKeyAndStripTimestamp, and it's used in Add. This commit changes ExtractUserKey to ExtractUserKeyAndStripTimestamp. A test case is also added by modifying DBBasicTestWithTimestampWithParam_ PutAndGet test in db_basic_test to cover ExtractUserKeyAndStripTimestamp usage in both kBuffered and kUnbuffered state of BlockBasedTableBuilder. Before the ExtractUserKeyAndStripTimstamp fix: ``` $ ./db_basic_test --gtest_filter="PutAndGet" Note: Google Test filter = PutAndGet [==========] Running 2 tests from 1 test case. [----------] Global test environment set-up. [----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam [ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0 db/db_basic_test.cc:2109: Failure db_->Get(ropts, cfh, "key" + std::to_string(j), &value) NotFound: db/db_basic_test.cc:2109: Failure db_->Get(ropts, cfh, "key" + std::to_string(j), &value) NotFound: db/db_basic_test.cc:2109: Failure db_->Get(ropts, cfh, "key" + std::to_string(j), &value) NotFound: db/db_basic_test.cc:2109: Failure db_->Get(ropts, cfh, "key" + std::to_string(j), &value) NotFound: db/db_basic_test.cc:2109: Failure db_->Get(ropts, cfh, "key" + std::to_string(j), &value) NotFound: [ FAILED ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false (1177 ms) [ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 [ OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1056 ms) [----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2233 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test case ran. (2233 ms total) [ PASSED ] 1 test. 
[ FAILED ] 1 test, listed below: [ FAILED ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false 1 FAILED TEST ``` After the ExtractUserKeyAndStripTimstamp fix: ``` $ ./db_basic_test --gtest_filter="PutAndGet" Note: Google Test filter = PutAndGet [==========] Running 2 tests from 1 test case. [----------] Global test environment set-up. [----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam [ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0 [ OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0 (1417 ms) [ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 [ OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1041 ms) [----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2458 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test case ran. (2458 ms total) [ PASSED ] 2 tests. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6100 Differential Revision: D18769654 Pulled By: riversand963 fbshipit-source-id: 76c2cf2c9a5e0d85db95d98e812e6af0c2a15c6b	2019-12-09 10:57:02 -08:00
Peter Dillinger	3a6d9436e8	Use SpecialSkipListFactory in RecalculateScoreAfterPicking (#6125 ) Summary: Test DBTestUniversalCompaction.RecalculateScoreAfterPicking was flaky on ARM, so it now uses SpecialSkipListFactory (like other tests) for predictable memtable flushes. Fixes https://github.com/facebook/rocksdb/issues/5736 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6125 Test Plan: while ./db_universal_compaction_test; do :; done # for a while on ARM and on Intel (both Linux) Differential Revision: D18864821 Pulled By: pdillinger fbshipit-source-id: 2f3ca0ea66ce420dcd6d41b0ec12377112a5a79f	2019-12-09 09:23:50 -08:00
sdong	7d79b32618	Break db_stress_tool.cc to a list of source files (#6134 ) Summary: db_stress_tool.cc now is a giant file. In order to make it easier to improve and maintain, break it down into multiple source files. Most classes are turned into their own files. Separate .h and .cc files are created for gflag definitions. Another pair of .h and .cc files is created for some common functions. Some test execution logic that is only loosely related to class StressTest is moved to db_stress_driver.h and db_stress_driver.cc. All the files are located under db_stress_tool/. The directory name is chosen as such because if we end it with either stress or test, .gitignore will ignore any file under it, which makes it prone to issues in development. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6134 Test Plan: Build under GCC7 with and without LITE on using GNU Make. Build with GCC 4.8. Build with cmake with -DWITH_TOOL=1 Differential Revision: D18876064 fbshipit-source-id: b25d0a7451840f31ac0f5ebb0068785f783fdf7d	2019-12-08 23:51:01 -08:00
Yanqin Jin	fe1147db1c	Let DBSecondary close files after catch up (#6114 ) Summary: After secondary instance replays the logs from primary, certain files become obsolete. The secondary should find these files, evict their table readers from table cache and close them. If this is not done, the secondary will hold on to these files and prevent their space from being freed. Test plan (devserver): ``` $./db_secondary_test --gtest_filter=DBSecondaryTest.SecondaryCloseFiles $make check $./db_stress -ops_per_thread=100000 -enable_secondary=true -threads=32 -secondary_catch_up_one_in=10000 -clear_column_family_one_in=1000 -reopen=100 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6114 Differential Revision: D18769998 Pulled By: riversand963 fbshipit-source-id: 5d1f151567247196164e1b79d8402fa2045b9120	2019-12-02 17:45:03 -08:00
David Palm	048472f620	Add missing DataBlock-related functions to the C-API (#6101 ) Summary: Adds two missing functions to the C-API: - `rocksdb_block_based_options_set_data_block_index_type` - `rocksdb_block_based_options_set_data_block_hash_ratio` This enables users in other languages to enjoy the new(-ish) feature. The changes here are partially overlapping with [another PR](https://github.com/facebook/rocksdb/pull/5630) but are more focused on the DataBlock indexing options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6101 Differential Revision: D18765639 fbshipit-source-id: 4a8947e71b179f26fa1eb83c267dd47ee64ac3b3	2019-12-02 11:00:09 -08:00
Yanqin Jin	09fcf4fb6b	Fix a potential bug scheduling unnecessary threads (#6104 ) Summary: RocksDB should decrement the counter `unscheduled_flushes_` as soon as the bg thread is scheduled. Before this fix, the counter is decremented only when the bg thread starts and picks an element from the flush queue. This may cause more than necessary bg threads to be scheduled. Not a correctness issue, but may affect flush thread count. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6104 Test Plan: ``` make check ``` Differential Revision: D18735584 Pulled By: riversand963 fbshipit-source-id: d36272d4a08a494aeeab6200a3cff7a3d1a2dc10	2019-11-27 14:48:49 -08:00
sdong	aa1857e2df	Support options.max_open_files = -1 with periodic_compaction_seconds (#6090 ) Summary: options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. This is because the file creation time is stored in table properties, which are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in the manifest. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090 Test Plan: Pass all existing tests; modify an existing test to force the manifest value to be 0 to simulate the backward compatibility case; manually open the DB generated with the change using release 4.2. Differential Revision: D18702268 fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb	2019-11-26 21:39:56 -08:00
Peter Dillinger	ca3b6c28c9	Expose and elaborate FilterBuildingContext (#6088 ) Summary: This change enables custom implementations of FilterPolicy to wrap a variety of NewBloomFilterPolicy and select among them based on contextual information such as table level and compaction style. * Moves FilterBuildingContext to public API and elaborates it with more useful data. (It would be nice to put more general options-like data, but at the time this object is constructed, we are using internal APIs ImmutableCFOptions and MutableCFOptions and don't have easy access to ColumnFamilyOptions that I can tell.) * Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to GetBuilderWithContext, because it's now public. * Plumbs through the table's "level_at_creation" for filter building context. * Simplified some tests by adding GetBuilder() to MockBlockBasedTableTester. * Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including sample wrapper class LevelAndStyleCustomFilterPolicy. * Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits where it does not reset perf context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088 Test Plan: make check, valgrind on db_bloom_filter_test Differential Revision: D18697817 Pulled By: pdillinger fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4	2019-11-26 18:24:10 -08:00
sdong	77eab5c85a	Make default value of options.ttl to be 30 days when it is supported. (#6073 ) Summary: By default options.ttl is disabled. We believe a better default will be 30 days, which means deleted data in the database will be removed from SST files slightly after 30 days, for most of the cases. Make the default UINT64_MAX - 1 to indicate that it is not overridden by users. Change the periodic_compaction_seconds default from UINT64_MAX to UINT64_MAX - 1 too, to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073 Test Plan: Add unit tests for it. Differential Revision: D18669626 fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8	2019-11-26 10:00:32 -08:00
Sagar Vemuri	669ea77d9f	Support ttl in Universal Compaction (#6071 ) Summary: `options.ttl` is now supported in universal compaction, similar to how periodic compactions are implemented in PR https://github.com/facebook/rocksdb/issues/5970 . Setting `options.ttl` will simply set `options.periodic_compaction_seconds` to execute the periodic compactions code path. Discarded PR https://github.com/facebook/rocksdb/issues/4749 in lieu of this. This is a short term work-around/hack of falling back to periodic compactions when ttl is set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6071 Test Plan: Added a unit test. Differential Revision: D18668336 Pulled By: sagar0 fbshipit-source-id: e75f5b81ba949f77ef9eff05e44bb1c757f58612	2019-11-22 22:13:35 -08:00
sdong	d8c28e692a	Support options.ttl with options.max_open_files = -1 (#6060 ) Summary: Previously, options.ttl could not be set with options.max_open_files != -1, because it makes use of the creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in the manifest and, when it is available, will be used instead. Note that this change will break forward compatibility for release 5.1 and older. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060 Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatibility in one test case by forcing the value to be 0. Differential Revision: D18631623 fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830	2019-11-22 21:23:00 -08:00
Little-Wallace	e50b64bdba	fix unstable unittest caused by #5958 (#6061 ) Summary: Signed-off-by: Little-Wallace <bupt2013211450@gmail.com> This PR fixes an unstable unit test added by (https://github.com/facebook/rocksdb/pull/5958). I set SYNC_POINT in PickCompaction before. If IntraL0Compaction was triggered, the compaction job that compacts sst files to the base level would start instantly. If the compaction thread runs faster than the unit test main thread, we may observe the number of files in L0 decrease. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6061 Differential Revision: D18642301 fbshipit-source-id: 3e4da2ee963532b6e142336951ea3f47d46df148	2019-11-21 15:24:01 -08:00
Yanqin Jin	0ce0edbe12	Fix a data race between GetColumnFamilyMetaData and MarkFilesBeingCompacted (#6056 ) Summary: Use db mutex to protect the execution of Version::GetColumnFamilyMetaData() called in DBImpl::GetColumnFamilyMetaData(). Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted() for access to FileMetaData::being_compacted. Other than mutex, there are several more alternatives. - Make FileMetaData::being_compacted an atomic variable. This will make FileMetaData non-copy-able. - Separate being_compacted from FileMetaData. This requires re-organizing data structures that are already used in many places. Test Plan (dev server): ``` make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056 Differential Revision: D18620488 Pulled By: riversand963 fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b	2019-11-20 16:36:29 -08:00
sdong	27ec3b3466	Sanitize input in DB::MultiGet() API (#6054 ) Summary: The new DB::MultiGet() doesn't validate input for num_keys > 1 and GCC-9 complains about it. Fix it by returning directly when num_keys == 0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6054 Test Plan: Build with GCC-9 and see it passes. Differential Revision: D18608958 fbshipit-source-id: 1c279aff3c7fe6e9d5a6d085ed02550ecea4fdb2	2019-11-20 10:38:01 -08:00
Peter Dillinger	0306e01233	Fixes for g++ 4.9.2 compatibility (#6053 ) Summary: Taken from merryChris in https://github.com/facebook/rocksdb/issues/6043 Stackoverflow ref on {{}} vs. {}: https://stackoverflow.com/questions/26947704/implicit-conversion-failure-from-initializer-list Note to reader: .clear() does not empty out an ostringstream, but .str("") suffices because we don't have to worry about clearing error flags. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6053 Test Plan: make check, manual run of filter_bench Differential Revision: D18602259 Pulled By: pdillinger fbshipit-source-id: f6190f83b8eab4e80e7c107348839edabe727841	2019-11-19 15:43:37 -08:00
Little-Wallace	ec3e3c3e02	Fix corruption with intra-L0 on ingested files (#5958 ) Summary: ## Problem Description Our process aborted when it called `CheckConsistency`. The information in `stderr` showed "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ". Here are the causes of the accident that I investigated. * RocksDB calls `CheckConsistency` whenever the `MANIFEST` file is updated. It checks the sequence number interval of every file, except files which were ingested. * When a file is ingested into RocksDB, it is assigned the global sequence number, so the minimum and maximum seqno of this file are both equal to the global sequence number. * `CheckConsistency` determines whether a file was ingested by whether the smallest and largest seqno of the sstable file are equal. * If IntraL0Compaction picks an sst which was just ingested and compacts it into another sst, the `smallest_seqno` of the new file will be smaller than its `largest_seqno`. * If more than one file was ingested before the memtable scheduled a flush, and they are all compacted into one new sstable file by `IntraL0Compaction`, the sequence interval of this new file will be included in the interval of the memtable, so `CheckConsistency` will return a `Corruption`. * If an sstable was ingested after the memtable was scheduled to flush, it would be assigned a larger seqno than the memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction started before the flush job of the memtable started, but completed after the flush job finished. So the new file produced by the compaction (we call it s1) would have a larger interval of sequence numbers than the file produced by the flush (we call it s2). But there was still some data in s1 written into RocksDB before s2, so it's possible that some data in s2 was covered by old data in s1.
Of course, it would also cause a `Corruption` because of the overlap of seqno. This is the relationship of the files: > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno So I skip picking sst files which were ingested in function `FindIntraL0Compaction ` ## Reason Here is my bug report: https://github.com/facebook/rocksdb/issues/5913 There are two situations that can cause the check to fail. ### First situation: - First we ingest five external sst into RocksDB, and they happen to be ingested in L0. There had been some data in the memtable, which makes the smallest sequence number of the memtable less than that of the sst that we ingest. - If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger an `IntraL0Compaction` which would compact these five sst from L0 to L0. We call this sst A, which was merged from five ingested sst. - Then some data was put into the memtable, and the memtable was flushed to L0. We call this sst B. - RocksDB checks consistency, finds that the `smallest_seqno` of B is less than that of A, and crashes. Because A was merged from five sst, its smallest sequence number was less than its largest sequence number, so RocksDB could not tell whether A was produced by ingestion. ### Second situation - First we have flushed many sst in L0; we call them [s1, s2, s3]. - There is an immutable memtable requested to be flushed, but because the flush thread is busy, it has not been picked. We call it m1. At that moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became an immutable memtable, it has a larger log sequence number than m1. - m1 is flushed to L0. Because it is small, this flush job finishes quickly. We call the result s5. - [s1, s2, s3, s4] are compacted into one sst in L0 by IntraL0Compaction. We call it s6. - compacted 4@0 files to L0 - When s6 is added into the manifest, the corruption happens.
This is because the largest sequence number of s6 is equal to that of s4, and both are larger than that of s5, while s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5. - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958 Differential Revision: D18601316 fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267	2019-11-19 15:09:11 -08:00
Levi Tamasi	019eb1f402	Disable blob iterator test with max_sequential_skip_in_iterations==0 in LITE mode (#6052 ) Summary: The SetOptions API used by the test is not supported in LITE mode, so we should skip the new chunk in this case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6052 Test Plan: Ran the unit tests both in regular and LITE mode. Differential Revision: D18601763 Pulled By: ltamasi fbshipit-source-id: 883d6882771e0fb4aae72bb77ba4e63d9febec04	2019-11-19 15:02:41 -08:00
tabokie	20b48c6478	Fix blob context when db_iter uses seek (#6051 ) Summary: Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex. Also patch existing test for the mentioned code path. Signed-off-by: tabokie <xy.tao@outlook.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051 Differential Revision: D18596274 Pulled By: ltamasi fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278	2019-11-19 11:39:02 -08:00
anand76	38cc611297	Fix test failure in LITE mode (#6050 ) Summary: GetSupportedCompressions() is not available in LITE build, so check and use Snappy compression in db_basic_test.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6050 Test Plan: make LITE=1 check make check Differential Revision: D18588114 Pulled By: anand1976 fbshipit-source-id: a193de58c44f91bcc237107f25dbc1b9458eef3d	2019-11-19 10:13:24 -08:00
anand76	5b9233bfe8	Fix a test failure on systems that don't have Snappy compression libraries (#6038 ) Summary: The ParallelIO/DBBasicTestWithParallelIO.MultiGet/11 test fails if Snappy compression library is not installed, since RocksDB defaults to Snappy if none is specified. So dynamically determine the supported compression types and pick the first one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6038 Differential Revision: D18532370 Pulled By: anand1976 fbshipit-source-id: a0a735114d1f8892ea09f7c4af8688d7bcc5b075	2019-11-18 09:37:18 -08:00
Little-Wallace	f65ec09ef8	Fix IngestExternalFile's bug with two_write_queue (#5976 ) Summary: When two_write_queue is enabled, IngestExternalFile performs EnterUnbatched on both write queues. SwitchMemtable also performs EnterUnbatched on the 2nd write queue when this option is enabled. When the call stack includes IngestExternalFile -> FlushMemTable -> SwitchMemtable, this results in a deadlock. The implemented solution is to pass on the existing writes_stopped argument in FlushMemTable to skip EnterUnbatched in SwitchMemtable. Fixes https://github.com/facebook/rocksdb/issues/5974 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5976 Differential Revision: D18535943 Pulled By: maysamyabandeh fbshipit-source-id: a4f9d4964c10d4a7ca06b1e0102ca2ec395512bc	2019-11-15 14:00:37 -08:00
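
Note on the #5958 entry above: the sequence-number relationship it describes (s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno) can be illustrated with a small self-contained sketch. This is not RocksDB code: `SeqnoRange`, `LooksIngested`, `Merge`, and `StrictlyContains` are hypothetical names introduced here, and the merge rule (minimum of the smallest, maximum of the largest) is an assumption about how a compaction output's range relates to its inputs.

```cpp
#include <cstdint>

// Hypothetical stand-in for a file's sequence-number range;
// this is NOT the RocksDB FileMetaData type.
struct SeqnoRange {
  uint64_t smallest;
  uint64_t largest;
};

// Per the commit message, the consistency check treats a file as
// ingested when its smallest and largest seqno are equal.
bool LooksIngested(const SeqnoRange& f) { return f.smallest == f.largest; }

// Assumed merge rule for a compaction output: the output range spans
// the ranges of both inputs.
SeqnoRange Merge(const SeqnoRange& a, const SeqnoRange& b) {
  return {a.smallest < b.smallest ? a.smallest : b.smallest,
          a.largest > b.largest ? a.largest : b.largest};
}

// True when `inner` lies strictly inside `outer`, which is the
// overlapping relationship the commit message reports as a Corruption.
bool StrictlyContains(const SeqnoRange& outer, const SeqnoRange& inner) {
  return outer.smallest < inner.smallest && inner.largest < outer.largest;
}
```

With made-up values, an old flushed file covering seqnos 10 to 20 merged with an ingested file at seqno 50 yields a range of 10 to 50. That range no longer looks ingested, and it strictly contains a flushed file covering 30 to 40, which matches the second situation in the commit message.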

4029 Commits