rocksdb

Author	SHA1	Message	Date
Zhichao Cao	5c30e6c088	Separate timestamp related test from db_basic_test (#6516 ) Summary: In some of the test, db_basic_test may cause time out due to its long running time. Separate the timestamp related test from db_basic_test to avoid the potential issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6516 Test Plan: pass make asan_check Differential Revision: D20423922 Pulled By: zhichao-cao fbshipit-source-id: d6306f89a8de55b07bf57233e4554c09ef1fe23a	2020-03-13 11:37:15 -07:00
Cheng Chang	0d2c8e47e8	OpenForReadOnly is not supported in LITE mode (#6523 ) Summary: In DBLogicalBlockSizeCacheTest, do not test OpenForReadOnly in LITE mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6523 Test Plan: watch test for LITE mode Differential Revision: D20420321 Pulled By: cheng-chang fbshipit-source-id: e45bf6f2800206d6f8ce9af7308e76a08de80643	2020-03-12 14:13:59 -07:00
Levi Tamasi	c15e85bdcb	Move BlobDB related files under db/ to db/blob/ (#6519 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6519 Test Plan: ``` make all make check ``` Differential Revision: D20400691 Pulled By: ltamasi fbshipit-source-id: 20ef911cf1c2c92c7f71ef0b493f9be64f2eef94	2020-03-12 11:00:56 -07:00
Cheng Chang	2d9efc9ab2	Cache result of GetLogicalBufferSize in Linux (#6457 ) Summary: In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down. This PR introduces two new APIs: 1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes. 2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes. Other modifications: 1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms. 2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`. 3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457 Test Plan: 1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`. 2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`. `make check` Differential Revision: D20131243 Pulled By: cheng-chang fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04	2020-03-11 18:40:05 -07:00
sdong	331e6199df	Include more information in file lock failure (#6507 ) Summary: When users fail to open a DB with file lock failure, it is sometimes hard for users to debug. We now include the time the lock is acquired and the thread ID that acquired the lock, to help users debug problems like this. Default Env's thread ID is used. Since type of lockedFiles is changed, rename it to follow naming convention too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6507 Test Plan: Add a unit test and improve an existing test to validate the case. Differential Revision: D20378333 fbshipit-source-id: 312fe0e9733fd1d1e9969c321b90ce523cf4708a	2020-03-11 16:23:08 -07:00
Levi Tamasi	37a635cfe6	Disambiguate CustomFieldTags for the unity build (#6513 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6513 Test Plan: `make unity_test` Differential Revision: D20388919 Pulled By: ltamasi fbshipit-source-id: 88dbceab0723a54ee3939e1644e13dc9a4c70420	2020-03-11 14:45:12 -07:00
Adam Retter	8fc20ac468	Add ppc64le builds to Travis (#6144 ) Summary: Let's see how this goes... Pull Request resolved: https://github.com/facebook/rocksdb/pull/6144 Differential Revision: D20387515 Pulled By: pdillinger fbshipit-source-id: ba2669348c267141dfddff910b4c2224a22cbb38	2020-03-11 12:33:45 -07:00
Levi Tamasi	f5bc3b99d5	Split BlobFileState into an immutable and a mutable part (#6502 ) Summary: It's never too soon to refactor something. The patch splits the recently introduced (`VersionEdit` related) `BlobFileState` into two classes `BlobFileAddition` and `BlobFileGarbage`. The idea is that once blob files are closed, they are immutable, and the only thing that changes is the amount of garbage in them. In the new design, `BlobFileAddition` contains the immutable attributes (currently, the count and total size of all blobs, checksum method, and checksum value), while `BlobFileGarbage` contains the mutable GC-related information elements (count and total size of garbage blobs). This is a better fit for the GC logic and is more consistent with how SST files are handled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6502 Test Plan: `make check` Differential Revision: D20348352 Pulled By: ltamasi fbshipit-source-id: ff93f0121e80ab15e0e0a6525ba0d6af16a0e008	2020-03-10 17:27:26 -07:00
Yanqin Jin	fd1da22111	Support options.max_open_files != -1 with FIFO compaction (#6503 ) Summary: Allow user to specify options.max_open_files != -1 with FIFO compaction. If max_open_files != -1, not all table files are kept open. In the past, FIFO style compaction requires all table files to be open in order to read file creation time from table properties. Later, we added file creation time to MANIFEST, making it possible to read file creation time without opening file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503 Test Plan: make check Differential Revision: D20353758 Pulled By: riversand963 fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296	2020-03-09 18:45:06 -07:00
Yanqin Jin	d93812c9ae	Iterator with timestamp (#6255 ) Summary: Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests. Create an iterator with timestamp. ``` ... read_opts.timestamp = &ts; auto* iter = db->NewIterator(read_opts); // target is key without timestamp. for (iter->Seek(target); iter->Valid(); iter->Next()) {} for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {} delete iter; read_opts.timestamp = &ts1; // lower_bound and upper_bound are without timestamp. read_opts.iterate_lower_bound = &lower_bound; read_opts.iterate_upper_bound = &upper_bound; auto* iter1 = db->NewIterator(read_opts); // Do Seek or SeekToFirst() delete iter1; ``` Test plan (dev server) ``` $make check ``` Simple benchmarking (dev server) 1. The overhead introduced by this PR even when timestamp is disabled. key size: 16 bytes value size: 100 bytes Entries: 1000000 Data reside in main memory, and try to stress iterator. Repeated three times on master and this PR. - Seek without next ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 ``` master: 159047.0 ops/sec this PR: 158922.3 ops/sec (2% drop in throughput) - Seek and next 10 times ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10 ``` master: 109539.3 ops/sec this PR: 107519.7 ops/sec (2% drop in throughput) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255 Differential Revision: D19438227 Pulled By: riversand963 fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746	2020-03-06 16:24:27 -08:00
Yuqi Gu	e171a219d5	Fix db_wal_test::TruncateLastLogAfterRecoverWithoutFlush failure (#6437 ) Summary: `TruncateLastLogAfterRecoverWithoutFlush` case depends on fallocate support of underlying file system. On a file system which lacks of this feature, like zfs, it will fail to allocate predefined file size as this test case intends to do; So a check block is added to detect fallocate support and skip test if not. The related work is done by JunHe77. Thanks! Signed-off-by: Yuqi Gu <yuqi.gu@arm.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6437 Differential Revision: D20145032 Pulled By: pdillinger fbshipit-source-id: c8b691dc508e95acfa2a004ddbc07e2faa76680d	2020-03-05 17:18:16 -08:00
Cheng Chang	afb97094ae	Skip high levels with no key falling in the range in CompactRange (#6482 ) Summary: In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped. This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482 Test Plan: A new test ManualCompactionTest::SkipLevel is added. Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index. Also updated db_compaction_test and db_test to use correct range for full compaction. Differential Revision: D20251149 Pulled By: cheng-chang fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b	2020-03-04 20:15:25 -08:00
Zhichao Cao	e62fe50634	Introduce FaultInjectionTestFS to test fault File system instead of Env (#6414 ) Summary: In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller. A set of ErrorHandlerFSTests are introduced for testing Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414 Test Plan: pass make asan_check, pass error_handler_fs_test. Differential Revision: D20252421 Pulled By: zhichao-cao fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21	2020-03-04 12:35:05 -08:00
Kefu Chai	03dbd11ead	s/const auto/const auto&/ when doing loop (#6477 ) Summary: this silences following warning from clang-11 ``` rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:21: warning: loop variable 'newf' of type 'const std::pair<int, rocksdb::FileMetaData>' creates a copy from type 'const std::pair<int\ , rocksdb::FileMetaData>' [-Wrange-loop-analysis] for (const auto newf : c->edit()->GetNewFiles()) { ^ rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:10: note: use reference type 'const std::pair<int, rocksdb::FileMetaData> &' to prevent copying for (const auto newf : c->edit()->GetNewFiles()) { ^~~~~~~~~~~~~~~~~ & ``` Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6477 Differential Revision: D20211850 Pulled By: ltamasi fbshipit-source-id: 3e89e13a12bba79f1b934d46b7c4c0576cdafb01	2020-03-03 08:41:57 -08:00
sdong	17bef7d3a8	Fix data race of GetCreationTimeOfOldestFile() (#6473 ) Summary: When DBImpl::GetCreationTimeOfOldestFile() calls Version::GetCreationTimeOfOldestFile(), the version is not directly or indirectly referenced, so an event like compaction can race with the operation and cause DBImpl::GetCreationTimeOfOldestFile() to access delocated data. This was caught by an ASAN run: ==268==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000b7d198 at pc 0x000018332913 bp 0x7f391510d310 sp 0x7f391510d308 READ of size 8 at 0x612000b7d198 thread T845 (store_load-33) SCARINESS: 51 (8-byte-read-heap-use-after-free) #0 0x18332912 in rocksdb::Version::GetCreationTimeOfOldestFile(unsigned long) rocksdb/src/db/version_set.cc:1488 https://github.com/facebook/rocksdb/issues/1 0x1803ddaa in rocksdb::DBImpl::GetCreationTimeOfOldestFile(unsigned long) rocksdb/src/db/db_impl/db_impl.cc:4499 https://github.com/facebook/rocksdb/issues/2 0xe24ca09 in rocksdb::StackableDB::GetCreationTimeOfOldestFile(unsigned long) rocksdb/utilities/stackable_db.h:392 ...... 0x612000b7d198 is located 216 bytes inside of 296-byte region [0x612000b7d0c0,0x612000b7d1e8) freed by thread T28 here: ...... https://github.com/facebook/rocksdb/issues/5 0x1832c73f in std::vector<rocksdb::FileMetaData, std::allocator<rocksdb::FileMetaData> >::~vector() third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/stl_vector.h:435 https://github.com/facebook/rocksdb/issues/6 0x1832c73f in rocksdb::VersionStorageInfo::~VersionStorageInfo() rocksdb/src/db/version_set.cc:734 https://github.com/facebook/rocksdb/issues/7 0x1832cf42 in rocksdb::Version::~Version() rocksdb/src/db/version_set.cc:758 https://github.com/facebook/rocksdb/issues/8 0x9d1bb5 in rocksdb::Version::Unref() rocksdb/src/db/version_set.cc:2869 https://github.com/facebook/rocksdb/issues/9 0x183e7631 in rocksdb::Compaction::~Compaction() rocksdb/src/db/compaction/compaction.cc:275 https://github.com/facebook/rocksdb/issues/10 0x9e6de6 in std::default_delete<rocksdb::Compaction>::operator()(rocksdb::Compaction) const third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:78 https://github.com/facebook/rocksdb/issues/11 0x9e6de6 in std::unique_ptr<rocksdb::Compaction, std::default_delete<rocksdb::Compaction> >::reset(rocksdb::Compaction) third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:376 https://github.com/facebook/rocksdb/issues/12 0x9e6de6 in rocksdb::DBImpl::BackgroundCompaction(bool, rocksdb::JobContext, rocksdb::LogBuffer, rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2826 https://github.com/facebook/rocksdb/issues/13 0x9ac3b8 in rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2320 https://github.com/facebook/rocksdb/issues/14 0x9abff7 in rocksdb::DBImpl::BGWorkCompaction(void*) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2096 ...... Fix the issue by reference the super version and use the referenced version from it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6473 Test Plan: Run ASAN for all existing tests. Differential Revision: D20196416 fbshipit-source-id: 5f4a7918110fc7b8dd7841932d376bc9d1e59d6f	2020-03-02 16:37:01 -08:00
Zhichao Cao	8d73137ae8	Replace Directory with FSDirectory in DB (#6468 ) Summary: In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468 Test Plan: pass make asan_check Differential Revision: D20195261 Pulled By: zhichao-cao fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1	2020-03-02 16:16:26 -08:00
Huisheng Liu	904a60ff63	return timestamp from get (#6409 ) Summary: Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included. ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `72ee067b9`): 101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found) This PR: 100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found) ./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409 Differential Revision: D20200086 Pulled By: riversand963 fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50	2020-03-02 16:01:00 -08:00
Michael R. Crusoe	051696bf98	fix some spelling typos (#6464 ) Summary: Found from Debian's "Lintian" program Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464 Differential Revision: D20162862 Pulled By: zhichao-cao fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12	2020-02-28 14:14:03 -08:00
Andrew Kryczka	69679e7375	Fix range deletion tombstone ingestion with global seqno (#6429 ) Summary: Original author: jeffrey-xiao If we are writing a global seqno for an ingested file, the range tombstone metablock gets accessed and put into the cache during ingestion preparation. At the time, the global seqno of the ingested file has not yet been determined, so the cached block will not have a global seqno. When the file is ingested and we read its range tombstone metablock, it will be returned from the cache with no global seqno. In that case, we use the actual seqnos stored in the range tombstones, which are all zero, so the tombstones cover nothing. This commit removes global_seqno_ variable from Block. When iterating over a block, the global seqno for the block is determined by the iterator instead of storing this mutable attribute in Block. Additionally, this commit adds a regression test to check that keys are deleted when ingesting a file with a global seqno and range deletion tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429 Differential Revision: D19961563 Pulled By: ajkr fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7	2020-02-25 15:31:48 -08:00
Levi Tamasi	d87c10c6ab	Add blob file state to VersionEdit (#6416 ) Summary: BlobDB currently does not keep track of blob files: no records are written to the manifest when a blob file is added or removed, and upon opening a database, the list of blob files is populated simply based on the contents of the blob directory. This means that lost blob files cannot be detected at the moment. We plan to solve this issue by making blob files a part of `Version`; as a first step, this patch makes it possible to store information about blob files in `VersionEdit`. Currently, this information includes blob file number, total number and size of all blobs, and total number and size of garbage blobs. However, the format is extensible: new fields can be added in both a forward compatible and a forward incompatible manner if needed (similarly to `kNewFile4`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6416 Test Plan: `make check` Differential Revision: D19894234 Pulled By: ltamasi fbshipit-source-id: f9753e1f2aedf6dadb70c09b345207cb9c58c329	2020-02-24 18:39:53 -08:00
Yanqin Jin	890d87fadc	Some minor fix-ups (#6440 ) Summary: Cleanup some code without any real change in functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440 Differential Revision: D20015891 Pulled By: riversand963 fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a	2020-02-21 15:09:56 -08:00
Zaiyang Li	fcec56e86c	Add function to set row cache on rocksdb_options_t (#6442 ) Summary: Adding a C API function to set `row_cache` on `rocksdb_options_t` as this functionality is missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6442 Differential Revision: D20036813 Pulled By: riversand963 fbshipit-source-id: c1fa95ea343345fbc1e57961d0d048e0e79be373	2020-02-21 11:12:42 -08:00
Yanqin Jin	362b8d4393	Fix MANIFEST name assignment (#6426 ) Summary: Currently, a new MANIFEST file is assigned a new file number when 1) no MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is not sufficient. There are cases when the caller explicitly specifies that a new MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true, and there are WAL files, then RocksDB will run into an issue during recovery. `DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs, `DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call `LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST. However, the manifest_file_number is wrong before this fix. Consequently, RocksDB opens an existing, non-empty file for append, effectively truncating the file to zero. If a crash occurs, then there will be data loss. Test Plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426 Test Plan: make check Differential Revision: D19951866 Pulled By: riversand963 fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da	2020-02-20 14:30:58 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
Andrew Kryczka	c6abe30ee3	Fix concurrent full purge and WAL recycling (#5900 ) Summary: We were removing the file from `log_recycle_files_` before renaming it with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside the DB mutex, it was possible for a concurrent full purge to sneak in and delete the file before it could be renamed. Consequently, `SwitchMemtable()` would fail and the DB would enter read-only mode. The fix is to hold the old file number in `log_recycle_files_` until after the file has been renamed. Full purge uses that list to decide which files to keep, so it can no longer delete a file pending recycling. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900 Test Plan: new unit test Differential Revision: D19771719 Pulled By: ajkr fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8	2020-02-18 13:54:13 -08:00
Andrew Kryczka	0f9dcb88b2	Return NotSupported from WriteBatchWithIndex::DeleteRange (#5393 ) Summary: As discovered in https://github.com/facebook/rocksdb/issues/5260 and https://github.com/facebook/rocksdb/issues/5392, reads on the indexed batch do not account for range tombstones. So, return `Status::NotSupported` from `WriteBatchWithIndex::DeleteRange` until we properly support it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5393 Test Plan: added unit test Differential Revision: D19912360 Pulled By: ajkr fbshipit-source-id: 0bbfc978ea015d64516ca708fce2429abba524cb	2020-02-18 11:18:25 -08:00
Cheng Chang	4034e289ad	Fail fast in paranoid mode when LoadTableHandlers fail during recovering (#6368 ) Summary: Previously, when recovering version set, LoadTableHandlers failures are ignored. If paranoid_checks is true, this failure should not be ignored, otherwise, the opened db might be in an inconsistent state. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6368 Test Plan: make check Differential Revision: D19713459 Pulled By: cheng-chang fbshipit-source-id: 68cb94f4f2cc43f8b024b14755193cd45cfcad55	2020-02-14 08:17:10 -08:00
Manuel Ung	908b1ee64e	WriteUnPrepared: Fix assertion during recovery (#6419 ) Summary: During recovery, multiple (un)prepared batches could exist in the same WAL record due to group commit. This breaks an assertion in `MemTableInserter::MarkBeginPrepare`. To fix, reset unprepared_batch_ to false after `MarkEndPrepare`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6419 Differential Revision: D19896148 Pulled By: lth fbshipit-source-id: b1a32ef88f775a0881264a18bd1a4a5b8c85eee3	2020-02-13 18:52:05 -08:00
sdong	ac8e89a443	Should flush and sync WAL when writing it in DB::Open() (#6417 ) Summary: A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417 Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix. Differential Revision: D19894537 fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e	2020-02-13 18:41:04 -08:00
Huisheng Liu	5138764eb5	Fix destroydb (#6308 ) Summary: It's observed on Windows DestroyDB failed to remove the log file because the logger is still alive in sst file manager and holding a handle to the log file. This fix makes sure the logger is released before attempt to clear the database directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6308 Differential Revision: D19818829 Pulled By: riversand963 fbshipit-source-id: 54c3e6859aadaaba4a49b3e851b73dc35ec7dc6a	2020-02-13 11:21:27 -08:00
Cheng Chang	a676001f95	Revert usage of Defer. (#6410 ) Summary: Seems like this caused the following test failure on AppVeyor: DBTest2.CrashInRecoveryMultipleCF c:\projects\rocksdb\db\db_test_util.cc(107): error: DestroyDB(dbname_, options) IO error: Failed to delete: C:\projects\rocksdb\db_tests\\testrocksdb-3112//db_test2_10791409581227174103/000013.sst: Access is denied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6410 Test Plan: Wait to see whether the AppVeyor test passes. Differential Revision: D19879872 Pulled By: cheng-chang fbshipit-source-id: 59a9c55ca88566e9210c0b715ecc45a4fd9afe26	2020-02-13 10:31:32 -08:00
anand76	3e49249d30	Ensure all MultiGet IO errors are propagated to user (#6403 ) Summary: Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403 Test Plan: Unit tests Tried running stress tests but couldn't repro the stress failure Differential Revision: D19846721 Pulled By: anand1976 fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4	2020-02-11 17:27:22 -08:00
anand76	35ed530d2c	Revert "Check KeyContext status in MultiGet (#6387 )" (#6401 ) Summary: This reverts commit `d70011bccc`. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401 Differential Revision: D19826623 Pulled By: anand1976 fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1	2020-02-10 22:23:36 -08:00
Cheng Chang	dafb568052	Add utility class Defer (#6382 ) Summary: Add a utility class `Defer` to defer the execution of a function until the Defer object goes out of scope. Used in VersionSet:: ProcessManifestWrites as an example. The inline comments for class `Defer` have more details. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6382 Test Plan: `make defer_test version_set_test && ./defer_test && ./version_set_test` Differential Revision: D19797538 Pulled By: cheng-chang fbshipit-source-id: b1a9b7306e4fd4f48ec2ab55783caa561a315f0f	2020-02-10 17:59:47 -08:00
Levi Tamasi	cbf5f3be43	Do not move VersionEdit into AtomicGroupReadBuffer (#6400 ) Summary: https://github.com/facebook/rocksdb/pull/6383 surfaced an issue with `VersionSet`/`ReactiveVersionSet` and `AtomicGroupReadBuffer::AddEdit` (which was added in https://github.com/facebook/rocksdb/pull/5411): `AddEdit` moves the `VersionEdit` passed to it into `replay_buffer_`, however, the client `VersionSet` classes keep using it afterwards. This seemed to work before the refactoring but it really did not: since `VersionEdit` used to have a user-declared destructor, no move constructor/move assignment operator was generated, and the `move` in `AddEdit` was really a copy. The patch makes the copy explicit. Note: it should be possible to rework this logic so that we can get away with the move but for now, this should fix the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6400 Test Plan: `make check` `make analyze` Differential Revision: D19824466 Pulled By: ltamasi fbshipit-source-id: f38033967daf2a39c78dcd6e12978bafe37632b4	2020-02-10 17:15:42 -08:00
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	2020-02-10 15:52:52 -08:00
Yutian Li	2e0159ec9e	Add error status for no_slowdown & low priority write (#6396 ) Summary: When `no_slowdown` is enabled, it returns `Status::Incomplete("Write stall")` if a stall would occur. This patch adds descriptive text for when `no_slowdown` and `low_pri` are enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6396 Differential Revision: D19808978 Pulled By: cheng-chang fbshipit-source-id: a53b0d25ed414c821a086531e0222027f925e627	2020-02-10 12:33:16 -08:00
anand76	d70011bccc	Check KeyContext status in MultiGet (#6387 ) Summary: Currently, any IO errors and checksum mismatches while reading data blocks, are being ignored by the batched MultiGet. Its only looking at the GetContext state. Fix that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387 Test Plan: Add unit tests Differential Revision: D19799819 Pulled By: anand1976 fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969	2020-02-07 16:48:16 -08:00
Robert Yang	4e457278fa	db/write_thread.cc: Initialize state (#6275 ) Summary: Fixed an error when compiled with -Og: db/write_thread.cc:183:14: error: 'state' may be used uninitialized in this function [-Werror=maybe-uninitialized] Signed-off-by: Robert Yang <liezhi.yang@windriver.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6275 Differential Revision: D19381755 fbshipit-source-id: a90bf3cd4a7248d9d71219e918fc6253deb97e3c	2020-02-07 15:20:38 -08:00
Levi Tamasi	752c87af78	Clean up VersionEdit a bit (#6383 ) Summary: This is a bunch of small improvements to `VersionEdit`. Namely, the patch * Makes the names and order of variables, methods, and code chunks related to the various information elements more consistent, and adds missing getters for the sake of completeness. * Initializes previously uninitialized stack variables. * Marks all getters const to improve const correctness. * Adds in-class initializers and removes the default ctor that would create an object with uninitialized built-in fields and call `Clear` afterwards. * Adds a new type alias for new files and changes the existing `typedef` for deleted files into a type alias as well. * Makes the helper method `DecodeNewFile4From` private. * Switches from long-winded iterator syntax to range based loops in a couple of places. * Fixes a couple of assignments where an integer 0 was assigned to boolean members. * Fixes a getter which used to return a `const std::string` instead of the intended `const std::string&`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6383 Test Plan: make check Differential Revision: D19780537 Pulled By: ltamasi fbshipit-source-id: b0b4f09fee0ec0e7c7b7a6d76bfe5346e91824d0	2020-02-07 13:27:06 -08:00
Cheng Chang	5f478b9f75	Remove outdated comment (#6379 ) Summary: Since the logic for handling IDENTITY file is now inside `NewDB`, the comment above `NewDB` is no longer relevant. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6379 Test Plan: not needed Differential Revision: D19795440 Pulled By: cheng-chang fbshipit-source-id: 0b1cca87ac6d92474701c46aa4c8d4d708bfa19b	2020-02-07 13:18:43 -08:00
Cheng Chang	0a74e1b958	Add status checks during DB::Open (#6380 ) Summary: Several statuses were not checked during DB::Open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6380 Test Plan: make check Differential Revision: D19780237 Pulled By: cheng-chang fbshipit-source-id: c8d189d20344bd1607890dd1449345bda2ef96b9	2020-02-07 12:32:09 -08:00
Yanqin Jin	f361cedf06	Atomic flush rollback once on failure (#6385 ) Summary: Before this fix, atomic flush codepath may hit an assertion failure on a specific failure case. If all flush jobs within an atomic flush succeed (they do not write to MANIFEST), but batch writing version edits to MANIFEST fails, then `cfd->imm()->RollbackMemTableFlush()` will be called twice, and the second invocation hits assertion failure `assert(m->flush_in_progress_)` since the first invocation resets the variable `flush_in_progress_` to false already. Test plan (dev server): ``` ./db_flush_test --gtest_filter=DBAtomicFlushTest/DBAtomicFlushTest.RollbackAfterFailToInstallResults make check ``` Both must succeed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6385 Differential Revision: D19782943 Pulled By: riversand963 fbshipit-source-id: 84e1592625e729d1b70fdc8479959387a74cb121	2020-02-07 10:52:10 -08:00
Cheng Chang	f5f79f01a2	Be able to read compatible leveldb sst files (#6370 ) Summary: In `DBSSTTest.SSTsWithLdbSuffixHandling`, some sst files are renamed to ldb files, the original intention of the test is to test that the ldb files can be loaded along with the sst files. The original test checks this by `ASSERT_NE("NOT_FOUND", Get(Key(k)))`, but the problem is `Get(Key(k))` returns IO error due to path not found instead of NOT_FOUND, so the success of ASSERT_NE does not mean the key can be retrieved. This PR updates the test to make sure Get(Key(k)) returns the original value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6370 Test Plan: make db_sst_test && ./db_sst_test Differential Revision: D19726278 Pulled By: cheng-chang fbshipit-source-id: 993127f56457b315e669af4eeb92d6f956b7a4b7	2020-02-06 10:15:44 -08:00
Mike Kolupaev	1ed7d9b1b5	Avoid lots of calls to Env::GetFileSize() in SstFileManagerImpl when opening DB (#6363 ) Summary: Before this PR it calls GetFileSize() once for each sst file in the DB. This can take a long time if there are be tens of thousands of sst files (e.g. in thousands of column families), and even longer if Env is talking to some remote service rather than local filesystem. This PR makes DB::Open() use sst file sizes that are already known from manifest (typically almost all files in the DB) and only call GetFileSize() for non-sst or obsolete files. Note that GetFileSize() is also called and checked against manifest in CheckConsistency(), so the calls in SstFileManagerImpl were completely redundant. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6363 Test Plan: deployed to a test cluster, looked at a dump of Env calls (from a custom instrumented Env) - no more thousands of GetFileSize()s. Differential Revision: D19702509 Pulled By: al13n321 fbshipit-source-id: 99f8110620cb2e9d0c092dfcdbb11f3af4ff8b73	2020-02-04 13:41:53 -08:00
sdong	69c8614815	Avoid to get manifest file size when recovering from it. (#6369 ) Summary: Right now RocksDB gets manifest file size before recovering from it. The information is available in LogReader. Use it instead to prevent one file system call. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6369 Test Plan: Run all existing tests Differential Revision: D19714872 fbshipit-source-id: 0144be324d403c99e3da875ea2feccc8f64e883d	2020-02-04 11:39:23 -08:00
Mike Kolupaev	637e64b9ac	Add an option to prevent DB::Open() from querying sizes of all sst files (#6353 ) Summary: When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open. If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be really slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes. We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353 Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble. Differential Revision: D19656425 Pulled By: al13n321 fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82	2020-02-04 01:27:26 -08:00
anand76	7330ec0ff1	Fix a test failure in error_handler_test (#6367 ) Summary: Fix an intermittent failure in DBErrorHandlingTest.CompactionManifestWriteError due to a race between background error recovery and the main test thread calling TEST_WaitForCompact(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6367 Test Plan: Run the test using gtest_parallel Differential Revision: D19713802 Pulled By: anand1976 fbshipit-source-id: 29e35dc26e0984fe8334c083e059f4fa1f335d68	2020-02-03 18:16:52 -08:00
sdong	f195d8d523	Use ReadFileToString() to get content from IDENTITY file (#6365 ) Summary: Right now when reading IDENTITY file, we use a very similar logic as ReadFileToString() while it does an extra file size check, which may be expensive in some file systems. There is no reason to duplicate the logic. Use ReadFileToString() instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6365 Test Plan: RUn all existing tests. Differential Revision: D19709399 fbshipit-source-id: 3bac31f3b2471f98a0d2694278b41e9cd34040fe	2020-02-03 17:40:49 -08:00
sdong	36c504be17	Avoid create directory for every column families (#6358 ) Summary: A relatively recent regression causes for every CF, create and open directory is called for the DB directory, unless CF has a private directory. This doesn't scale well with large number of column families. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6358 Test Plan: Run all existing tests and see it pass. strace with db_bench --num_column_families and observe it doesn't open directory for number of column families. Differential Revision: D19675141 fbshipit-source-id: da01d9216f1dae3f03d4064fbd88ce71245bd9be	2020-02-03 14:13:39 -08:00
Huisheng Liu	eb4d6af5ae	Error handler test fix (#6266 ) Summary: MultiDBCompactionError fails when it verifies the number of files on level 0 and level 1 without waiting for compaction to finish. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6266 Differential Revision: D19701639 Pulled By: riversand963 fbshipit-source-id: e96d511bcde705075f073e0b550cebcd2ecfccdc	2020-02-03 13:32:53 -08:00
sdong	800d24ddc5	Fix DBTest2.ChangePrefixExtractor LITE build (#6356 ) Summary: DBTest2.ChangePrefixExtractor fails in LITE build because LITE build doesn't support adaptive build. Fix it by removing the stats check but only check correctness. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6356 Test Plan: Run the test with both of LITE and non-LITE build. Differential Revision: D19669537 fbshipit-source-id: 6d7dd6c8a79f18e80ca1636864b9c71922030d8e	2020-01-31 15:44:14 -08:00
sdong	ec496347bc	Add a unit test for prefix extractor changes (#6323 ) Summary: Add a unit test for prefix extractor change, including a check that fails due to a bug. Also comment out the partitioned filter case which will fail the test too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6323 Test Plan: Run the test and it passes (and fails if the SeekForPrev() part is uncommented) Differential Revision: D19509744 fbshipit-source-id: 678202ca97b5503e9de73b54b90de9e5ba822b72	2020-01-31 11:02:03 -08:00
Maysam Yabandeh	3316d29221	Disable recycle_log_file_num when it is incompatible with recovery mode (#6351 ) Summary: Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351 Differential Revision: D19648931 Pulled By: maysamyabandeh fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7	2020-01-31 07:28:30 -08:00
Yanqin Jin	f2fbc5d668	Shorten certain test names to avoid infra failure (#6352 ) Summary: Unit test names, together with other components, are used to create log files during some internal testing. Overly long names cause infra failure due to file names being too long. Look for internal tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6352 Differential Revision: D19649307 Pulled By: riversand963 fbshipit-source-id: 6f29de096e33c0eaa87d9c8702f810eda50059e7	2020-01-30 23:10:24 -08:00
anand76	fb05b5a652	Force a new manifest file if append to current one fails (#6331 ) Summary: Fix for issue https://github.com/facebook/rocksdb/issues/6316 When an append/sync of the manifest file fails due to an IO error such as NoSpace, we don't always put the DB in read-only mode. This is true for flush and compactions, as well as foreground operatons such as column family add/drop, CompactFiles etc. Subsequent changes to the DB will be recorded in the same manifest file, which would have a corrupted record in the middle due to the previous failure. On next DB::Open(), it will fail to process the full manifest and data will be lost. To fix this, we reset VersionSet::descriptor_log_ on append/sync failure, which will force a new manifest file to be written on the next append. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6331 Test Plan: Add new unit tests in error_handler_test.cc Differential Revision: D19632951 Pulled By: anand1976 fbshipit-source-id: 68d527cb6e59a94cbbbf9f5a17a7f464381d51e3	2020-01-30 10:56:29 -08:00
sdong	71874c5aaf	Fix LITE build with DBTest2.AutoPrefixMode1 (#6346 ) Summary: DBTest2.AutoPrefixMode1 doesn't pass because auto prefix mode is not supported there. Fix it by disabling the test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6346 Test Plan: Run DBTest2.AutoPrefixMode1 in lite mode Differential Revision: D19627486 fbshipit-source-id: fbde75260aeecb7e6fc406e09c19a71a95aa5f08	2020-01-29 16:43:42 -08:00
sdong	02ac6c9a3c	Fix db_bloom_filter_test clang LITE build (#6340 ) Summary: db_bloom_filter_test break with clang LITE build with following message: db/db_bloom_filter_test.cc:23:29: error: unused variable 'kPlainTable' [-Werror,-Wunused-const-variable] static constexpr PseudoMode kPlainTable = -1; ^ Fix it by moving the declaration out of LITE build Pull Request resolved: https://github.com/facebook/rocksdb/pull/6340 Test Plan: USE_CLANG=1 LITE=1 make db_bloom_filter_test and without LITE=1 Differential Revision: D19609834 fbshipit-source-id: 0e88f5c6759238a94f9880d84c785ac18e7cdd7e	2020-01-29 12:57:48 -08:00
Maysam Yabandeh	2f973ca96e	Double Crash in kPointInTimeRecovery with TransactionDB (#6313 ) Summary: In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery. Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313 Differential Revision: D19458291 Pulled By: maysamyabandeh fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d	2020-01-29 11:40:55 -08:00
sdong	8f2bee6747	Add ReadOptions.auto_prefix_mode (#6314 ) Summary: Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314 Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run. Differential Revision: D19458717 fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce	2020-01-28 14:44:05 -08:00
Sagar Vemuri	4f6c86226c	Use the same oldest ancestor time in table properties and manifest Summary: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions passed 96 / 100 times. ``` With the fix: all runs (tried 100, 1000, 10000) succeed. ``` $ TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=1000 [1000/1000] DBCompactionTest.LevelTtlCascadingCompactions (1895 ms) ``` Test Plan: Build: ``` COMPILE_WITH_TSAN=1 make db_compaction_test -j100 ``` Without the fix: a few runs out of 100 fail: ``` $ TEST_TMPDIR=/dev/shm KEEP_DB=1 ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=100 ... ... Note: Google Test filter = DBCompactionTest.LevelTtlCascadingCompactions [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBCompactionTest [ RUN ] DBCompactionTest.LevelTtlCascadingCompactions db/db_compaction_test.cc:3687: Failure Expected equality of these values: oldest_time Which is: 1580155869 level_to_files[6][0].oldest_ancester_time Which is: 1580155870 DB is still at /dev/shm//db_compaction_test_6337001442947696266 [ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions (1432 ms) [----------] 1 test from DBCompactionTest (1432 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (1433 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions 1 FAILED TEST [80/100] DBCompactionTest.LevelTtlCascadingCompactions returned/aborted with exit code 1 (1489 ms) [100/100] DBCompactionTest.LevelTtlCascadingCompactions (1522 ms) FAILED TESTS (4/100): 1419 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/90) 1434 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/84) 1457 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/82) 1489 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/74) Differential Revision: D19587040 Pulled By: sagar0 fbshipit-source-id: 11191ae9940837643bff47ebe18b299b4be3d950	2020-01-27 19:58:53 -08:00
Andrew Kryczka	5b33cfa1e3	fix `WriteBufferManager` flush log message (#6335 ) Summary: It chooses the oldest memtable, not the largest one. This is an important difference for users whose CFs receive non-uniform write rates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6335 Differential Revision: D19588865 Pulled By: maysamyabandeh fbshipit-source-id: 62ad4325b0182f5f27858584cd73fd5978fb2cec	2020-01-27 15:49:22 -08:00
sdong	f10f135938	Fix regression bug of hash index with iterator total order seek (#6328 ) Summary: https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash. Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked Also, a null pointer check is added so that a bug like this won't cause segfault in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328 Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass. Differential Revision: D19586751 fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42	2020-01-27 15:44:54 -08:00
Levi Tamasi	f34782a67d	Fix the "records dropped" statistics (#6325 ) Summary: The earlier code used two conflicting definitions for the number of input records going into a compaction, one based on the `rocksdb.num.entries` table property and one based on `CompactionIterationStats`. The first one is correct and in line with how output records are counted, while the second one incorrectly ignores input records in various cases when the `CompactionIterator` advances or reseeks the input iterator (this can happen, amongst other cases, when dealing with `SingleDelete`s, regular `Delete`s, `Merge`s, and compaction filters). This can result in the code undercounting the input records and computing an incorrect value for "records dropped" during the compaction. The patch fixes this by switching over to the correct (table property based) input record count for "records dropped". Pull Request resolved: https://github.com/facebook/rocksdb/pull/6325 Test Plan: Tested using `make check` and `db_bench`. Differential Revision: D19525491 Pulled By: ltamasi fbshipit-source-id: 4340b0b2f41546db8e356db70ca02199e48fa636	2020-01-23 15:27:22 -08:00
anand76	0672a6db64	Fix queue manipulation in WriteThread::BeginWriteStall() (#6322 ) Summary: When there is a write stall, the active write group leader calls ```BeginWriteStall()``` to walk the queue of writers and remove any with the ```no_slowdown``` option set. There was a bug in the code which updated the back pointer but not the forward pointer (```link_newer```), corrupting the list and causing some threads to wait forever. This PR fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6322 Test Plan: Add a unit test in db_write_test Differential Revision: D19538313 Pulled By: anand1976 fbshipit-source-id: 6fbed819e594913f435886606f5d36f74f235c3a	2020-01-23 14:01:28 -08:00
matthewvon	e6e8b9e871	Correct pragma once problem with Bazel on Windows (#6321 ) Summary: This is a simple edit to have two #include file paths be consistent within range_del_aggregator.{h,cc} with everywhere else. The impact of this inconsistency is that it actual breaks a Bazel based build on the Windows platform. The same pragma once failure occurs with both Windows Visual C++ 2019 and clang for Windows 9.0. Bazel's "sandboxing" of the builds causes both compilers to not properly recognize "rocksdb/types.h" and "include/rocksdb/types.h" to be the same file (also comparator.h). My guess is that the backslash versus forward slash mixing within path names is the underlying issue. But, everything builds fine once the include paths in these two source files are consistent with the rest of the repository. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6321 Differential Revision: D19506585 Pulled By: ltamasi fbshipit-source-id: 294c346607edc433ab99eaabc9c880ee7426817a	2020-01-21 16:12:43 -08:00
Levi Tamasi	d305f13e21	Make DBCompactionTest.SkipStatsUpdateTest more robust (#6306 ) Summary: Currently, this test case tries to infer whether `VersionStorageInfo::UpdateAccumulatedStats` was called during open by checking the number of files opened against an arbitrary threshold (10). This makes the test brittle and results in sporadic failures. The patch changes the test case to use sync points to directly test whether `UpdateAccumulatedStats` was called. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6306 Test Plan: `make check` Differential Revision: D19439544 Pulled By: ltamasi fbshipit-source-id: ceb7adf578222636a0f51740872d0278cd1a914f	2020-01-21 12:55:55 -08:00
chenyou-fdu	931876e86e	Separate enable-WAL and disable-WAL writer to avoid unwanted data in log files (#6290 ) Summary: When we do concurrently writes, and different write operations will have WAL enable or disable. But the data from write operation with WAL disabled will still be logged into log files, which will lead to extra disk write/sync since we do not want any guarantee for these part of data. Detail can be found in https://github.com/facebook/rocksdb/issues/6280. This PR avoid mixing the two types in a write group. The advantage is simpler reasoning about the write group content Pull Request resolved: https://github.com/facebook/rocksdb/pull/6290 Differential Revision: D19448598 Pulled By: maysamyabandeh fbshipit-source-id: 3d990a0f79a78ea1bfc90773f6ebafc1884c20de	2020-01-17 15:54:55 -08:00
Matt Bell	7e5b04d04f	Expose atomic flush option in C API (#6307 ) Summary: This PR adds a `rocksdb_options_set_atomic_flush` function to the C API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6307 Differential Revision: D19451313 Pulled By: ltamasi fbshipit-source-id: 750495642ef55b1ea7e13477f85c38cd6574849c	2020-01-17 12:57:48 -08:00
sdong	d87cffaea4	Fix another bug caused by recent hash index fix (#6305 ) Summary: Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305 Test Plan: A problematic DB used to return NotFound incorrectly, and now able to return correct result. Will try to construct a unit test too.0 Differential Revision: D19438925 fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217	2020-01-17 01:41:04 -08:00
Levi Tamasi	73f65b457e	Adjust thread pool sizes when setting max_background_jobs dynamically (#6300 ) Summary: https://github.com/facebook/rocksdb/pull/2205 introduced a new configuration option called `max_background_jobs`, superseding the earlier options `max_background_flushes` and `max_background_compactions`. However, unlike `max_background_compactions`, setting `max_background_jobs` dynamically through the `SetDBOptions` interface does not adjust the size of the thread pools (see https://github.com/facebook/rocksdb/issues/6298). The patch fixes this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6300 Test Plan: Extended unit test. Differential Revision: D19430899 Pulled By: ltamasi fbshipit-source-id: 704006605b3c13c3d1b997ccc0831ee369721074	2020-01-16 14:35:10 -08:00
sdong	f8b5ef85ec	Fix a bug caused by recent fix of Prefix Hash (#6302 ) Summary: Recent fix to Prefix Hash https://github.com/facebook/rocksdb/pull/6292 caused a bug that the newly created NotFound status in hash index is never reset. This causes reseek or implict reseek to return wrong results sometimes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302 Test Plan: Add a unit test that would fail. Not fix. crash test with hash test would fail in several seconds. With the fix, it will run about several minutes before failing with another failure. Differential Revision: D19424572 fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566	2020-01-16 10:47:20 -08:00
sdong	d2b4d42d4b	Fix kHashSearch bug with SeekForPrev (#6297 ) Summary: When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target. Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297 Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix. Differential Revision: D19404695 fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3	2020-01-15 14:28:39 -08:00
sdong	76c117b24b	Fix LITE test build broken by recent commit (#6295 ) Summary: A recent commit adds a unit test that uses a function not available in LITE build. Fix it by avoiding the call Pull Request resolved: https://github.com/facebook/rocksdb/pull/6295 Test Plan: Run the test in LITE build and see it passes. Differential Revision: D19395678 fbshipit-source-id: 37b42835bae02511630d80f7cafb1179401bc033	2020-01-14 13:17:04 -08:00
sdong	894c6d21af	Bug when multiple files at one level contains the same smallest key (#6285 ) Summary: The fractional cascading index is not correctly generated when two files at the same level contains the same smallest or largest user key. The result would be that it would hit an assertion in debug mode and lower level files might be skipped. This might cause wrong results when the same user keys are of merge operands and Get() is called using the exact user key. In that case, the lower files would need to further checked. The fix is to fix the fractional cascading index. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6285 Test Plan: Add a unit test which would cause the assertion which would be fixed. Differential Revision: D19358426 fbshipit-source-id: 39b2b1558075fd95e99491d462a67f9f2298c48e	2020-01-13 16:27:42 -08:00
Qinfan Wu	6733be033e	More const pointers in C API (#6283 ) Summary: This makes it easier to call the functions from Rust as otherwise they require mutable types. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6283 Differential Revision: D19349991 Pulled By: wqfish fbshipit-source-id: e8da7a75efe8cd97757baef8ca844a054f2519b4	2020-01-10 19:27:09 -08:00
Sagar Vemuri	cfa585611d	Consider all compaction input files to compute the oldest ancestor time (#6279 ) Summary: Look at all compaction input files to compute the oldest ancestor time. In https://github.com/facebook/rocksdb/issues/5992 we changed how creation_time (aka oldest-ancestor-time) table property of compaction output files is computed from max(creation-time-of-all-compaction-inputs) to min(creation-time-of-all-inputs). This exposed a bug where, during compaction, the creation_time:s of only the L0 compaction inputs were being looked at, and all other input levels were being ignored. This PR fixes the issue. Some TTL compactions when using Level-Style compactions might not have run due to this bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6279 Test Plan: Enhanced the unit tests to validate that the correct time is propagated to the compaction outputs. Differential Revision: D19337812 Pulled By: sagar0 fbshipit-source-id: edf8a72f11e405e93032ff5f45590816debe0bb4	2020-01-10 19:02:42 -08:00
Maysam Yabandeh	eff5e076f5	unordered_write incompatible with max_successive_merges (#6284 ) Summary: unordered_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that and also reverts the changes performed by https://github.com/facebook/rocksdb/pull/6254, in which max_successive_merges was mistakenly declared incompatible with unordered_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6284 Differential Revision: D19356115 Pulled By: maysamyabandeh fbshipit-source-id: f06dadec777622bd75f267361c022735cf8cecb6	2020-01-10 16:53:19 -08:00
Yanqin Jin	6a9989381f	Fix compilation under LITE (#6277 ) Summary: Fix compilation under LITE by putting `#ifndef ROCKSDB_LITE` around a code block. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6277 Differential Revision: D19334157 Pulled By: riversand963 fbshipit-source-id: 947111ed68aa550f5ea424b216c1442a8af9e32b	2020-01-09 15:57:39 -08:00
Yanqin Jin	cfd9732f65	Remove inaccurate code comment (#6274 ) Summary: Remove a comment. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6274 Differential Revision: D19323151 Pulled By: riversand963 fbshipit-source-id: d0d804d6882edcd94e35544ef45578b32ff1caae	2020-01-08 17:51:42 -08:00
Huisheng Liu	e5b476f551	Update file indexer to take timestamp into consideration (#6205 ) Summary: Exclude timestamp in key comparison during boundary calculation to avoid key versions being excluded. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6205 Differential Revision: D19166765 Pulled By: riversand963 fbshipit-source-id: bbe08816fef8de349a83ebd59a595ad844021f24	2020-01-08 16:31:23 -08:00
Yanqin Jin	a8b1085ae2	Fix test in LITE mode (#6267 ) Summary: Currently, the recently-added test DBTest2.SwitchMemtableRaceWithNewManifest fails in LITE mode since SetOptions() returns "Not supported". I do not want to put `#ifndef ROCKSDB_LITE` because it reduces test coverage. Instead, just trigger compaction on a different column family. The bg compaction thread calling LogAndApply() may race with thread calling SwitchMemtable(). Test Plan (dev server): make check OPT=-DROCKSDB_LITE make check or run DBTest2.SwitchMemtableRaceWithNewManifest 100 times. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6267 Differential Revision: D19301309 Pulled By: riversand963 fbshipit-source-id: 88cedcca2f985968ed3bb234d324ffa2aa04ca50	2020-01-07 13:47:03 -08:00
Yanqin Jin	bce5189f4d	Fix error message (#6264 ) Summary: Fix an error message when CURRENT is not found. Test plan (dev server) ``` make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6264 Differential Revision: D19300699 Pulled By: riversand963 fbshipit-source-id: 303fa206386a125960ecca1dbdeff07422690caf	2020-01-07 12:32:20 -08:00
Connor1996	3e26a94ba1	Add oldest snapshot sequence property (#6228 ) Summary: Add oldest snapshot sequence property, so we can use `db.GetProperty("rocksdb.oldest-snapshot-sequence")` to get the sequence number of the oldest snapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6228 Differential Revision: D19264145 Pulled By: maysamyabandeh fbshipit-source-id: 67fbe5304d89cbc475bd404e30d1299f7b11c010	2020-01-07 08:36:44 -08:00
Yanqin Jin	1aaa145877	Fix a data race for cfd->log_number_ (#6249 ) Summary: A thread calling LogAndApply may release db mutex when calling WriteCurrentStateToManifest() which reads cfd->log_number_. Another thread can call SwitchMemtable() and writes to cfd->log_number_. Solution is to cache the cfd->log_number_ before releasing mutex in LogAndApply. Test Plan (on devserver): ``` $COMPILE_WITH_TSAN=1 make db_stress $./db_stress --acquire_snapshot_one_in=10000 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=zstd --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_pipelined_write=0 --flush_one_in=1000000 --format_version=5 --get_live_files_and_wal_files_one_in=1000000 --index_block_restart_interval=5 --index_type=0 --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=0 --nooverwritepercent=1 --open_files=500000 --ops_per_thread=100000000 --partition_filters=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --set_options_one_in=10000 --snapshot_hold_ops=100000 --subcompactions=2 --sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35 ``` Then repeat the following multiple times, e.g. 100 after compiling with tsan. ``` $./db_test2 --gtest_filter=DBTest2.SwitchMemtableRaceWithNewManifest ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6249 Differential Revision: D19235077 Pulled By: riversand963 fbshipit-source-id: 79467b52f48739ce7c27e440caa2447a40653173	2020-01-06 20:09:51 -08:00
Qinfan Wu	edaaa1fff2	Add range delete function to C-API (#6259 ) Summary: It seems that the C-API doesn't expose the range delete functionality at the moment, so add the API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6259 Differential Revision: D19290320 Pulled By: pdillinger fbshipit-source-id: 3f403a4c3446d2042d55f1ece7cdc9c040f40c27	2020-01-06 10:46:21 -08:00
Maysam Yabandeh	28e5a9a9fb	Increase max_log_size in FlushJob to 1024 bytes (#6258 ) Summary: When measure_io_stats_ is enabled, the volume of logging is beyond the default limit of 512 size. The patch allows the EventLoggerStream to change the limit, and also sets it to 1024 for FlushJob. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6258 Differential Revision: D19279269 Pulled By: maysamyabandeh fbshipit-source-id: 3fb5d468dad488f289ac99d713378177eb7504d6	2020-01-06 10:16:52 -08:00
Maysam Yabandeh	48a678b7c9	Prevent an incompatible combination of options (#6254 ) Summary: allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254 Differential Revision: D19265819 Pulled By: maysamyabandeh fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179	2020-01-02 16:15:06 -08:00
sdong	ef91894798	Fix potential overflow in CalculateSSTWriteHint() (#6212 ) Summary: level passed into ColumnFamilyData::CalculateSSTWriteHint() can be smaller than base_level in current version, which would cause overflow. We see ubsan complains: db/compaction/compaction_job.cc:1511:39: runtime error: load of value 4294967295, which is not a valid value for type 'Env::WriteLifeTimeHint' and I hope this commit fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6212 Test Plan: Run existing tests and see them to pass. Differential Revision: D19168442 fbshipit-source-id: bf8fd86f85478ecfa7556db46dc3242de8c83dc9	2019-12-18 17:04:15 -08:00
Jermy Li	f453bcb40d	Add unit tests for concurrent CF iteration and drop (#6180 ) Summary: improve https://github.com/facebook/rocksdb/issues/6147 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6180 Differential Revision: D19148936 fbshipit-source-id: f691c9879fd51d54e96c1a99670cf85ca4485a89	2019-12-18 11:54:35 -08:00
Mike Kolupaev	ce63eda6f0	Fix use-after-free and double-deleting files in BackgroundCallPurge() (#6193 ) Summary: The bad code was: ``` mutex.Lock(); // `mutex` protects `container` for (auto& x : container) { mutex.Unlock(); // do stuff to x mutex.Lock(); } ``` It's incorrect because both `x` and the iterator may become invalid if another thread modifies the container while this thread is not holding the mutex. Broken by https://github.com/facebook/rocksdb/pull/5796 - it replaced a `while (!container.empty())` loop with a `for (auto x : container)`. (RocksDB code does a lot of such unlocking+re-locking of mutexes, and this type of bugs comes up a lot :/ ) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6193 Test Plan: Ran some logdevice integration tests that were crashing without this fix. Differential Revision: D19116874 Pulled By: al13n321 fbshipit-source-id: 9672bc4227c1b68f46f7436db2b96811adb8c703	2019-12-17 20:08:56 -08:00
Adam Retter	2d16709487	Small tidy and speed up of the travis build (#6181 ) Summary: Cuts about 30-60 seconds to from each Travis Linux build, and about 15 minutes from each macOS build Pull Request resolved: https://github.com/facebook/rocksdb/pull/6181 Differential Revision: D19098357 Pulled By: pdillinger fbshipit-source-id: 863dd1ab09076ad9b03c2b7914908359628315ae	2019-12-17 13:56:45 -08:00
解轶伦	39fcaf8246	delete superversions in BackgroundCallPurge (#6146 ) Summary: I found that CleanupSuperVersion() may block Get() for 30ms+ （per MemTable is 256MB）. Then I found "delete sv" in ~SuperVersion() takes the time. The backtrace looks like this DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() -> DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion() I think it's better to delete in a background thread, please review it。 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146 Differential Revision: D18972066 fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c	2019-12-17 13:22:57 -08:00
Levi Tamasi	02aa22957a	Set CompactionIterator::valid_ to false when PrepareBlobOutput indicates error Summary: With https://github.com/facebook/rocksdb/pull/6121, errors returned by `PrepareBlobValue` result in `CompactionIterator::status_` being set to `Corruption` or `IOError` as appropriate, however, `valid_` is not set to `false`. The error is eventually propagated in `CompactionJob::ProcessKeyValueCompaction` but only after the main loop completes. Setting `valid_` to `false` upon errors enables us to terminate the loop early and fail the compaction sooner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6170 Test Plan: Ran `make check` and used `db_bench` in BlobDB mode. fbshipit-source-id: a2ca88a3ca71115e2605bd34a4c795d8a28bef27	2019-12-17 10:20:16 -08:00
Yanqin Jin	7678cf2df7	Use Env::LoadEnv to create custom Env objects (#6196 ) Summary: As title. Previous assumption was that the underlying lib can always return a shared_ptr<Env>. This is too strong. Therefore, we use Env::LoadEnv to relax it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6196 Test Plan: make check Differential Revision: D19133199 Pulled By: riversand963 fbshipit-source-id: c83a0c02a42610d077054f2de1acfc45126b3a75	2019-12-16 20:03:14 -08:00
Zhichao Cao	cddd637997	Merge adjacent file block reads in RocksDB MultiGet() and Add uncompressed block to cache (#6089 ) Summary: In the current MultiGet, if the KV-pairs do not belong to the data blocks in the block cache, multiple blocks are read from a SST. It will trigger one block read for each block request and read them in parallel. In some cases, if some data blocks are adjacent in the SST, the reads for these blocks can be combined to a single large read, which can reduce the system calls and reduce the read latency if possible. Considering to fill the block cache, if multiple data blocks are in the same memory buffer, we need to copy them to the heap separately. Therefore, only in the case that 1) data block compression is enabled, and 2) compressed block cache is null, we can do combined read. Otherwise, extra memory copy is needed, which may cause extra overhead. In the current case, data blocks will be uncompressed to a new memory space. Also, in the case that 1) data block compression is enabled, and 2) compressed block cache is null, it is possible the data block is actually not compressed. In the current logic, these data blocks will not be added to the uncompressed_cache. So if memory buffer is shared and the data block is not compressed, the data block are copied to the head and fill the cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6089 Test Plan: Added test case to ParallelIO.MultiGet. Pass make asan_check Differential Revision: D18734668 Pulled By: zhichao-cao fbshipit-source-id: 67c5615ed373e51e42635fd74b36f8f3a66d5da4	2019-12-16 16:26:03 -08:00
Levi Tamasi	db7c687523	Fix a data race related to memtable trimming (#6187 ) Summary: https://github.com/facebook/rocksdb/pull/6177 introduced a data race involving `MemTableList::InstallNewVersion` and `MemTableList::NumFlushed`. The patch fixes this by caching whether the current version has any memtable history (i.e. flushed memtables that are kept around for transaction conflict checking) in an `std::atomic<bool>` member called `current_has_history_`, similarly to how `current_memory_usage_excluding_last_` is handled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6187 Test Plan: ``` make clean COMPILE_WITH_TSAN=1 make db_test -j24 ./db_test ``` Differential Revision: D19084059 Pulled By: ltamasi fbshipit-source-id: 327a5af9700fb7102baea2cc8903c085f69543b9	2019-12-16 13:16:31 -08:00
Levi Tamasi	bd8404feff	Do not schedule memtable trimming if there is no history (#6177 ) Summary: We have observed an increase in CPU load caused by frequent calls to `ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory` when using `max_write_buffer_size_to_maintain` to limit the amount of memtable history maintained for transaction conflict checking. Part of the issue is that trimming can potentially be scheduled even if there is no memtable history. The patch adds a check that fixes this. See also https://github.com/facebook/rocksdb/pull/6169. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6177 Test Plan: Compared `perf` output for ``` ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32 ``` before and after the change. There is a significant reduction for the call chain `rocksdb::DBImpl::TrimMemtableHistory` -> `rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape` even without https://github.com/facebook/rocksdb/pull/6169. Differential Revision: D19057445 Pulled By: ltamasi fbshipit-source-id: dff81882d7b280e17eda7d9b072a2d4882c50f79	2019-12-13 19:11:19 -08:00
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	2019-12-13 14:48:41 -08:00
Peter Dillinger	58d46d1915	Add useful idioms to Random API (OneInOpt, PercentTrue) (#6154 ) Summary: And clean up related code, especially in stress test. (More clean up of db_stress_test_base.cc coming after this.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6154 Test Plan: make check, make blackbox_crash_test for a bit Differential Revision: D18938180 Pulled By: pdillinger fbshipit-source-id: 524d27621b8dbb25f6dff40f1081e7c00630357e	2019-12-13 14:30:14 -08:00

1 2 3 4 5 ...

3961 Commits