9d0cae7104
425 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Andrew Kryczka
|
c5d367f472 |
Revert open logic changes in #9634 (#9906)
Summary: Left HISTORY.md and unit tests. Added a new unit test to repro the corruption scenario that this PR fixes, and HISTORY.md line for that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9906 Reviewed By: riversand963 Differential Revision: D35940093 Pulled By: ajkr fbshipit-source-id: 9816f99e1ce405ba36f316beb4f6378c37c8c86b |
||
RoeyMaor
|
6d2577e567 |
Bugfix/fix manual flush blocking bug (#9893)
Summary: Fix https://github.com/facebook/rocksdb/issues/9892 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9893 Reviewed By: jay-zhuang Differential Revision: D35880959 Pulled By: ajkr fbshipit-source-id: dad1139ad0983cfbd5c5cd6fa6b71022f889735a |
||
Yanqin Jin
|
fe63899d1a |
Add checks to GetUpdatesSince (#9459)
Summary: Make `DB::GetUpdatesSince` return early if told to scan WALs generated by transactions with write-prepared or write-unprepared policies (`seq_per_batch` is true), as indicated by API comment. Also add checks to `TransactionLogIterator` to clarify some conditions. No API change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9459 Test Plan: make check Closing https://github.com/facebook/rocksdb/issues/1565 Reviewed By: akankshamahajan15 Differential Revision: D33821243 Pulled By: riversand963 fbshipit-source-id: c8b155d020ce0980e2d3b3b1da40b96e65b48d79 |
||
Yanqin Jin
|
0bd4dcde6b |
CompactionIterator sees consistent view of which keys are committed (#9830)
Summary: **This PR does not affect the functionality of `DB` and write-committed transactions.** `CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed. As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if it is committed. In fact, the implementation of `KeyCommitted()` is as follows: ``` inline bool KeyCommitted(SequenceNumber seq) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr || snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot; } ``` With that being said, we focus on write-prepared/write-unprepared transactions. A few notes: - A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database. - `CompactionIterator` outputs a key as long as the key is uncommitted. Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`. Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone. To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot to determine whether a key is committed or not with minor change to `KeyCommitted()`. ``` inline bool KeyCommitted(SequenceNumber sequence) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr || snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) == SnapshotCheckerResult::kInSnapshot; } ``` As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble for `CompactionIterator`s assertions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35561162 Pulled By: riversand963 fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f |
||
Jay Zhuang
|
dc1c90c4e3 |
Support canceling running RemoteCompaction on remote side (#9725)
Summary: Add the ability to cancel remote compaction on the remote side by setting `OpenAndCompactOptions.canceled` to true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9725 Test Plan: added unittest Reviewed By: ajkr Differential Revision: D35018800 Pulled By: jay-zhuang fbshipit-source-id: be3652f9645e0347df429e42a5614d5a9b3a1ec4 |
||
Peter Dillinger
|
efd035164b |
Meta-internal folly integration with F14FastMap (#9546)
Summary: Especially after updating to C++17, I don't see a compelling case for *requiring* any folly components in RocksDB. I was able to purge the existing hard dependencies, and it can be quite difficult to strip out non-trivial components from folly for use in RocksDB. (The prospect of doing that on F14 has changed my mind on the best approach here.) But this change creates an optional integration where we can plug in components from folly at compile time, starting here with F14FastMap to replace std::unordered_map when possible (probably no public APIs for example). I have replaced the biggest CPU users of std::unordered_map with compile-time pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set. USE_FOLLY is always set in the Meta-internal buck build, and a simulation of that is in the Makefile for public CI testing. A full folly build is not needed, but checking out the full folly repo is much simpler for getting the dependency, and anything else we might want to optionally integrate in the future. Some picky details: * I don't think the distributed mutex stuff is actually used, so it was easy to remove. * I implemented an alternative to `folly::constexpr_log2` (which is much easier in C++17 than C++11) so that I could pull out the hard dependencies on `ConstexprMath.h` * I had to add noexcept move constructors/operators to some types to make F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a macro to make that easier in some common cases. * Updated Meta-internal buck build to use folly F14Map (always) No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a production integration for open source users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546 Test Plan: CircleCI tests updated so that a couple of them use folly. Most internal unit & stress/crash tests updated to use Meta-internal latest folly. (Note: they should probably use buck but they currently use Makefile.) Example performance improvement: when filter partitions are pinned in cache, they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build a test that exercises that heavily. Build DB with ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters ``` and test with (simultaneous runs with & without folly, ~20 times each to see convergence) ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache ``` Average ops/s no folly: 26229.2 Average ops/s with folly: 26853.3 (+2.4%) Reviewed By: ajkr Differential Revision: D34181736 Pulled By: pdillinger fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94 |
||
Jay Zhuang
|
f934a0af46 |
Add event listener support on remote compactor side (#9821)
Summary: So the user is able to set event listener on the compactor side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9821 Test Plan: unittest added Reviewed By: ajkr Differential Revision: D35485388 Pulled By: jay-zhuang fbshipit-source-id: 669d8a3aaee012b75b940470306756c03ffa09b2 |
||
Akanksha Mahajan
|
ae82d91492 |
Remove corrupted WAL files in kPointRecoveryMode with avoid_flush_duing_recovery set true (#9634)
Summary: 1) In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't flush the data from WAL to L0 for all column families if possible. As a result, not all column families can increase their log_numbers, and min_log_number_to_keep won't change. 2) For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change. If we persist a new MANIFEST with advanced log_numbers for some column families, then during a second crash after persisting the MANIFEST, RocksDB will see some column families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail. As a solution, 1. the corrupted WALs whose numbers are larger than the corrupted wal and smaller than the new WAL will be moved to archive folder. 2. Currently, RocksDB DB::Open() may creates and writes to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful Pull Request resolved: https://github.com/facebook/rocksdb/pull/9634 Test Plan: 1. Added new unit tests 2. make crast_test -j Reviewed By: riversand963 Differential Revision: D34463666 Pulled By: akankshamahajan15 fbshipit-source-id: e233d3af0ed4e2028ca0cf051e5a334a0fdc9d19 |
||
Yanqin Jin
|
0ad9ee30ce |
Remove dead code (#9825)
Summary: Options `preserve_deletes` and `iter_start_seqnum` have been removed since 7.0. This PR removes dead code related to these two removed options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9825 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D35517950 Pulled By: riversand963 fbshipit-source-id: 86282ce5ec4087acb94a06a42a1b6d55b1715482 |
||
Andrew Kryczka
|
04623e7cd4 |
Fix GetMergeOperands() heap-use-after-free on flushed memtable (#9805)
Summary: Fixes https://github.com/facebook/rocksdb/issues/9066. Prior to the fix in this PR, this PR's unit test reported the following error under ASAN: ``` ==2175705==ERROR: AddressSanitizer: heap-use-after-free on address 0x61f0000012a5 at pc 0x7f0fc36e76ce bp 0x7ffc103e9ca0 sp 0x7ffc103e9450 READ of size 5 at 0x61f0000012a5 thread T0 #0 0x7f0fc36e76cd in __interceptor_memcpy /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:790 https://github.com/facebook/rocksdb/issues/1 0x7f0fc35a207e in std::char_traits<char>::copy(char*, char const*, unsigned long) /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/include/bits/char_traits.h:365 https://github.com/facebook/rocksdb/issues/2 0x7f0fc35a207e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy(char*, char const*, unsigned long) /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/include/bits/basic_string.h:351 https://github.com/facebook/rocksdb/issues/3 0x7f0fc35a207e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long) /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/include/bits/basic_string.tcc:440 https://github.com/facebook/rocksdb/issues/4 0x8679ca in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign(char const*, unsigned long) /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.3.0/bits/basic_string.h:1422 https://github.com/facebook/rocksdb/issues/5 0x8679ca in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171 https://github.com/facebook/rocksdb/issues/6 0x8679ca in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1930 https://github.com/facebook/rocksdb/issues/7 0x547324 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203 https://github.com/facebook/rocksdb/issues/8 0x547324 in rocksdb::DBMergeOperandTest_FlushedMergeOperandReadAfterFreeBug_Test::TestBody() db/db_merge_operand_test.cc:117 https://github.com/facebook/rocksdb/issues/9 0x7241da in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899 https://github.com/facebook/rocksdb/issues/10 0x7241da in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935 https://github.com/facebook/rocksdb/issues/11 0x701a47 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3973 https://github.com/facebook/rocksdb/issues/12 0x702040 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3965 https://github.com/facebook/rocksdb/issues/13 0x702040 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149 https://github.com/facebook/rocksdb/issues/14 0x7025f7 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4124 https://github.com/facebook/rocksdb/issues/15 0x7025f7 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4267 https://github.com/facebook/rocksdb/issues/16 0x704217 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4253 https://github.com/facebook/rocksdb/issues/17 0x704217 in testing::internal::UnitTestImpl::RunAllTests() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6633 https://github.com/facebook/rocksdb/issues/18 0x72505a in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899 https://github.com/facebook/rocksdb/issues/19 0x72505a in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935 https://github.com/facebook/rocksdb/issues/20 0x704aa1 in testing::UnitTest::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6242 https://github.com/facebook/rocksdb/issues/21 0x4c4aff in RUN_ALL_TESTS() third-party/gtest-1.8.1/fused-src/gtest/gtest.h:22110 https://github.com/facebook/rocksdb/issues/22 0x4c4aff in main db/db_merge_operand_test.cc:404 https://github.com/facebook/rocksdb/issues/23 0x7f0fc3108dc4 in __libc_start_main ../csu/libc-start.c:308 https://github.com/facebook/rocksdb/issues/24 0x5445fd in _start (/data/users/andrewkr/rocksdb/db_merge_operand_test+0x5445fd) 0x61f0000012a5 is located 1061 bytes inside of 3264-byte region [0x61f000000e80,0x61f000001b40) freed by thread T0 here: #0 0x7f0fc375b6af in operator delete(void*, unsigned long) /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/libsanitizer/asan/asan_new_delete.cc:177 https://github.com/facebook/rocksdb/issues/1 0x743be8 in rocksdb::SuperVersion::~SuperVersion() db/column_family.cc:432 https://github.com/facebook/rocksdb/issues/2 0x8052aa in rocksdb::DBImpl::CleanupSuperVersion(rocksdb::SuperVersion*) db/db_impl/db_impl.cc:3534 https://github.com/facebook/rocksdb/issues/3 0x8676c2 in rocksdb::DBImpl::ReturnAndCleanupSuperVersion(rocksdb::ColumnFamilyData*, rocksdb::SuperVersion*) db/db_impl/db_impl.cc:3544 https://github.com/facebook/rocksdb/issues/4 0x8676c2 in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1911 https://github.com/facebook/rocksdb/issues/5 0x547324 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203 https://github.com/facebook/rocksdb/issues/6 0x547324 in rocksdb::DBMergeOperandTest_FlushedMergeOperandReadAfterFreeBug_Test::TestBody() db/db_merge_operand_test.cc:117 https://github.com/facebook/rocksdb/issues/7 0x7241da in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899 https://github.com/facebook/rocksdb/issues/8 0x7241da in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935 https://github.com/facebook/rocksdb/issues/9 0x701a47 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3973 https://github.com/facebook/rocksdb/issues/10 0x702040 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3965 https://github.com/facebook/rocksdb/issues/11 0x702040 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149 https://github.com/facebook/rocksdb/issues/12 0x7025f7 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4124 https://github.com/facebook/rocksdb/issues/13 0x7025f7 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4267 https://github.com/facebook/rocksdb/issues/14 0x704217 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4253 https://github.com/facebook/rocksdb/issues/15 0x704217 in testing::internal::UnitTestImpl::RunAllTests() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6633 https://github.com/facebook/rocksdb/issues/16 0x72505a in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899 https://github.com/facebook/rocksdb/issues/17 0x72505a in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935 https://github.com/facebook/rocksdb/issues/18 0x704aa1 in testing::UnitTest::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6242 https://github.com/facebook/rocksdb/issues/19 0x4c4aff in RUN_ALL_TESTS() third-party/gtest-1.8.1/fused-src/gtest/gtest.h:22110 https://github.com/facebook/rocksdb/issues/20 0x4c4aff in main db/db_merge_operand_test.cc:404 https://github.com/facebook/rocksdb/issues/21 0x7f0fc3108dc4 in __libc_start_main ../csu/libc-start.c:308 https://github.com/facebook/rocksdb/issues/22 0x5445fd in _start (/data/users/andrewkr/rocksdb/db_merge_operand_test+0x5445fd) ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9805 Test Plan: following the fix in this PR, the new unit test passes Reviewed By: jay-zhuang Differential Revision: D35388415 Pulled By: ajkr fbshipit-source-id: b39c5d002155906c8abc4a3429eca696dbf916d0 |
||
Yanqin Jin
|
1a1c5bda23 |
Disallow commit-time-batch for write-prepared/write-unprepared txn conditionally (#9794)
Summary: For write-prepared/write-unprepared transactions, GetCommitTimeWriteBatch() can be used only if the transaction is started with `TransactionOptions::use_only_the_last_commit_time_batch_for_recovery` set to true. Otherwise, it is possible that multiple uncommitted versions of the same key exist in the database. During bottommost compaction, RocksDB may set the sequence numbers of both to zero once they become committed, causing output SST file to have two identical internal keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9794 Test Plan: make check pay special attention to the following ``` transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/* ``` Reviewed By: lth Differential Revision: D35327214 Pulled By: riversand963 fbshipit-source-id: 3bae00a28359c10e96e4c6f676d20de5610d8a0f |
||
sdong
|
bbcf7b192c |
Fix DB::Open() error logging (#9784)
Summary: Right now we log a wrong error when DB::Open() fails. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9784 Test Plan: CI runs should pass Reviewed By: ajkr, riversand963 Differential Revision: D35290203 fbshipit-source-id: ffc640afa27f6b0a2382ee153dc43f28d9e242be |
||
Yanqin Jin
|
de9df6e818 |
Do not release and re-acquire dbmutex on memtable-switch if no listener (#9758)
Summary: There is no need to release-and-acquire immediately when no listener is registered. This is what we have been doing for `NotifyOnFlushBegin()`, `NotifyOnFlushCompleted()`, `NotifyOnCompactionBegin()`, `NotifyOnCompactionCompleted()`, and some other `NotifyOnXX` methods in event_helpers.cc. Do the same for `NotifyOnMemTableSealed ()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9758 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D35159552 Pulled By: riversand963 fbshipit-source-id: 6e0aac50bd5c8f506d809b6638c33a7a28d1e87f |
||
Peter Dillinger
|
105d7f0c7c |
Document SetOptions API (#9778)
Summary: much needed Some other minor tweaks also Pull Request resolved: https://github.com/facebook/rocksdb/pull/9778 Test Plan: existing tests Reviewed By: ajkr Differential Revision: D35258195 Pulled By: pdillinger fbshipit-source-id: 974ddafc23a540aacceb91da72e81593d818f99c |
||
Peter Dillinger
|
cad809978a |
Fix heap use-after-free race with DropColumnFamily (#9730)
Summary: Although ColumnFamilySet comments say that DB mutex can be freed during iteration, as long as you hold a ref while releasing DB mutex, this is not quite true because UnrefAndTryDelete might delete cfd right before it is needed to get ->next_ for the next iteration of the loop. This change solves the problem by making a wrapper class that makes such iteration easier while handling the tricky details of UnrefAndTryDelete on the previous cfd only after getting next_ in operator++. FreeDeadColumnFamilies should already have been obsolete; this removes it for good. Similarly, ColumnFamilySet::iterator doesn't need to check for cfd with 0 refs, because those are immediately deleted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9730 Test Plan: was reported with ASAN on unit tests like DBLogicalBlockSizeCacheTest.CreateColumnFamily (very rare); keep watching Reviewed By: ltamasi Differential Revision: D35038143 Pulled By: pdillinger fbshipit-source-id: 0a5478d5be96c135343a00603711b7df43ae19c9 |
||
Yanqin Jin
|
e0c84aa0dc |
Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary: There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC. The race condition is between two background flush threads trying to install flush results to the MANIFEST. Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially, both column families have one mutable (active) memtable whose data backed by 6.log. 1. Trigger a manual flush for "cf1", creating a 7.log 2. Insert another key to "default", and trigger flush for "default", creating 8.log 3. BgFlushThread1 finishes writing 9.sst 4. BgFlushThread2 finishes writing 10.sst ``` Time BgFlushThread1 BgFlushThread2 | mutex_.Lock() | precompute min_wal_to_keep as 6 | mutex_.Unlock() | mutex_.Lock() | precompute min_wal_to_keep as 6 | join MANIFEST write queue and mutex_.Unlock() | write to MANIFEST | mutex_.Lock() | cfd1->log_number = 7 | Signal bg_flush_2 and mutex_.Unlock() | wake up and mutex_.Lock() | cfd0->log_number = 8 | FindObsoleteFiles() with job_context->log_number == 7 | mutex_.Unlock() | PurgeObsoleteFiles() deletes 6.log V ``` As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6). Similarly, BgThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6). No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`, due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514. The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e. the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist. If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true. We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know the correct min wal number until the other bg flush threads have finished committing to the manifest and updated the `cfd::log_number`. To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`, and use it to track WAL file deletion in non-2pc mode as well. This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread. `min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715 Test Plan: ``` make check ``` Also ran stress test below (with asan) to make sure it completes successfully. ``` TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \ CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \ make J=52 -j52 blackbox_asan_crash_test ``` Reviewed By: ltamasi Differential Revision: D34984412 Pulled By: riversand963 fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005 |
||
Yanqin Jin
|
29bec740f5 |
Return invalid argument if batch is null (#9744)
Summary: Originally, a corruption will be returned by `DBImpl::WriteImpl(batch...)` if batch is null. This is inaccurate since there is no data corruption. Return `Status::InvalidArgument()` instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9744 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35086268 Pulled By: riversand963 fbshipit-source-id: 677397b007a53bc25210eac0178d49c9797b5951 |
||
duyuqi
|
cbe303c19b |
fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471)
Summary: c api release snapshot will core dump when enable inplace_update_support and create snapshot Pull Request resolved: https://github.com/facebook/rocksdb/pull/9471 Reviewed By: akankshamahajan15 Differential Revision: D34965103 Pulled By: riversand963 fbshipit-source-id: c3aeeb9ea7126c2eda1466102794fecf57b6ab77 |
||
Yanqin Jin
|
b2aacaf923 |
Fix assertion error by doing comparison with mutex (#9717)
Summary: On CircleCI MacOS instances, we have been seeing the following assertion error: ``` Assertion failed: (alive_log_files_tail_ == alive_log_files_.rbegin()), function WriteToWAL, file /Users/distiller/project/db/db_impl/db_impl_write.cc, line 1213. Received signal 6 (Abort trap: 6) #0 0x1 https://github.com/facebook/rocksdb/issues/1 abort (in libsystem_c.dylib) + 120 https://github.com/facebook/rocksdb/issues/2 err (in libsystem_c.dylib) + 0 https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::WriteToWAL(rocksdb::WriteBatch const&, rocksdb::log::Writer*, unsigned long long*, unsigned long long*, rocksdb::Env::IOPriority, bool, bool) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1213) https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long long*, bool, bool, unsigned long long) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1251) https://github.com/facebook/rocksdb/issues/5 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long long*, unsigned long long, bool, unsigned long long*, unsigned long, rocksdb::PreReleaseCallback*) (in librocksdb.7.0.0.dylib) (db_impl_ rite.cc:421) https://github.com/facebook/rocksdb/issues/6 rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:109) https://github.com/facebook/rocksdb/issues/7 rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:2159) https://github.com/facebook/rocksdb/issues/8 rocksdb::DBImpl::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:37) https://github.com/facebook/rocksdb/issues/9 rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db.h:382) https://github.com/facebook/rocksdb/issues/10 rocksdb::DBBasicTestWithTimestampPrefixSeek_IterateWithPrefix_Test::TestBody() (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:2926) https://github.com/facebook/rocksdb/issues/11 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3899) https://github.com/facebook/rocksdb/issues/12 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3935) https://github.com/facebook/rocksdb/issues/13 testing::Test::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:3980) https://github.com/facebook/rocksdb/issues/14 testing::TestInfo::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4153) https://github.com/facebook/rocksdb/issues/15 testing::TestCase::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4266) https://github.com/facebook/rocksdb/issues/16 testing::internal::UnitTestImpl::RunAllTests() (in db_with_timestamp_basic_test) (gtest-all.cc:6632) https://github.com/facebook/rocksdb/issues/17 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3899) https://github.com/facebook/rocksdb/issues/18 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3935) https://github.com/facebook/rocksdb/issues/19 testing::UnitTest::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:6242) https://github.com/facebook/rocksdb/issues/20 RUN_ALL_TESTS() (in db_with_timestamp_basic_test) (gtest.h:22110) https://github.com/facebook/rocksdb/issues/21 main (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:3150) https://github.com/facebook/rocksdb/issues/22 start (in libdyld.dylib) + 1 ``` It's likely caused by concurrent, unprotected access to the deque, even though `back()` is never popped, and we are comparing `rbegin()` with a cached `riterator`. To be safe, do the comparison only if we have mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9717 Test Plan: One example Ssh to one CircleCI MacOS instance. ``` gtest-parallel -r 1000 -w 8 ./db_test --gtest_filter=DBTest.FlushesInParallelWithCompactRange ``` Reviewed By: pdillinger Differential Revision: D34990696 Pulled By: riversand963 fbshipit-source-id: 62dd48ae6fedbda53d0a64d73de9b948b4c26eee |
||
anand76
|
a88d8795ec |
Expand auto recovery to background read errors (#9679)
Summary: Fix and enhance the background error recovery logic to handle the following situations - 1. Background read errors during flush/compaction (previously was resulting in unrecoverable state) 2. Fix auto recovery failure on read/write errors during atomic flush. It was failing due to a bug in setting the resuming_from_bg_err variable in AtomicFlushMemTablesToOutputFiles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9679 Test Plan: Add new unit tests in error_handler_fs_test Reviewed By: riversand963 Differential Revision: D34770097 Pulled By: anand1976 fbshipit-source-id: 136da973a28d684b9c74bdf668519b0cbbbe1742 |
||
Jay Zhuang
|
2c8100e60e |
Fix a race condition when disable and enable manual compaction (#9694)
Summary: In https://github.com/facebook/rocksdb/issues/9659, when `DisableManualCompaction()` is issued, the foreground manual compaction thread does not have to wait background compaction thread to finish. Which could be a problem that the user re-enable manual compaction with `EnableManualCompaction()`, it may re-enable the BG compaction which supposed be cancelled. This patch makes the FG compaction wait on `manual_compaction_state.done`, which either be set by BG compaction or Unschedule callback. Then when FG manual compaction thread returns, it should not have BG compaction running. So shared_ptr is no longer needed for `manual_compaction_state`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9694 Test Plan: a StressTest and unittest Reviewed By: ajkr Differential Revision: D34885472 Pulled By: jay-zhuang fbshipit-source-id: e6476175b43e8c59cd49f5c09241036a0716c274 |
||
Yanqin Jin
|
6a76008369 |
Fix TSAN caused by calling rend() and pop_front() . (#9698)
Summary: PR9686 makes `WriteToWAL()` call `assert(...!=rend())` while not holding db mutex or log mutex. Another thread may concurrently call `pop_front()`, causing race condition. To fix, assert only if mutex is held. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9698 Test Plan: COMPILE_WITH_TSAN=1 make check Reviewed By: jay-zhuang Differential Revision: D34898535 Pulled By: riversand963 fbshipit-source-id: 1ddfa5bf1b6ae8d409cab6ff6e1b5321c6803da9 |
||
Yanqin Jin
|
bbdaf63d0f |
Fix a TSAN-reported bug caused by concurrent accesss to std::deque (#9686)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9686 According to https://www.cplusplus.com/reference/deque/deque/back/, " The container is accessed (neither the const nor the non-const versions modify the container). The last element is potentially accessed or modified by the caller. Concurrently accessing or modifying other elements is safe. " Also according to https://www.cplusplus.com/reference/deque/deque/pop_front/, " The container is modified. The first element is modified. Concurrently accessing or modifying other elements is safe (although see iterator validity above). " In RocksDB, we never pop the last element of `DBImpl::alive_log_files_`. We have been exploiting this fact and the above two properties when ensuring correctness when `DBImpl::alive_log_files_` may be accessed concurrently. Specifically, it can be accessed in the write path when db mutex is released. Sometimes, the log_mute_ is held. It can also be accessed in `FindObsoleteFiles()` when db mutex is always held. It can also be accessed during recovery when db mutex is also held. Given the fact that we never pop the last element of alive_log_files_, we currently do not acquire additional locks when accessing it in `WriteToWAL()` as follows ``` alive_log_files_.back().AddSize(log_entry.size()); ``` This is problematic. Check source code of deque.h ``` back() _GLIBCXX_NOEXCEPT { __glibcxx_requires_nonempty(); ... } pop_front() _GLIBCXX_NOEXCEPT { ... if (this->_M_impl._M_start._M_cur != this->_M_impl._M_start._M_last - 1) { ... ++this->_M_impl._M_start._M_cur; } ... } ``` `back()` will actually call `__glibcxx_requires_nonempty()` first. If `__glibcxx_requires_nonempty()` is enabled and not an empty macro, it will call `empty()` ``` bool empty() { return this->_M_impl._M_finish == this->_M_impl._M_start; } ``` You can see that it will access `this->_M_impl._M_start`, racing with `pop_front()`. Therefore, TSAN will actually catch the bug in this case. To be able to use TSAN on our library and unit tests, we should always coordinate concurrent accesses to STL containers properly. We need to pass information about db mutex and log mutex into `WriteToWAL()`, otherwise it's impossible to know which mutex to acquire inside the function. To fix this, we can catch the tail of `alive_log_files_` by reference, so that we do not have to call `back()` in `WriteToWAL()`. Reviewed By: pdillinger Differential Revision: D34780309 fbshipit-source-id: 1def9821f0c437f2736c6a26445d75890377889b |
||
Jay Zhuang
|
4dff279b19 |
DisableManualCompaction may fail to cancel an unscheduled task (#9659)
Summary: https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition which was waiting for the background thread to clean-up the compaction. make sure we only unschedule the task when it's scheduled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659 Reviewed By: ajkr Differential Revision: D34651820 Pulled By: jay-zhuang fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f |
||
Jay Zhuang
|
09b0e8f2c7 |
Fix a timer crash caused by invalid memory management (#9656)
Summary: Timer crash when multiple DB instances doing heavy DB open and close operations concurrently. Which is caused by adding a timer task with smaller timestamp than the current running task. Fix it by moving the getting new task timestamp part within timer mutex protection. And other fixes: - Disallow adding duplicated function name to timer - Fix a minor memory leak in timer when a running task is cancelled Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656 Reviewed By: ajkr Differential Revision: D34626296 Pulled By: jay-zhuang fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b |
||
slk
|
95305c44a1 |
Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410)
Summary: As disscussed in (https://github.com/facebook/rocksdb/issues/9223), Here added a new API named DB::OpenAndTrimHistory, this API will open DB and trim data to the timestamp specofied by **trim_ts** (The data with newer timestamp than specified trim bound will be removed). This API should only be used at a timestamp-enabled db instance recovery. And this PR implemented a new iterator named HistoryTrimmingIterator to support trimming history with a new API named DB::OpenAndTrimHistory. HistoryTrimmingIterator wrapped around the underlying InternalITerator such that keys whose timestamps newer than **trim_ts** should not be returned to the compaction iterator while **trim_ts** is not null. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410 Reviewed By: ltamasi Differential Revision: D34410207 Pulled By: riversand963 fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236 |
||
Yanqin Jin
|
3b6dc049f7 |
Support user-defined timestamps in write-committed txns (#9629)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629 Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are locked upon first operation that writes the key or has the intention of writing. For example, `PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`, `PessimisticTransaction::SingleDelete()` will write to or delete a key, while `PessimisticTransaction::GetForUpdate()` is used by application to indicate to RocksDB that the transaction has the intention of performing write operation later in the same transaction. Pessimistic transactions support two-phase commit (2PC). A transaction can be `Prepared()`'ed and then `Commit()`. The prepare phase is similar to a promise: once `Prepare()` succeeds, the transaction has acquired the necessary resources to commit. The resources include locks, persistence of WAL, etc. Write-committed transaction is the default pessimistic transaction implementation. In RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare section. `Commit()` will write a commit marker to the WAL and then write data to the memtables. While writing to the memtables, different keys in the transaction's write batch will be assigned different sequence numbers in ascending order. Until commit/rollback, the transaction holds locks on the keys so that no other transaction can write to the same keys. Furthermore, the keys' sequence numbers represent the order in which they are committed and should be made visible. This is convenient for us to implement support for user-defined timestamps. Since column families with and without timestamps can co-exist in the same database, a transaction may or may not involve timestamps. Based on this observation, we add two optional members to each `PessimisticTransaction`, `read_timestamp_` and `commit_timestamp_`. If no key in the transaction's write batch has timestamp, then setting these two variables do not have any effect. For the rest of this commit, we discuss only the cases when these two variables are meaningful. read_timestamp_ is used mainly for validation, and should be set before first call to `GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls `TryLock()` that can verify if another transaction has written the same key since `read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed written the same key, then validation fails, and RocksDB allows this transaction to refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()` with a different timestamp to read, but the result of the read should not be used to determine data that will be written later. commit_timestamp_ must be set after finishing writing and before transaction commit. This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after prepare phase succeeds. We currently require that the commit timestamp be chosen after all keys are locked. This means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`, `PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`, etc. will return non-ok status because they specify timestamps before locking the keys. Users are also prompted to use the `Transaction` APIs when they receive the non-ok status. Reviewed By: ltamasi Differential Revision: D31822445 fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43 |
||
Hui Xiao
|
ca0ef54f16 |
Rate-limit automatic WAL flush after each user write (#9607)
Summary: **Context:** WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR is to provide rate-limiting to auto WAL flush, the one that automatically happen after each user write operation (i.e, `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`. Note that we are NOT rate-limiting WAL flush that do NOT automatically happen after each user write, such as `Options::manual_wal_flush == true + manual FlushWAL()` (rate-limiting multiple WAL flushes), for the benefits of: - being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515) - being able to turn off some WAL flush's rate-limiting but not all (e.g, turn off specific the WAL flush of a critical user write like a service's heartbeat) `WriteOptions::rate_limiter_options` only accept `Env::IO_USER` and `Env::IO_TOTAL` currently due to an implementation constraint. - The constraint is that we currently queue parallel writes (including WAL writes) based on FIFO policy which does not factor rate limiter priority into this layer's scheduling. If we allow lower priorities such as `Env::IO_HIGH/MID/LOW` and such writes specified with lower priorities occurs before ones specified with higher priorities (even just by a tiny bit in arrival time), the former would have blocked the latter, leading to a "priority inversion" issue and contradictory to what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` right now before improving that scheduling. A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR. **Summary:** - Renamed test suite `DBRateLimiterTest to DBRateLimiterOnReadTest` for adding a new test suite - Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions - Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607 Test Plan: - Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`). - Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush` - `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100` - verified that WAL flush(i.e, system-call _write_) were chunked into 15 bytes and each _write_ was roughly 1 second apart - verified the chunking disappeared when `-rate_limit_auto_wal_flush=0` - crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0 --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal **Benchmarked on flush/compaction to ensure no performance regression:** - compaction with rate-limiting (see table 1, avg over 1280-run): pre-change: **915635 micros/op**; post-change: **907350 micros/op (improved by 0.106%)** ``` #!/bin/bash TEST_TMPDIR=/dev/shm/testdb START=1 NUM_DATA_ENTRY=8 N=10 rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt for i in $(eval echo "{$START..$NUM_DATA_ENTRY}") do NUM_RUN=$(($N*(2**($i-1)))) for j in $(eval echo "{$START..$NUM_RUN}") do ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact' done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt done ``` - compaction w/o rate-limiting (see table 2, avg over 640-run): pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)** ``` Same as above script, except that -rate_limiter_bytes_per_sec=0 ``` - flush with rate-limiting (see table 3, avg over 320-run, run on the [patch]( |
||
Ezgi Çiçek
|
27d6ef8e60 |
Rename mutable_cf_options to signify explicity copy (#9666)
Summary: Signify explicit copy with comment and better name for variable `mutable_cf_options` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9666 Reviewed By: riversand963 Differential Revision: D34680934 Pulled By: ezgicicek fbshipit-source-id: b64ef18725fe523835d14ceb4b29bcdfe493f8ed |
||
Jay Zhuang
|
36aec94d85 |
compression_per_level should be used for flush and changeable (#9658)
Summary: - Make `compression_per_level` dynamical changeable with `SetOptions`; - Fix a bug that `compression_per_level` is not used for flush; Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658 Test Plan: CI Reviewed By: ajkr Differential Revision: D34700749 Pulled By: jay-zhuang fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c |
||
Yanqin Jin
|
659a16d52b |
Fix bug causing incorrect data returned by snapshot read (#9648)
Summary: This bug affects use cases that meet the following conditions - (has only the default column family or disables WAL) and - has at least one event listener - atomic flush is NOT affected. If the above conditions meet, then RocksDB can release the db mutex before picking all the existing memtables to flush. In the meantime, a snapshot can be created and db's sequence number can still be incremented. The upcoming flush will ignore this snapshot. A later read using this snapshot can return incorrect result. To fix this issue, we call the listeners callbacks after picking the memtables so that we avoid creating snapshots during this interval. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9648 Test Plan: make check Reviewed By: ajkr Differential Revision: D34555456 Pulled By: riversand963 fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee |
||
Jay Zhuang
|
db8647969d |
Unschedule manual compaction from thread-pool queue (#9625)
Summary: PR https://github.com/facebook/rocksdb/issues/9557 introduced a race condition between manual compaction foreground thread and background compaction thread. This PR adds the ability to really unschedule manual compaction from thread-pool queue by differentiate tag name for manual compaction and other tasks. Also fix an issue that db `close()` didn't cancel the manual compaction thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9625 Test Plan: unittest not hang Reviewed By: ajkr Differential Revision: D34410811 Pulled By: jay-zhuang fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c |
||
Andrew Kryczka
|
9983eecdfb |
Dedicate cacheline for DB mutex (#9637)
Summary: We found a case of cacheline bouncing due to writers locking/unlocking `mutex_` and readers accessing `block_cache_tracer_`. We discovered it only after the issue was fixed by https://github.com/facebook/rocksdb/issues/9462 shifting the `DBImpl` members such that `mutex_` and `block_cache_tracer_` were naturally placed in separate cachelines in our regression testing setup. This PR forces the cacheline alignment of `mutex_` so we don't accidentally reintroduce the problem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9637 Reviewed By: riversand963 Differential Revision: D34502233 Pulled By: ajkr fbshipit-source-id: 46aa313b7fe83e80c3de254e332b6fb242434c07 |
||
Hui Xiao
|
87a8b3c8af |
Deflake DBErrorHandlingFSTest.MultiCFWALWriteError (#9496)
Summary: **Context:** As part of https://github.com/facebook/rocksdb/pull/6949, file deletion is disabled for faulty database on the IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed]( |
||
Siddhartha Roychowdhury
|
39b0d92153 |
Add record to set WAL compression type if enabled (#9556)
Summary: When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs. Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556 Reviewed By: anand1976 Differential Revision: D34308216 Pulled By: sidroyc fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705 |
||
Andrew Kryczka
|
babe56ddba |
Add rate limiter priority to ReadOptions (#9424)
Summary: Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working. `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`. There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads). The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424 Test Plan: - new unit tests - new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart. - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true` - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true` - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10` Reviewed By: hx235 Differential Revision: D33747386 Pulled By: ajkr fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c |
||
Yanqin Jin
|
1cda273dc3 |
Fix a silent data loss for write-committed txn (#9571)
Summary: The following sequence of events can cause silent data loss for write-committed transactions. ``` Time thread 1 bg flush | db->Put("a") | txn = NewTxn() | txn->Put("b", "v") | txn->Prepare() // writes only to 5.log | db->SwitchMemtable() // memtable 1 has "a" | // close 5.log, | // creates 8.log | trigger flush | pick memtable 1 | unlock db mutex | write new sst | txn->ctwb->Put("gtid", "1") // writes 8.log | txn->Commit() // writes to 8.log | // writes to memtable 2 | compute min_log_number_to_keep_2pc, this | will be 8 (incorrect). | | Purge obsolete wals, including 5.log | V ``` At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"}, only the commit marker of txn is in 8.log The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`. In the above example, when bg flush thread tries to find obsolete wals, it uses the information computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()` depends on three components - `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit. - `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed. - `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data. The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log. We should return this information via `FindMinPrepLogReferencedByMemTable()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571 Test Plan: ``` ./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/* make check ``` Reviewed By: siying Differential Revision: D34235236 Pulled By: riversand963 fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1 |
||
Jay Zhuang
|
a0c569ee1d |
Cancel manual compaction in thread-pool queue (#9557)
Summary: Fix `DisableManualCompaction()` has to wait scheduled manual compaction to start the execution to cancel the job. When a manual compaction in thread-pool queue is cancel, set the job is_canceled to true and clean the resource. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9557 Test Plan: added unittest that will hang without the change Reviewed By: ajkr Differential Revision: D34214910 Pulled By: jay-zhuang fbshipit-source-id: 89dbaee78ddf26eb13ce862c2b15f4a098b36a78 |
||
Hui Xiao
|
443d8ef094 |
Fix PinSelf() read-after-free in DB::GetMergeOperands() (#9507)
Summary: **Context:** Running the new test `DBMergeOperandTest.MergeOperandReadAfterFreeBug` prior to this fix surfaces the read-after-free bug of PinSef() as below: ``` READ of size 8 at 0x60400002529d thread T0 https://github.com/facebook/rocksdb/issues/5 0x7f199a in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171 https://github.com/facebook/rocksdb/issues/6 0x7f199a in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1919 https://github.com/facebook/rocksdb/issues/7 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203 freed by thread T0 here: https://github.com/facebook/rocksdb/issues/3 0x1191399 in rocksdb::cache_entry_roles_detail::RegisteredDeleter<rocksdb::Block, (rocksdb::CacheEntryRole)0>::Delete(rocksdb::Slice const&, void*) cache/cache_entry_roles.h:99 https://github.com/facebook/rocksdb/issues/4 0x719348 in rocksdb::LRUHandle::Free() cache/lru_cache.h:205 https://github.com/facebook/rocksdb/issues/5 0x71047f in rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*, bool) cache/lru_cache.cc:547 https://github.com/facebook/rocksdb/issues/6 0xa78f0a in rocksdb::Cleanable::DoCleanup() include/rocksdb/cleanable.h:60 https://github.com/facebook/rocksdb/issues/7 0xa78f0a in rocksdb::Cleanable::Reset() include/rocksdb/cleanable.h:38 https://github.com/facebook/rocksdb/issues/8 0xa78f0a in rocksdb::PinnedIteratorsManager::ReleasePinnedData() db/pinned_iterators_manager.h:71 https://github.com/facebook/rocksdb/issues/9 0xd0c21b in rocksdb::PinnedIteratorsManager::~PinnedIteratorsManager() db/pinned_iterators_manager.h:24 https://github.com/facebook/rocksdb/issues/10 0xd0c21b in rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) db/pinned_iterators_manager.h:22 https://github.com/facebook/rocksdb/issues/11 0x7f0fdf in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1886 https://github.com/facebook/rocksdb/issues/12 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203 previously allocated by thread T0 here: https://github.com/facebook/rocksdb/issues/1 0x1239896 in rocksdb::AllocateBlock(unsigned long, **rocksdb::MemoryAllocator*)** memory/memory_allocator.h:35 https://github.com/facebook/rocksdb/issues/2 0x1239896 in rocksdb::BlockFetcher::CopyBufferToHeapBuf() table/block_fetcher.cc:171 https://github.com/facebook/rocksdb/issues/3 0x1239896 in rocksdb::BlockFetcher::GetBlockContents() table/block_fetcher.cc:206 https://github.com/facebook/rocksdb/issues/4 0x122eae5 in rocksdb::BlockFetcher::ReadBlockContents() table/block_fetcher.cc:325 https://github.com/facebook/rocksdb/issues/5 0x11b1f45 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const table/block_based/block_based_table_reader.cc:1503 ``` Here is the analysis: - We have [PinnedIteratorsManager](https://github.com/facebook/rocksdb/blob/6.28.fb/db/version_set.cc#L1980) with `Cleanable` capability in our `Version::Get()` path. It's responsible for managing the life-time of pinned iterator and invoking registered cleanup functions during its own destruction. - For example in case above, the merge operands's clean-up gets associated with this manger in [GetContext::push_operand](https://github.com/facebook/rocksdb/blob/6.28.fb/table/get_context.cc#L405). During PinnedIteratorsManager's [destruction](https://github.com/facebook/rocksdb/blob/6.28.fb/db/pinned_iterators_manager.h#L67), the release function associated with those merge operand data is invoked. **And that's what we see in "freed by thread T955 here" in ASAN.** - Bug 🐛: `PinnedIteratorsManager` is local to `Version::Get()` while the data of merge operands need to outlive `Version::Get` and stay till they get [PinSelf()](https://github.com/facebook/rocksdb/blob/6.28.fb/db/db_impl/db_impl.cc#L1905), **which is the read-after-free in ASAN.** - This bug is likely to be an overlook of `PinnedIteratorsManager` when developing the API `DB::GetMergeOperands` cuz the current logic works fine with the existing case of getting the *merged value* where the operands do not need to live that long. - This bug was not surfaced much (even in its unit test) due to the release function associated with the merge operands (which are actually blocks put in cache as you can see in `BlockBasedTable::MaybeReadBlockAndLoadToCache` **in "previously allocated by" in ASAN report**) is a cache entry deleter. The deleter will call `Cache::Release()` which, for LRU cache, won't immediately deallocate the block based on LRU policy [unless the cache is full or being instructed to force erase](https://github.com/facebook/rocksdb/blob/6.28.fb/cache/lru_cache.cc#L521-L531) - `DBMergeOperandTest.MergeOperandReadAfterFreeBug` makes the cache extremely small to force cache full. **Summary:** - Fix the bug by align `PinnedIteratorsManager`'s lifetime with the merge operands Pull Request resolved: https://github.com/facebook/rocksdb/pull/9507 Test Plan: - New test `DBMergeOperandTest.MergeOperandReadAfterFreeBug` - db bench on read path - Setup (LSM tree with several levels, cache the whole db to avoid read IO, warm cache with readseq to avoid read IO): `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1``TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="readrandom" -num=1000000 -cache_size=100000000 ` - Actual command run (run 20-run for 20 times and then average the 20-run's average micros/op) - `for j in {1..20}; do (for i in {1..20}; do rm -rf /dev/shm/rocksdb/ && TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq,readrandom" -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1 | egrep 'readrandom'; done > rr_output_pre.txt && (awk '{sum+=$3; sum_sqrt+=$3^2}END{print sum/20, sqrt(sum_sqrt/20-(sum/20)^2)}' rr_output_pre.txt) >> rr_output_pre_2.txt); done` - **Result: Pre-change: 3.79193 micros/op; Post-change: 3.79528 micros/op (+0.09%)** (pre-change)sorted avg micros/op of each 20-run | std of micros/op of each 20-run | (post-change) sorted avg micros/op of each 20-run | std of micros/op of each 20-run -- | -- | -- | -- 3.58355 | 0.265209 | 3.48715 | 0.382076 3.58845 | 0.519927 | 3.5832 | 0.382726 3.66415 | 0.452097 | 3.677 | 0.563831 3.68495 | 0.430897 | 3.68405 | 0.495355 3.70295 | 0.482893 | 3.68465 | 0.431438 3.719 | 0.463806 | 3.71945 | 0.457157 3.7393 | 0.453423 | 3.72795 | 0.538604 3.7806 | 0.527613 | 3.75075 | 0.444509 3.7817 | 0.426704 | 3.7683 | 0.468065 3.809 | 0.381033 | 3.8086 | 0.557378 3.80985 | 0.466011 | 3.81805 | 0.524833 3.8165 | 0.500351 | 3.83405 | 0.529339 3.8479 | 0.430326 | 3.86285 | 0.44831 3.85125 | 0.434108 | 3.8717 | 0.544098 3.8556 | 0.524602 | 3.895 | 0.411679 3.8656 | 0.476383 | 3.90965 | 0.566636 3.8911 | 0.488477 | 3.92735 | 0.608038 3.898 | 0.493978 | 3.9439 | 0.524511 3.97235 | 0.515008 | 3.9623 | 0.477416 3.9768 | 0.519993 | 3.98965 | 0.521481 - CI Reviewed By: ajkr Differential Revision: D34030519 Pulled By: hx235 fbshipit-source-id: a99ac585c11704c5ed93af033cb29ba0a7b16ae8 |
||
Ezgi Çiçek
|
95d9cb8357 |
Avoid unnecessary copy of sample_slice map (#9551)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9551 Reviewed By: riversand963 Differential Revision: D34169574 Pulled By: ezgicicek fbshipit-source-id: 2e88db59b65bda269917a9b0bed17181a4afd281 |
||
Hui Xiao
|
c5cd31c12b |
Fix TSAN data race in EventListenerTest.MultiCF (#9528)
Summary: **Context:** `EventListenerTest.MultiCF` occasionally failed on TSAN data race as below: ``` WARNING: ThreadSanitizer: data race (pid=2047633) Read of size 8 at 0x7b6000001440 by main thread: #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:916:40 (listener_test+0x52337c) https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:384:7 (listener_test+0x52337c) Previous write of size 8 at 0x7b6000001440 by thread T2: #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB**, std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> > >, rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/vector.tcc:503:31 (listener_test+0x550654) https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:1195:4 (listener_test+0x550654) https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18 (listener_test+0x550654) ``` After investigation, it is due to the following: (1) `ASSERT_OK(Flush(i));` before the read `std::vector::size()` is supposed to be [blocked on `DB::Impl::bg_cv_` for memtable flush to finish]( |
||
Levi Tamasi
|
320d9a8e8a |
Use a sorted vector instead of a map to store blob file metadata (#9526)
Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc |
||
Akanksha Mahajan
|
9745c68eb1 |
Remove deprecated option new_table_reader_for_compaction_inputs (#9443)
Summary: In RocksDB option new_table_reader_for_compaction_inputs has not effect on Compaction or on the behavior of RocksDB library. Therefore, we are removing it in the upcoming 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443 Test Plan: CircleCI Reviewed By: ajkr Differential Revision: D33788508 Pulled By: akankshamahajan15 fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d |
||
Yanqin Jin
|
629e3e1d77 |
Fix spelling in public API (#9490)
Summary: I feel it would be nice if we can fix this spelling error. In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490 Test Plan: make check Reviewed By: hx235 Differential Revision: D33949862 Pulled By: riversand963 fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466 |
||
Yanqin Jin
|
3122cb4358 |
Revise APIs related to user-defined timestamp (#8946)
Summary: ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`. Namely, `WriteOptions` should not include information about "what-to-write", but should just include information about "how-to-write". According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore, this PR removes `WriteOptions::timestamp` for compliance. After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and `SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity made me reconsider doing it in another PR (maybe). For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take extra `timestamp` information when writing to `WriteBatch`es. These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list. Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to `WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps allocated already and multiple timestamps can be updated. The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp size of the default column family. This will be used to allocate space when calling APIs that do not specify a column family handle. Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing some assertions about timestamp to returning Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946 Test Plan: make check ./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8 ./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0 Make sure there is no perf regression by running the following ``` ./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom ``` Before this PR ``` DB path: [/dev/shm/rocksdb] fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s ``` After this PR ``` DB path: [/dev/shm/rocksdb] fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s ``` Reviewed By: ltamasi Differential Revision: D33721359 Pulled By: riversand963 fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09 |
||
Jay Zhuang
|
980b9ff385 |
Add more micro-benchmark tests (#9436)
Summary: * Add more micro-benchmark tests * Expose an API in DBImpl for waiting for compactions (still not visible to the user) * Add argument name for ribbon_bench * remove benchmark run from CI, as it runs too long. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9436 Test Plan: CI Reviewed By: riversand963 Differential Revision: D33777836 Pulled By: jay-zhuang fbshipit-source-id: c05de3bc082cc05b5d019f00b324e774bf4bbd96 |
||
Yanqin Jin
|
d10c5c08d3 |
Remove iter_start_seqnum and preserve_deletes (#9430)
Summary: According to https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl.cc#L2896:L2911 and https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl_open.cc#L203:L208, we are going to remove `iter_start_seqnum` and `preserve_deletes` starting from RocksDB 7.0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9430 Test Plan: make check and CI Reviewed By: ajkr Differential Revision: D33753639 Pulled By: riversand963 fbshipit-source-id: c80aab8e8d8fc33e52472fed524ed703d0ffc8b6 |
||
Yanqin Jin
|
dd203ed604 |
Disallow a combination of options (#9348)
Summary: Disallow `immutable_db_opts.use_direct_io_for_flush_and_compaction == true` and `mutable_db_opts.writable_file_max_buffer_size == 0`, since it causes `WritableFileWriter::Append()` to loop forever and does not make much sense in direct IO. This combination of options itself does not make much sense: asking RocksDB to do direct IO but not allowing RocksDB to allocate a buffer. We should detect this false combination and warn user early, no matter whether the application is running on a platform that supports direct IO or not. In the case of platform **not** supporting direct IO, it's ok if the user learns about this and then finds that direct IO is not supported. One tricky thing: the constructor of `WritableFileWriter` is being used in our unit tests, and it's impossible to return status code from constructor. Since we do not throw, I put an assertion for now. Fortunately, the constructor is not exposed to external applications. Closing https://github.com/facebook/rocksdb/issues/7109 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9348 Test Plan: make check Reviewed By: ajkr Differential Revision: D33371924 Pulled By: riversand963 fbshipit-source-id: 2a3701ab541cee23bffda8a36cdf37b2d235edfa |
||
Siddhartha Roychowdhury
|
c27ca23644 |
Add option for WAL compression algorithm (#9432)
Summary: Add an option to set the WAL compression algorithm - wal_compression. TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432 Reviewed By: pdillinger Differential Revision: D33797275 Pulled By: sidroyc fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7 |
||
Peter Dillinger
|
8064a3ac31 |
Fix flaky EventListenerTest.DisableBGCompaction (#9400)
Summary: Wasn't able to easily reproduce error, but easy to see a race condition between TestFlushListener::OnFlushCompleted and DBTestBase::Close(), which frees CF handles before closing DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9400 Test Plan: CI etc. Reviewed By: riversand963 Differential Revision: D33645134 Pulled By: pdillinger fbshipit-source-id: d0ec914cc43c9e14f53da633876b95b61995138d |