4817 Commits

Author SHA1 Message Date
Hui Xiao
49623f9c8e Account memory of big memory users in BlockBasedTable in global memory limit (#9748)
Summary:
**Context:**
Through heap profiling, we discovered that `BlockBasedTableReader` objects can accumulate and lead to high memory usage (e.g., with `max_open_files = -1`). This memory is currently not saved, not tracked, not constrained, and not cache-evictable. As a first step to improve this, similar to https://github.com/facebook/rocksdb/pull/8428, this PR tracks an estimate of each `BlockBasedTableReader` object's memory in block cache and fails future creation if the memory usage exceeds the available space of cache at the time of creation.

**Summary:**
- Approximate the memory usage of big memory users (`BlockBasedTable::Rep` and `TableProperties`) in addition to the existing estimated ones (filter block/index block/un-compression dictionary)
- Charge all of these memory usages to block cache on `BlockBasedTable::Open()` and release them on `~BlockBasedTable()` as there is no memory usage fluctuation of concern in between
- Refactor `CacheReservationManager` (and its call sites) to add concurrency support for `BlockBasedTable`, as used in this PR.
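
A minimal sketch of the charge-on-open, release-on-destruction pattern listed above. `ToyCache` and `ToyTableReader` are hypothetical stand-ins, not RocksDB's `Cache`, `CacheReservationManager`, or `BlockBasedTable`; the mutex only illustrates why concurrent table opens need a thread-safe reservation path.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a block cache with a fixed capacity.
class ToyCache {
 public:
  explicit ToyCache(std::size_t capacity) : capacity_(capacity) {}

  // Tries to charge `bytes`; fails if the request exceeds the space
  // available at the time of the call.
  bool Reserve(std::size_t bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    if (bytes > capacity_ - used_) return false;
    used_ += bytes;
    return true;
  }

  void Release(std::size_t bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    used_ -= bytes;
  }

  std::size_t used() const {
    std::lock_guard<std::mutex> lock(mu_);
    return used_;
  }

 private:
  mutable std::mutex mu_;
  const std::size_t capacity_;
  std::size_t used_ = 0;
};

// Hypothetical table reader: its estimated memory is charged once in Open()
// and released in the destructor, with no adjustments in between.
class ToyTableReader {
 public:
  static std::unique_ptr<ToyTableReader> Open(ToyCache* cache,
                                              std::size_t approx_bytes) {
    if (!cache->Reserve(approx_bytes)) return nullptr;  // creation fails
    return std::unique_ptr<ToyTableReader>(
        new ToyTableReader(cache, approx_bytes));
  }

  ~ToyTableReader() { cache_->Release(bytes_); }

 private:
  ToyTableReader(ToyCache* cache, std::size_t bytes)
      : cache_(cache), bytes_(bytes) {}

  ToyCache* cache_;
  std::size_t bytes_;
};
```

With a 100-byte `ToyCache`, a first `Open` of 60 bytes succeeds and a second one fails until the first reader is destroyed.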

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9748

Test Plan:
- New unit tests
- db bench: `OpenDb` : **-0.52% in ms**
  - Setup `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -write_buffer_size=1048576`
  - Repeated run with pre-change w/o feature and post-change with feature, benchmark `OpenDb`:  `./db_bench -benchmarks=readrandom -use_existing_db=1 -db=/dev/shm/testdb -reserve_table_reader_memory=true (remove this when running w/o feature) -file_opening_threads=3 -open_files=-1 -report_open_timing=true| egrep 'OpenDb:'`

#-run | (feature-off) avg milliseconds | std milliseconds | (feature-on) avg milliseconds | std milliseconds | change (%)
-- | -- | -- | -- | -- | --
10 | 11.4018 | 5.95173 | 9.47788 | 1.57538 | -16.87382694
20 | 9.23746 | 0.841053 | 9.32377 | 1.14074 | 0.9343477536
40 | 9.0876 | 0.671129 | 9.35053 | 1.11713 | 2.893283155
80 | 9.72514 | 2.28459 | 9.52013 | 1.0894 | -2.108041632
160 | 9.74677 | 0.991234 | 9.84743 | 1.73396 | 1.032752389
320 | 10.7297 | 5.11555 | 10.547 | 1.97692 | **-1.70275031**
640 | 11.7092 | 2.36565 | 11.7869 | 2.69377 | **0.6635807741**

-  db bench on write with cost to cache in WriteBufferManager (just in case this PR's CRM refactoring accidentally slows down anything in WBM) : `fillseq` : **+0.54% in micros/op**
`./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -cost_write_buffer_to_cache=true -write_buffer_size=10000000000 | egrep 'fillseq'`

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  avg micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 6.15 | 0.260187 | 6.289 | 0.371192 | 2.260162602
20 | 7.28025 | 0.465402 | 7.37255 | 0.451256 | 1.267813605
40 | 7.06312 | 0.490654 | 7.13803 | 0.478676 | **1.060579461**
80 | 7.14035 | 0.972831 | 7.14196 | 0.92971 | **0.02254791432**

-  filter bench: `bloom filter`: **-0.78% in ns/key**
    - ` ./filter_bench -impl=2 -quick -reserve_table_builder_memory=true | grep 'Build avg'`

#-run | (pre-PR) avg ns/key | std ns/key | (post-PR) avg ns/key | std ns/key | change (%)
-- | -- | -- | -- | -- | --
10 | 26.4369 | 0.442182 | 26.3273 | 0.422919 | **-0.4145720565**
20 | 26.4451 | 0.592787 | 26.1419 | 0.62451 | **-1.1465262**

- Crash test `python3 tools/db_crashtest.py blackbox --reserve_table_reader_memory=1 --cache_size=1` killed as normal
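
The `change (%)` columns in the tables above are consistent with the relative difference between the two averages in each row. A small sketch, with `PercentChange` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Relative change of `after` with respect to `before`, in percent.
double PercentChange(double before, double after) {
  return (after - before) / before * 100.0;
}
```

For the 10-run `OpenDb` row, `PercentChange(11.4018, 9.47788)` gives roughly -16.87, in line with the table.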

Reviewed By: ajkr

Differential Revision: D35136549

Pulled By: hx235

fbshipit-source-id: 146978858d0f900f43f4eb09bfd3e83195e3be28
2022-04-06 10:33:00 -07:00
Andrew Kryczka
04623e7cd4 Fix GetMergeOperands() heap-use-after-free on flushed memtable (#9805)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/9066.

Prior to the fix in this PR, this PR's unit test reported the following error under ASAN:

```
==2175705==ERROR: AddressSanitizer: heap-use-after-free on address 0x61f0000012a5 at pc 0x7f0fc36e76ce bp 0x7ffc103e9ca0 sp 0x7ffc103e9450
READ of size 5 at 0x61f0000012a5 thread T0
    #0 0x7f0fc36e76cd in __interceptor_memcpy /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:790
    #1 0x7f0fc35a207e in std::char_traits<char>::copy(char*, char const*, unsigned long) /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/include/bits/char_traits.h:365
    #2 0x7f0fc35a207e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy(char*, char const*, unsigned long) /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/include/bits/basic_string.h:351
    #3 0x7f0fc35a207e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long) /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/include/bits/basic_string.tcc:440
    #4 0x8679ca in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign(char const*, unsigned long) /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.3.0/bits/basic_string.h:1422
    #5 0x8679ca in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171
    #6 0x8679ca in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1930
    #7 0x547324 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203
    #8 0x547324 in rocksdb::DBMergeOperandTest_FlushedMergeOperandReadAfterFreeBug_Test::TestBody() db/db_merge_operand_test.cc:117
    #9 0x7241da in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899
    #10 0x7241da in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
    #11 0x701a47 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3973
    #12 0x702040 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3965
    #13 0x702040 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149
    #14 0x7025f7 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4124
    #15 0x7025f7 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4267
    #16 0x704217 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4253
    #17 0x704217 in testing::internal::UnitTestImpl::RunAllTests() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6633
    #18 0x72505a in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899
    #19 0x72505a in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
    #20 0x704aa1 in testing::UnitTest::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6242
    #21 0x4c4aff in RUN_ALL_TESTS() third-party/gtest-1.8.1/fused-src/gtest/gtest.h:22110
    #22 0x4c4aff in main db/db_merge_operand_test.cc:404
    #23 0x7f0fc3108dc4 in __libc_start_main ../csu/libc-start.c:308
    #24 0x5445fd in _start (/data/users/andrewkr/rocksdb/db_merge_operand_test+0x5445fd)

0x61f0000012a5 is located 1061 bytes inside of 3264-byte region [0x61f000000e80,0x61f000001b40)
freed by thread T0 here:
    #0 0x7f0fc375b6af in operator delete(void*, unsigned long) /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/libsanitizer/asan/asan_new_delete.cc:177
    #1 0x743be8 in rocksdb::SuperVersion::~SuperVersion() db/column_family.cc:432
    #2 0x8052aa in rocksdb::DBImpl::CleanupSuperVersion(rocksdb::SuperVersion*) db/db_impl/db_impl.cc:3534
    #3 0x8676c2 in rocksdb::DBImpl::ReturnAndCleanupSuperVersion(rocksdb::ColumnFamilyData*, rocksdb::SuperVersion*) db/db_impl/db_impl.cc:3544
    #4 0x8676c2 in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1911
    #5 0x547324 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203
    #6 0x547324 in rocksdb::DBMergeOperandTest_FlushedMergeOperandReadAfterFreeBug_Test::TestBody() db/db_merge_operand_test.cc:117
    #7 0x7241da in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899
    #8 0x7241da in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
    #9 0x701a47 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3973
    #10 0x702040 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3965
    #11 0x702040 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149
    #12 0x7025f7 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4124
    #13 0x7025f7 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4267
    #14 0x704217 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4253
    #15 0x704217 in testing::internal::UnitTestImpl::RunAllTests() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6633
    #16 0x72505a in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899
    #17 0x72505a in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
    #18 0x704aa1 in testing::UnitTest::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6242
    #19 0x4c4aff in RUN_ALL_TESTS() third-party/gtest-1.8.1/fused-src/gtest/gtest.h:22110
    #20 0x4c4aff in main db/db_merge_operand_test.cc:404
    #21 0x7f0fc3108dc4 in __libc_start_main ../csu/libc-start.c:308
    #22 0x5445fd in _start (/data/users/andrewkr/rocksdb/db_merge_operand_test+0x5445fd)
...
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9805

Test Plan: following the fix in this PR, the new unit test passes

Reviewed By: jay-zhuang

Differential Revision: D35388415

Pulled By: ajkr

fbshipit-source-id: b39c5d002155906c8abc4a3429eca696dbf916d0
2022-04-05 12:26:36 -07:00
Yanqin Jin
1a1c5bda23 Disallow commit-time-batch for write-prepared/write-unprepared txn conditionally (#9794)
Summary:
For write-prepared/write-unprepared transactions,
GetCommitTimeWriteBatch() can be used only if the transaction is started
with `TransactionOptions::use_only_the_last_commit_time_batch_for_recovery` set
to true. Otherwise, it is possible that multiple uncommitted versions of the
same key exist in the database. During bottommost compaction, RocksDB may
set the sequence numbers of both to zero once they become committed, causing
output SST file to have two identical internal keys.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9794

Test Plan:
make check
pay special attention to the following
```
transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/*
```

Reviewed By: lth

Differential Revision: D35327214

Pulled By: riversand963

fbshipit-source-id: 3bae00a28359c10e96e4c6f676d20de5610d8a0f
2022-04-05 11:10:20 -07:00
Peter Dillinger
6534c6dea4 Fix remaining uses of "backupable" (#9792)
Summary:
Various renaming and fixes to get rid of remaining uses of
"backupable" which is terminology leftover from the original, flawed
design of BackupableDB. Now any DB can be backed up, using BackupEngine.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D35334386

Pulled By: pdillinger

fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329
2022-04-05 09:52:33 -07:00
Bo Wang
bcabee737f Improve comments for some files (#9793)
Summary:
Update the comments, e.g. fixing typo, formatting, etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9793

Reviewed By: jay-zhuang

Differential Revision: D35323989

Pulled By: gitbw95

fbshipit-source-id: 4a72fc02b67abaae8be0d1439b68f9967a68052d
2022-04-01 16:06:14 -07:00
Andrew Kryczka
f246e56d0a Fix a few documentation errors including in public APIs (#9789)
Summary:
The internal WriteBatch doc wrongly indicated which optypes are followed by varstring. Updated some optypes according to the following code: 76383bea5d/db/write_batch.cc (L418-L429)

The `Iterator::Refresh()` + `DeleteRange()` bug was fixed in https://github.com/facebook/rocksdb/issues/9258; removed the warnings.

`GetMergeOperands()` does populate `*number_of_operands` including upon successful return: 76383bea5d/db/db_impl/db_impl.cc (L1917-L1919)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9789

Reviewed By: riversand963

Differential Revision: D35303421

Pulled By: ajkr

fbshipit-source-id: 9b0e1be5f6b2e2b31461e6c33ecb5f5381824452
2022-04-01 10:30:17 -07:00
Yanqin Jin
6eafdf135a Encode min_log_number_to_keep and delete_wals_before in one version edit (#9766)
Summary:
min_log_number_to_keep denotes that the WALs whose numbers are below
this value **will** be deleted by RocksDB.
delete_wals_before will be used by RocksDB if
track_and_verify_wals_in_manifest is set to true. During recovery,
RocksDB uses the info encoded in delete_wals_before to reconstruct its
knowledge about which WALs are expected to exist.
If these two tags are not encoded in the same VersionEdit, then it's
possible for min_log_number_to_keep=100 to exist, but
delete_wals_before=100 to be lost due to power failure. Subsequent
recovery will delete 99.log. If the db crashes again, the following
recovery will expect to see 99.log since there is no
delete_wals_before=100 in the MANIFEST, but the WAL is already deleted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9766

Test Plan:
First of all, make check.
Second, format compatibility.
SHORT_TEST=1 ./tools/check_format_compatible.sh

Reviewed By: ltamasi

Differential Revision: D35203623

Pulled By: riversand963

fbshipit-source-id: 45623fc4b4b50d299d5e0f9559a3a4c5e9522c8f
2022-03-31 20:00:52 -07:00
sdong
bbcf7b192c Fix DB::Open() error logging (#9784)
Summary:
Right now we log a wrong error when DB::Open() fails. Fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9784

Test Plan: CI runs should pass

Reviewed By: ajkr, riversand963

Differential Revision: D35290203

fbshipit-source-id: ffc640afa27f6b0a2382ee153dc43f28d9e242be
2022-03-31 15:52:01 -07:00
Yanqin Jin
de9df6e818 Do not release and re-acquire dbmutex on memtable-switch if no listener (#9758)
Summary:
There is no need to release-and-acquire immediately when no listener is registered. This is
what we have been doing for `NotifyOnFlushBegin()`, `NotifyOnFlushCompleted()`, `NotifyOnCompactionBegin()`,
`NotifyOnCompactionCompleted()`, and some other `NotifyOnXX` methods in event_helpers.cc.
Do the same for `NotifyOnMemTableSealed()`.
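
A minimal sketch of the early-return pattern described above, under the assumption that callbacks must run outside the DB mutex. `ToyDB` and its members are hypothetical and do not mirror the real `DBImpl` or event_helpers.cc code.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a DB that invokes listeners outside its mutex.
struct ToyDB {
  std::mutex mu;
  std::vector<std::function<void()>> listeners;
  int unlock_count = 0;  // how many times the mutex was dropped for callbacks

  // Precondition: the caller holds `mu`. Postcondition: `mu` is held again.
  void NotifyOnMemTableSealed() {
    if (listeners.empty()) {
      return;  // nothing to call, so keep holding the mutex
    }
    ++unlock_count;
    mu.unlock();  // run callbacks without holding the DB mutex
    for (const auto& callback : listeners) {
      callback();
    }
    mu.lock();
  }
};
```

The release-and-reacquire pair is only paid for when at least one listener is registered.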

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9758

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D35159552

Pulled By: riversand963

fbshipit-source-id: 6e0aac50bd5c8f506d809b6638c33a7a28d1e87f
2022-03-30 20:48:23 -07:00
bbkot
e55018a8ce fixing issue #8345 RocksDB does not work when using UNC network paths (#9384)
Summary:
Fix https://github.com/facebook/rocksdb/issues/8345
RocksDB does not work with network filesystem paths on Windows, e.g. "\\hostname\folder\..."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9384

Reviewed By: mrambacher

Differential Revision: D33830622

Pulled By: riversand963

fbshipit-source-id: 2a99dc3c94415eb1460e110784b97d71600218f1
2022-03-30 15:55:31 -07:00
Peter Dillinger
105d7f0c7c Document SetOptions API (#9778)
Summary:
much needed

Some other minor tweaks also

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9778

Test Plan: existing tests

Reviewed By: ajkr

Differential Revision: D35258195

Pulled By: pdillinger

fbshipit-source-id: 974ddafc23a540aacceb91da72e81593d818f99c
2022-03-30 14:51:12 -07:00
Peter Dillinger
40e3f30a28 Fix FileStorageInfo fields from GetLiveFilesMetaData (#9769)
Summary:
In making `SstFileMetaData` inherit from `FileStorageInfo`, I
overlooked setting some `FileStorageInfo` fields when the default
`SstFileMetaData()` ctor is used. This affected `GetLiveFilesMetaData()`.

Also removed some buggy `static_cast<size_t>`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9769

Test Plan: Updated tests

Reviewed By: jay-zhuang

Differential Revision: D35220383

Pulled By: pdillinger

fbshipit-source-id: 05b4ee468258dbd3699517e1124838bf405fe7f8
2022-03-29 14:36:35 -07:00
Peter Dillinger
cad809978a Fix heap use-after-free race with DropColumnFamily (#9730)
Summary:
Although ColumnFamilySet comments say that DB mutex can be
freed during iteration, as long as you hold a ref while releasing DB
mutex, this is not quite true because UnrefAndTryDelete might delete cfd
right before it is needed to get ->next_ for the next iteration of the
loop.

This change solves the problem by making a wrapper class that makes such
iteration easier while handling the tricky details of UnrefAndTryDelete
on the previous cfd only after getting next_ in operator++.

FreeDeadColumnFamilies should already have been obsolete; this removes
it for good. Similarly, ColumnFamilySet::iterator doesn't need to check
for cfd with 0 refs, because those are immediately deleted.
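
The iteration hazard and the fix can be sketched as follows. This is a single-threaded illustration with hypothetical names (`Node`, `Append`, `ForEachRefSafe`), not the actual `ColumnFamilySet` wrapper: the key point is that `next` is read before the current node is unreferenced.

```cpp
#include <cassert>
#include <vector>

// Hypothetical node in a circular doubly linked list with a sentinel head.
// A node unlinks and deletes itself when its last reference is dropped.
struct Node {
  int id = 0;
  int refs = 0;
  Node* prev = this;
  Node* next = this;

  void Ref() { ++refs; }

  void UnrefAndTryDelete() {
    if (--refs == 0) {
      prev->next = next;
      next->prev = prev;
      delete this;
    }
  }
};

// Appends a heap-allocated node; the list itself owns one reference.
Node* Append(Node* head, int id) {
  Node* n = new Node();
  n->id = id;
  n->refs = 1;
  n->prev = head->prev;
  n->next = head;
  head->prev->next = n;
  head->prev = n;
  return n;
}

// Visits every node while holding a reference on the current one.
template <typename Fn>
void ForEachRefSafe(Node* head, Fn visit) {
  Node* cur = head->next;
  while (cur != head) {
    cur->Ref();
    visit(cur);               // may drop the list's reference to `cur`
    Node* next = cur->next;   // read while `cur` is still alive
    cur->UnrefAndTryDelete(); // may delete `cur`; `next` is already saved
    cur = next;
  }
}
```

Reading `cur->next` after `UnrefAndTryDelete()` instead would touch freed memory whenever the visit dropped the last other reference.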

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9730

Test Plan:
was reported with ASAN on unit tests like
DBLogicalBlockSizeCacheTest.CreateColumnFamily (very rare); keep watching

Reviewed By: ltamasi

Differential Revision: D35038143

Pulled By: pdillinger

fbshipit-source-id: 0a5478d5be96c135343a00603711b7df43ae19c9
2022-03-24 13:05:17 -07:00
Yanqin Jin
e0c84aa0dc Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.

The race condition is between two background flush threads trying to install flush results to the MANIFEST.

Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data is backed by 6.log.

1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst

```
Time  BgFlushThread1                                    BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                                     mutex_.Lock()
 |                                                     precompute min_wal_to_keep as 6
 |                                                     join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                                     wake up and mutex_.Lock()
 |                                                     cfd0->log_number = 8
 |                                                     FindObsoleteFiles() with job_context->log_number == 7
 |                                                     mutex_.Unlock()
 |                                                     PurgeObsoleteFiles() deletes 6.log
 V
```

As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgFlushThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wals_in_manifest` is true.

We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715

Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```

Reviewed By: ltamasi

Differential Revision: D34984412

Pulled By: riversand963

fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
2022-03-23 19:41:31 -07:00
Yanqin Jin
29bec740f5 Return invalid argument if batch is null (#9744)
Summary:
Previously, `DBImpl::WriteImpl(batch...)` returned a corruption status if batch was
null. This is inaccurate since there is no data corruption.
Return `Status::InvalidArgument()` instead.
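
A minimal sketch of the intended behavior, using hypothetical simplified types (`ToyStatus`, `ToyBatch`, `ToyWrite`) rather than RocksDB's real `Status` and `WriteBatch`:

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified status type; the real Status has many more codes.
struct ToyStatus {
  enum Code { kOk, kInvalidArgument, kCorruption };
  Code code;
  std::string msg;
};

// Hypothetical placeholder for a write batch.
struct ToyBatch {
  int count = 0;
};

// A null batch is a caller error, not damaged data, so report
// InvalidArgument rather than Corruption.
ToyStatus ToyWrite(const ToyBatch* batch) {
  if (batch == nullptr) {
    return ToyStatus{ToyStatus::kInvalidArgument, "batch is null"};
  }
  return ToyStatus{ToyStatus::kOk, ""};
}
```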

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9744

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D35086268

Pulled By: riversand963

fbshipit-source-id: 677397b007a53bc25210eac0178d49c9797b5951
2022-03-23 14:28:13 -07:00
Peter Dillinger
91687d70ea Fix a major performance bug in 7.0 re: filter compatibility (#9736)
Summary:
Bloom filters generated by pre-7.0 releases are not read by
7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name()
in https://github.com/facebook/rocksdb/issues/9590. This can severely impact read performance and read I/O on
upgrade or downgrade with existing DB, but not data correctness.

To fix, we go back to using the old, unified name in SST metadata but (for
a while anyway) recognize the aliases that could be generated by early
7.0.x releases. This unfortunately requires a public API change to avoid
interfering with all the good changes from https://github.com/facebook/rocksdb/issues/9590, but the API change
only affects users with custom FilterPolicy, which should be very few.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9736

Test Plan:
manual

Generate DBs with
```
./db_bench.7.0 -db=/dev/shm/rocksdb.7.0 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0
```
and similar. Compare with
```
for IMPL in 6.29 7.0 fixed; do for DB in 6.29 7.0 fixed; do echo "Testing $IMPL on $DB:"; ./db_bench.$IMPL -db=/dev/shm/rocksdb.$DB -use_existing_db -readonly -bloom_bits=10 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=10 2>&1 | grep micros/op; done; done
```

Results:
```
Testing 6.29 on 6.29:
readrandom   :      34.381 micros/op 29085 ops/sec;    3.2 MB/s (291999 of 291999 found)
Testing 6.29 on 7.0:
readrandom   :     190.443 micros/op 5249 ops/sec;    0.6 MB/s (52999 of 52999 found)
Testing 6.29 on fixed:
readrandom   :      40.148 micros/op 24907 ops/sec;    2.8 MB/s (249999 of 249999 found)
Testing 7.0 on 6.29:
readrandom   :     229.430 micros/op 4357 ops/sec;    0.5 MB/s (43999 of 43999 found)
Testing 7.0 on 7.0:
readrandom   :      33.348 micros/op 29986 ops/sec;    3.3 MB/s (299999 of 299999 found)
Testing 7.0 on fixed:
readrandom   :     152.734 micros/op 6546 ops/sec;    0.7 MB/s (65999 of 65999 found)
Testing fixed on 6.29:
readrandom   :      32.024 micros/op 31224 ops/sec;    3.5 MB/s (312999 of 312999 found)
Testing fixed on 7.0:
readrandom   :      33.990 micros/op 29390 ops/sec;    3.3 MB/s (294999 of 294999 found)
Testing fixed on fixed:
readrandom   :      28.714 micros/op 34825 ops/sec;    3.9 MB/s (348999 of 348999 found)
```

Just paying attention to order of magnitude of ops/sec (short test
durations, lots of noise), it's clear that with the fix we can read <= 6.29
& >= 7.0 at full speed, where neither 6.29 nor 7.0 can on both. And 6.29
release can properly read fixed DB at full speed.

Reviewed By: siying, ajkr

Differential Revision: D35057844

Pulled By: pdillinger

fbshipit-source-id: a46893a6af4bf084375ebe4728066d00eb08f050
2022-03-23 10:00:54 -07:00
Yanqin Jin
3bd150c442 Print information about all column families when using ldb (#9719)
Summary:
Before this PR, the following command prints only the default column
family's information in the end:
```
ldb --db=. --hex manifest_dump --verbose
```

We should print all column families instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9719

Test Plan:
`make check` makes sure nothing breaks.

Generate a DB, use the above command to verify all column families are
printed.

Reviewed By: akankshamahajan15

Differential Revision: D34992453

Pulled By: riversand963

fbshipit-source-id: de1d38c4539cd89f74e1a6240ad7a6e2416bf198
2022-03-22 20:29:01 -07:00
gitbw95
8102690a52 Update Cache::Release param from force_erase to erase_if_last_ref (#9728)
Summary:
The param name force_erase may be misleading, since the handle is erased only if it holds the last reference, even if the param is set to true.
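
The semantics behind the new name can be illustrated with a simplified sketch. `ToyHandle` and `ToyRelease` are hypothetical; the real `Cache::Release` has further behavior (such as capacity-driven eviction) that is left out here.

```cpp
#include <cassert>

// Hypothetical cache entry tracking only external references.
struct ToyHandle {
  int external_refs = 0;
  bool in_cache = true;
};

// Drops one external reference. Returns true if the entry was erased.
// Erasure needs BOTH the flag and the caller holding the last reference.
bool ToyRelease(ToyHandle* h, bool erase_if_last_ref) {
  assert(h->external_refs > 0);
  --h->external_refs;
  const bool was_last_ref = (h->external_refs == 0);
  if (erase_if_last_ref && was_last_ref) {
    h->in_cache = false;
    return true;
  }
  return false;
}
```

With two outstanding references, the first `ToyRelease(&h, true)` erases nothing; only the second call does.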

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9728

Reviewed By: pdillinger

Differential Revision: D35038673

Pulled By: gitbw95

fbshipit-source-id: 0d16d1e8fed17b97eba7fb53207119332f659a5f
2022-03-22 10:22:18 -07:00
KNOEEE
cb4d188a34 Fix a bug in PosixClock (#9695)
Summary:
Multiplier here should be 1e6 to get microseconds.
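
The commit message does not show the affected function, so the following is only a generic illustration of the unit conversion, with `ToMicros` as a hypothetical helper: seconds are scaled by 1e6 and nanoseconds are divided by 1e3 to obtain microseconds.

```cpp
#include <cassert>
#include <cstdint>

// Converts a (seconds, nanoseconds) pair to whole microseconds.
uint64_t ToMicros(uint64_t seconds, uint64_t nanoseconds) {
  return seconds * 1000000ULL + nanoseconds / 1000ULL;
}
```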

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9695

Reviewed By: ajkr

Differential Revision: D34897086

Pulled By: jay-zhuang

fbshipit-source-id: 9c1d0811ea740ba0a007edc2da199edbd000b88b
2022-03-21 16:11:02 -07:00
duyuqi
cbe303c19b fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471)
Summary:
The C API's release-snapshot call core dumps when inplace_update_support is enabled and a snapshot has been created.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9471

Reviewed By: akankshamahajan15

Differential Revision: D34965103

Pulled By: riversand963

fbshipit-source-id: c3aeeb9ea7126c2eda1466102794fecf57b6ab77
2022-03-21 12:04:33 -07:00
Peter Dillinger
a8a422e962 Add manifest fix-up utility for file temperatures (#9683)
Summary:
The goal of this change is to allow changes to the "current" (in
FileSystem) file temperatures to feed back into DB metadata, so that
they can inform decisions and stats reporting. In part because of
modular code factoring, it doesn't seem easy to do this automagically,
where opening an SST file and observing current Temperature different
from expected would trigger a change in metadata and DB manifest write
(essentially giving the deep read path access to the write path). It is also
difficult to do this while the DB is open because of the limitations of
LogAndApply.

This change allows updating file temperature metadata on a closed DB
using an experimental utility function UpdateManifestForFilesState()
or `ldb update_manifest --update_temperatures`. This should suffice for
"migration" scenarios where outside tooling has placed or re-arranged DB
files into a (different) tiered configuration without going through
RocksDB itself (currently, only compaction can change temperature
metadata).

Some details:
* Refactored and added unit test for `ldb unsafe_remove_sst_file` because
of shared functionality
* Pulled in autovector.h changes from https://github.com/facebook/rocksdb/issues/9546 to fix SuperVersionContext
move constructor (related to an older draft of this change)

Possible follow-up work:
* Support updating manifest with file checksums, such as when a
new checksum function is used and we want existing DB metadata updated
for it.
* It's possible that for some repair scenarios, lighter weight than
full repair, we might want to support UpdateManifestForFilesState() to
modify critical file details like size or checksum using same
algorithm. But let's make sure these are differentiated from modifying
file details in ways that don't suspect corruption (or require extreme
trust).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9683

Test Plan: unit tests added

Reviewed By: jay-zhuang

Differential Revision: D34798828

Pulled By: pdillinger

fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
2022-03-18 16:35:51 -07:00
Yanqin Jin
b2aacaf923 Fix assertion error by doing comparison with mutex (#9717)
Summary:
On CircleCI MacOS instances, we have been seeing the following assertion error:
```
Assertion failed: (alive_log_files_tail_ == alive_log_files_.rbegin()), function WriteToWAL, file /Users/distiller/project/db/db_impl/db_impl_write.cc, line 1213.
Received signal 6 (Abort trap: 6)
#0   0x1
#1   abort (in libsystem_c.dylib) + 120
#2   err (in libsystem_c.dylib) + 0
#3   rocksdb::DBImpl::WriteToWAL(rocksdb::WriteBatch const&, rocksdb::log::Writer*, unsigned long long*, unsigned long long*, rocksdb::Env::IOPriority, bool, bool) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1213)
#4   rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long long*, bool, bool, unsigned long long) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1251)
#5   rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long long*, unsigned long long, bool, unsigned long long*, unsigned long, rocksdb::PreReleaseCallback*) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:421)
#6   rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:109)
#7   rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:2159)
#8   rocksdb::DBImpl::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:37)
#9   rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db.h:382)
#10  rocksdb::DBBasicTestWithTimestampPrefixSeek_IterateWithPrefix_Test::TestBody() (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:2926)
#11  void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3899)
#12  void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3935)
#13  testing::Test::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:3980)
#14  testing::TestInfo::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4153)
#15  testing::TestCase::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4266)
#16  testing::internal::UnitTestImpl::RunAllTests() (in db_with_timestamp_basic_test) (gtest-all.cc:6632)
#17  bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3899)
#18  bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3935)
#19  testing::UnitTest::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:6242)
#20  RUN_ALL_TESTS() (in db_with_timestamp_basic_test) (gtest.h:22110)
#21  main (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:3150)
#22  start (in libdyld.dylib) + 1
```

It's likely caused by concurrent, unprotected access to the deque, even though `back()` is never popped,
and we are comparing `rbegin()` with a cached `riterator`. To be safe, do the comparison only if we have mutex.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9717

Test Plan:
One example
Ssh to one CircleCI MacOS instance.
```
gtest-parallel -r 1000 -w 8 ./db_test --gtest_filter=DBTest.FlushesInParallelWithCompactRange
```

Reviewed By: pdillinger

Differential Revision: D34990696

Pulled By: riversand963

fbshipit-source-id: 62dd48ae6fedbda53d0a64d73de9b948b4c26eee
2022-03-18 13:11:57 -07:00
Peter Dillinger
cff0d1e8e6 New backup meta schema, with file temperatures (#9660)
Summary:
The primary goal of this change is to add support for backing up and
restoring (applying on restore) file temperature metadata, without
committing to either the DB manifest or the FS reported "current"
temperatures being exclusive "source of truth".

To achieve this goal, we need to add temperature information to backup
metadata, which requires updated backup meta schema. Fortunately I
prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
6.19.0 for this kind of schema update. (Previously, backup meta schema
was not extensible! Making this schema update public will allow some
other "nice to have" features like taking backups with hard links, and
avoiding crc32c checksum computation when another checksum is already
available.) While schema version 2 is newly public, the default schema
version is still 1. Until we change the default, users will need to set
to 2 to enable features like temperature data backup+restore. New
metadata like temperature information will be ignored with a warning
in versions before this change and since 6.19.0. The metadata is
considered ignorable because a functioning DB can be restored without
it.

Some detail:
* Some renaming because "future schema" is now just public schema 2.
* Initialize some atomics in TestFs (linter reported)
* Add temperature hint support to SstFileDumper (used by BackupEngine)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660

Test Plan:
The related unit test is substantially updated for the new functionality,
including some shared testing support for tracking temperatures in a FS.

Some other tests and testing hooks into production code also updated for
making the backup meta schema change public.

Reviewed By: ajkr

Differential Revision: D34686968

Pulled By: pdillinger

fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
2022-03-18 11:06:17 -07:00
Yanqin Jin
5894761056 Improve stress test for transactions (#9568)
Summary:
Test only, no change to functionality.
Extremely low risk of library regression.

Update test key generation by maintaining existing and non-existing keys.
Update db_crashtest.py to drive multiops_txn stress test for both write-committed and write-prepared.
Add a make target 'blackbox_crash_test_with_multiops_txn'.

Running the following commands caught the bug exposed in https://github.com/facebook/rocksdb/issues/9571.
```
$ rm -rf /tmp/rocksdbtest/*
$ ./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 \
    -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 \
    -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 \
    -ub_c=100 -destroy_db_initially=0 -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
$ ./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 \
    -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 \
    -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 -ub_c=100 -destroy_db_initially=0 \
    -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
```

Running the following command caught a bug which will be fixed in https://github.com/facebook/rocksdb/issues/9648 .
```
$ TEST_TMPDIR=/dev/shm make blackbox_crash_test_with_multiops_wc_txn
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9568

Reviewed By: jay-zhuang

Differential Revision: D34308154

Pulled By: riversand963

fbshipit-source-id: 99ff1b65c19b46c471d2f2d3b47adcd342a1b9e7
2022-03-16 19:00:04 -07:00
anand76
a88d8795ec Expand auto recovery to background read errors (#9679)
Summary:
Fix and enhance the background error recovery logic to handle the
following situations -
1. Background read errors during flush/compaction (previously was
resulting in unrecoverable state)
2. Fix auto recovery failure on read/write errors during atomic flush.
It was failing due to a bug in setting the resuming_from_bg_err variable
in AtomicFlushMemTablesToOutputFiles.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9679

Test Plan: Add new unit tests in error_handler_fs_test

Reviewed By: riversand963

Differential Revision: D34770097

Pulled By: anand1976

fbshipit-source-id: 136da973a28d684b9c74bdf668519b0cbbbe1742
2022-03-15 14:45:34 -07:00
Jay Zhuang
2c8100e60e Fix a race condition when disable and enable manual compaction (#9694)
Summary:
In https://github.com/facebook/rocksdb/issues/9659, when `DisableManualCompaction()` is issued, the foreground
manual compaction thread does not have to wait for the background compaction
thread to finish. This can be a problem: if the user re-enables
manual compaction with `EnableManualCompaction()`, it may re-enable the
BG compaction that was supposed to be cancelled.
This patch makes the FG compaction wait on
`manual_compaction_state.done`, which is set either by the BG compaction or by the
Unschedule callback. When the FG manual compaction thread returns, there
should then be no BG compaction running, so a shared_ptr is no longer needed
for `manual_compaction_state`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9694

Test Plan: a StressTest and unittest

Reviewed By: ajkr

Differential Revision: D34885472

Pulled By: jay-zhuang

fbshipit-source-id: e6476175b43e8c59cd49f5c09241036a0716c274
2022-03-15 12:31:14 -07:00
Yanqin Jin
6a76008369 Fix TSAN caused by calling rend() and pop_front(). (#9698)
Summary:
PR9686 makes `WriteToWAL()` call `assert(...!=rend())` while not holding
db mutex or log mutex. Another thread may concurrently call
`pop_front()`, causing race condition.
To fix, assert only if mutex is held.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9698

Test Plan: COMPILE_WITH_TSAN=1 make check

Reviewed By: jay-zhuang

Differential Revision: D34898535

Pulled By: riversand963

fbshipit-source-id: 1ddfa5bf1b6ae8d409cab6ff6e1b5321c6803da9
2022-03-15 12:16:40 -07:00
gukaifeng
89429a9081 fix a bug of the ticker NO_FILE_OPENS (#9677)
Summary:
In the original code, the `NO_FILE_OPENS` ticker is incremented regardless of whether the file is opened successfully. The same open can even be counted more than once, which skews the count.
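
As a minimal sketch of the intended counting rule (not the actual RocksDB code), the counter below is bumped once per call and only after the open succeeds. `FakeFileSystem`, `OpenStats`, and `OpenAndCount` are hypothetical names introduced for illustration only.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical file system stand-in: a file "opens" only if its name is known.
struct FakeFileSystem {
  std::unordered_set<std::string> existing_files;
  bool Open(const std::string& name) const {
    return existing_files.count(name) > 0;
  }
};

// Hypothetical counter standing in for the NO_FILE_OPENS ticker.
struct OpenStats {
  std::uint64_t no_file_opens = 0;
};

// Records the ticker once per call, and only after the open has succeeded.
inline bool OpenAndCount(const FakeFileSystem& fs, const std::string& name,
                         OpenStats& stats) {
  const bool ok = fs.Open(name);
  if (ok) {
    ++stats.no_file_opens;
  }
  return ok;
}
```

With one existing file and one missing file, two calls leave the counter at 1 rather than 2.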

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9677

Reviewed By: jay-zhuang

Differential Revision: D34725733

Pulled By: ajkr

fbshipit-source-id: 841234ed03802c0105fd2107d82a740265ead576
2022-03-15 09:55:49 -07:00
Jermy Li
3da8236837 fix: Reusing-Iterator reads stale keys after DeleteRange() performed (#9258)
Summary:
fix https://github.com/facebook/rocksdb/issues/9255

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9258

Reviewed By: pdillinger

Differential Revision: D34879684

Pulled By: ajkr

fbshipit-source-id: 5934f4b7524dc27ecdf1430e0456a0fc02958fc7
2022-03-15 09:50:21 -07:00
Yanqin Jin
bbdaf63d0f Fix a TSAN-reported bug caused by concurrent access to std::deque (#9686)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9686

According to https://www.cplusplus.com/reference/deque/deque/back/,
"
The container is accessed (neither the const nor the non-const versions modify the container).
The last element is potentially accessed or modified by the caller. Concurrently accessing or modifying other elements is safe.
"

Also according to https://www.cplusplus.com/reference/deque/deque/pop_front/,
"
The container is modified.
The first element is modified. Concurrently accessing or modifying other elements is safe (although see iterator validity above).
"
In RocksDB, we never pop the last element of `DBImpl::alive_log_files_`. We have been
exploiting this fact and the above two properties when ensuring correctness when
`DBImpl::alive_log_files_` may be accessed concurrently. Specifically, it can be accessed
in the write path when db mutex is released. Sometimes, the log mutex is held. It can also be accessed in `FindObsoleteFiles()`
when db mutex is always held. It can also be accessed
during recovery when db mutex is also held.
Given the fact that we never pop the last element of alive_log_files_, we currently do not
acquire additional locks when accessing it in `WriteToWAL()` as follows
```
alive_log_files_.back().AddSize(log_entry.size());
```

This is problematic.

Check source code of deque.h
```
  back() _GLIBCXX_NOEXCEPT
  {
__glibcxx_requires_nonempty();
...
  }

  pop_front() _GLIBCXX_NOEXCEPT
  {
...
  if (this->_M_impl._M_start._M_cur
      != this->_M_impl._M_start._M_last - 1)
    {
      ...
      ++this->_M_impl._M_start._M_cur;
    }
  ...
  }
```

`back()` will actually call `__glibcxx_requires_nonempty()` first.
If `__glibcxx_requires_nonempty()` is enabled and not an empty macro,
it will call `empty()`
```
bool empty() {
return this->_M_impl._M_finish == this->_M_impl._M_start;
}
```
You can see that it will access `this->_M_impl._M_start`, racing with `pop_front()`.
Therefore, TSAN will actually catch the bug in this case.

To be able to use TSAN on our library and unit tests, we should always coordinate
concurrent accesses to STL containers properly.

We need to pass information about db mutex and log mutex into `WriteToWAL()`, otherwise
it's impossible to know which mutex to acquire inside the function.

To fix this, we can cache the tail of `alive_log_files_` by reference, so that we do not have to call `back()` in `WriteToWAL()`.
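
The idea can be sketched roughly as follows. This is a simplified illustration rather than the code in the PR: `AliveLogFiles`, `LogFileInfo`, and the method names are hypothetical, and the sketch assumes that the tail only changes while no writer is inside `AddSizeToTail()`. It relies on the `std::deque` guarantee that `push_back()` and `pop_front()` leave references to the remaining elements valid.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical stand-in for an entry of DBImpl::alive_log_files_.
struct LogFileInfo {
  std::uint64_t number;
  std::uint64_t size;
};

// Sketch: keep a pointer to the newest element, refreshed under the mutex
// whenever the tail changes, so the write path never calls back().
class AliveLogFiles {
 public:
  void AddLog(std::uint64_t number) {
    std::lock_guard<std::mutex> lock(mu_);
    files_.push_back(LogFileInfo{number, 0});
    tail_ = &files_.back();  // back() is only called while holding mu_
  }

  // Drops the oldest log but never the last one. pop_front() invalidates
  // references to the erased element only, so tail_ stays valid.
  void PopOldest() {
    std::lock_guard<std::mutex> lock(mu_);
    if (files_.size() > 1) {
      files_.pop_front();
    }
  }

  // Write path: touches only the cached tail, not the deque's bookkeeping.
  // Assumes AddLog() is not running concurrently with this call.
  void AddSizeToTail(std::uint64_t bytes) {
    assert(tail_ != nullptr);
    tail_->size += bytes;
  }

  std::uint64_t TailNumber() const { return tail_->number; }
  std::uint64_t TailSize() const { return tail_->size; }

 private:
  std::mutex mu_;
  std::deque<LogFileInfo> files_;
  LogFileInfo* tail_ = nullptr;
};
```

Because the write path no longer reads the deque's internal start and finish pointers, it no longer overlaps with what `pop_front()` modifies.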

Reviewed By: pdillinger

Differential Revision: D34780309

fbshipit-source-id: 1def9821f0c437f2736c6a26445d75890377889b
2022-03-14 18:49:55 -07:00
Jay Zhuang
4dff279b19 DisableManualCompaction may fail to cancel an unscheduled task (#9659)
Summary:
https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition, which was waiting for the background thread to clean up the compaction.
Make sure we only unschedule the task when it's scheduled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659

Reviewed By: ajkr

Differential Revision: D34651820

Pulled By: jay-zhuang

fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f
2022-03-12 20:07:04 -08:00
Jay Zhuang
09b0e8f2c7 Fix a timer crash caused by invalid memory management (#9656)
Summary:
The timer crashes when multiple DB instances perform heavy DB open and close
operations concurrently. The crash is caused by adding a timer task with a
smaller timestamp than the currently running task. Fix it by computing the
new task's timestamp under the protection of the timer mutex.
And other fixes:
- Disallow adding duplicated function name to timer
- Fix a minor memory leak in timer when a running task is cancelled
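
A rough sketch of two of these points, under simplifying assumptions: the clock is read while the timer mutex is held, and a duplicate function name is rejected. `TimerSketch` and its members are hypothetical names, not the RocksDB `Timer` class, and the sketch does not run tasks at all.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Minimal timer sketch; all names here are hypothetical.
class TimerSketch {
 public:
  // `clock` returns the current time in microseconds.
  explicit TimerSketch(std::function<std::uint64_t()> clock)
      : clock_(std::move(clock)) {}

  // Returns false if a task with the same name is already registered.
  bool Add(const std::string& name, std::uint64_t delay_us) {
    std::lock_guard<std::mutex> lock(mu_);
    if (!names_.insert(name).second) {
      return false;  // duplicate function names are rejected
    }
    // The clock is read while holding mu_, so the timestamp is ordered
    // consistently with every other queue operation done under the mutex.
    const std::uint64_t run_at_us = clock_() + delay_us;
    heap_.push(Task{run_at_us, name});
    return true;
  }

  // Name of the task that is due first; requires a non-empty queue.
  std::string NextTaskName() {
    std::lock_guard<std::mutex> lock(mu_);
    assert(!heap_.empty());
    return heap_.top().name;
  }

 private:
  struct Task {
    std::uint64_t run_at_us;
    std::string name;
  };
  // Orders the priority queue as a min-heap on the scheduled run time.
  struct RunsLater {
    bool operator()(const Task& a, const Task& b) const {
      return a.run_at_us > b.run_at_us;
    }
  };

  std::function<std::uint64_t()> clock_;
  std::mutex mu_;
  std::priority_queue<Task, std::vector<Task>, RunsLater> heap_;
  std::unordered_set<std::string> names_;
};
```

Injecting the clock as a function keeps the sketch deterministic, so the ordering can be checked without sleeping.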

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656

Reviewed By: ajkr

Differential Revision: D34626296

Pulled By: jay-zhuang

fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b
2022-03-12 11:45:56 -08:00
slk
95305c44a1 Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410)
Summary:
As discussed in https://github.com/facebook/rocksdb/issues/9223, this PR adds a new API named DB::OpenAndTrimHistory. It opens the DB and trims data to the timestamp specified by **trim_ts** (data with a timestamp newer than the specified trim bound is removed). This API should only be used when recovering a timestamp-enabled DB instance.

This PR also implements a new iterator named HistoryTrimmingIterator to support trimming history in DB::OpenAndTrimHistory. HistoryTrimmingIterator wraps the underlying InternalIterator so that keys whose timestamps are newer than **trim_ts** are not returned to the compaction iterator while **trim_ts** is not null.
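
The filtering behaviour can be illustrated with a small self-contained sketch. It is not the RocksDB implementation: `Entry`, `TrimmingIteratorSketch`, and `VisibleKeys` are hypothetical names, timestamps are plain integers, and the "underlying iterator" is just a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified entry: a user key with an integer timestamp.
struct Entry {
  std::string key;
  std::uint64_t ts;
};

// Wraps an underlying sequence and skips entries newer than trim_ts.
class TrimmingIteratorSketch {
 public:
  TrimmingIteratorSketch(std::vector<Entry> entries, std::uint64_t trim_ts)
      : entries_(std::move(entries)), trim_ts_(trim_ts) {
    SkipTrimmed();
  }

  bool Valid() const { return pos_ < entries_.size(); }

  const Entry& Current() const {
    assert(Valid());
    return entries_[pos_];
  }

  void Next() {
    ++pos_;
    SkipTrimmed();
  }

 private:
  // Advance past entries whose timestamp is newer than the trim bound.
  void SkipTrimmed() {
    while (pos_ < entries_.size() && entries_[pos_].ts > trim_ts_) {
      ++pos_;
    }
  }

  std::vector<Entry> entries_;
  std::uint64_t trim_ts_;
  std::size_t pos_ = 0;
};

// Convenience helper: collect the keys that survive trimming.
inline std::vector<std::string> VisibleKeys(std::vector<Entry> entries,
                                            std::uint64_t trim_ts) {
  std::vector<std::string> keys;
  for (TrimmingIteratorSketch it(std::move(entries), trim_ts); it.Valid();
       it.Next()) {
    keys.push_back(it.Current().key);
  }
  return keys;
}
```

With a trim bound of 10, entries stamped 11 or 12 are never surfaced, while entries stamped 5 or 10 pass through unchanged.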

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410

Reviewed By: ltamasi

Differential Revision: D34410207

Pulled By: riversand963

fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236
2022-03-11 16:13:23 -08:00
Baptiste Lemaire
7bed6595f3 Fix mempurge crash reported in #8958 (#9671)
Summary:
Change the `MemPurge` code to address a failure during a crash test reported in https://github.com/facebook/rocksdb/issues/8958.

### Details and results of the crash investigation:
These failures happened in a specific scenario where the list of immutable tables was composed of 2 or more memtables, and the last memtable was the output of a previous `Mempurge` operation. Because the `PickMemtablesToFlush` function included a sorting of the memtables (previous PR related to the Mempurge project), and because the `VersionEdit` of the flush class is piggybacked onto a single one of these memtables, the `VersionEdit` was not properly selected and applied to the `VersionSet` of the DB. Since the `VersionSet` was not edited properly, the database was losing track of the SST file created during the flush process, which was subsequently deleted (and as you can expect, caused the tests to crash).
The following command consistently failed, which was quite convenient to investigate the issue:
`$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done`

### Solution proposed
The memtables are no longer sorted based on their `memtableID` in the `PickMemtablesToFlush` function. Additionally, the `next_log_number` of the memtable created as an output of the `Mempurge` function now takes in the correct value (the log number of the first memtable being mempurged). Finally, the VersionEdit object of the flush class now takes the maximum `next_log_number` of the stack of memtables being flushed, which doesn't change anything when Mempurge is `off` but becomes necessary when Mempurge is `on`.

### Testing of the solution
The following command no longer fails:
``$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done``
Additionally, I ran `db_crashtest` (`whitebox` and `blackbox`) for 2.5 hours with MemPurge on and did not observe any crash.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9671

Reviewed By: pdillinger

Differential Revision: D34697424

Pulled By: bjlemaire

fbshipit-source-id: d1ab675b361904351ac81a35c184030e52222874
2022-03-10 15:16:55 -08:00
Siddhartha Roychowdhury
fec4403ff1 Integrate WAL compression into log reader/writer. (#9642)
Summary:
Integrate the streaming compress/uncompress API into WAL compression.
The streaming compression object is stored in the log_writer along with a reusable output buffer to store the compressed buffer(s).
The streaming uncompress object is stored in the log_reader along with a reusable output buffer to store the uncompressed buffer(s).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9642

Test Plan:
Added unit tests to verify different scenarios - large buffers, split compressed buffers, etc.

Future optimizations:
The overhead for small records is quite high, so it makes sense to compress only buffers above a certain threshold and use a separate record type to indicate that those records are compressed.

Reviewed By: anand1976

Differential Revision: D34709167

Pulled By: sidroyc

fbshipit-source-id: a37a3cd1301adff6152fb3fcd23726106af07dd4
2022-03-09 15:49:53 -08:00
Yanqin Jin
3b6dc049f7 Support user-defined timestamps in write-committed txns (#9629)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629

Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are
locked upon first operation that writes the key or has the intention of writing. For example,
`PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`,
`PessimisticTransaction::SingleDelete()` will write to or delete a key, while
`PessimisticTransaction::GetForUpdate()` is used by application to indicate
to RocksDB that the transaction has the intention of performing write operation later
in the same transaction.
Pessimistic transactions support two-phase commit (2PC). A transaction can be
`Prepare()`'ed and then `Commit()`'ed. The prepare phase is similar to a promise: once
`Prepare()` succeeds, the transaction has acquired the necessary resources to commit.
The resources include locks, persistence of WAL, etc.
Write-committed transaction is the default pessimistic transaction implementation. In
RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare
section. `Commit()` will write a commit marker to the WAL and then write data to the
memtables. While writing to the memtables, different keys in the transaction's write batch
will be assigned different sequence numbers in ascending order.
Until commit/rollback, the transaction holds locks on the keys so that no other transaction
can write to the same keys. Furthermore, the keys' sequence numbers represent the order
in which they are committed and should be made visible. This is convenient for us to
implement support for user-defined timestamps.
Since column families with and without timestamps can co-exist in the same database,
a transaction may or may not involve timestamps. Based on this observation, we add two
optional members to each `PessimisticTransaction`, `read_timestamp_` and
`commit_timestamp_`. If no key in the transaction's write batch has a timestamp, then
setting these two variables does not have any effect. For the rest of this commit, we discuss
only the cases when these two variables are meaningful.

read_timestamp_ is used mainly for validation, and should be set before first call to
`GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls
`TryLock()` that can verify if another transaction has written the same key since
`read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed
written the same key, then validation fails, and RocksDB allows this transaction to
refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()`
with a different timestamp to read, but the result of the read should not be used to
determine data that will be written later.
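
The validation rule described above can be sketched as a single comparison. This is an illustrative model only: `LatestWriteTs` and `ValidateRead` are hypothetical names, timestamps are plain integers, and treating a write at exactly the read timestamp as acceptable is an assumption of the sketch rather than something stated in this commit.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical record of the newest committed timestamp for each key.
using LatestWriteTs = std::map<std::string, std::uint64_t>;

// Returns true if no write to `key` is newer than read_ts. Accepting a write
// at exactly read_ts is an assumption of this sketch.
inline bool ValidateRead(const LatestWriteTs& latest, const std::string& key,
                         std::uint64_t read_ts) {
  const auto it = latest.find(key);
  return it == latest.end() || it->second <= read_ts;
}
```

A caller whose validation fails at read timestamp 10 because the key was written at 20 could raise its read timestamp to 20 and try again, which mirrors the "refine by increasing" step.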

commit_timestamp_ must be set after finishing writing and before transaction commit.
This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after
prepare phase succeeds.

We currently require that the commit timestamp be chosen after all keys are locked. This
means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used
by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`,
`PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`,
etc. will return non-ok status because they specify timestamps before locking the keys.
Users are also prompted to use the `Transaction` APIs when they receive the non-ok status.

Reviewed By: ltamasi

Differential Revision: D31822445

fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43
2022-03-08 16:20:59 -08:00
Hui Xiao
ca0ef54f16 Rate-limit automatic WAL flush after each user write (#9607)
Summary:
**Context:**
WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR provides rate-limiting for automatic WAL flush, the kind that happens automatically after each user write operation (i.e, `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`.

Note that we are NOT rate-limiting WAL flush that do NOT automatically happen after each user write, such as  `Options::manual_wal_flush == true + manual FlushWAL()` (rate-limiting multiple WAL flushes),  for the benefits of:
- being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515)
- being able to turn off rate-limiting for some WAL flushes but not all (e.g, turn it off specifically for the WAL flush of a critical user write like a service's heartbeat)

`WriteOptions::rate_limiter_options` only accept `Env::IO_USER` and `Env::IO_TOTAL` currently due to an implementation constraint.
- The constraint is that we currently queue parallel writes (including WAL writes) based on a FIFO policy, which does not factor rate limiter priority into this layer's scheduling. If we allowed lower priorities such as `Env::IO_HIGH/MID/LOW`, and writes with lower priorities arrived before writes with higher priorities (even by a tiny bit in arrival time), the former would block the latter. That is a "priority inversion" issue and contradicts what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` for now, until that scheduling is improved.

A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR.

**Summary:**
- Renamed test suite `DBRateLimiterTest` to `DBRateLimiterOnReadTest` for adding a new test suite
- Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions
- Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607

Test Plan:
- Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`).
- Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush`
- `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100`
   - verified that WAL flush(i.e, system-call _write_) were chunked into 15 bytes and each _write_ was roughly 1 second apart
   - verified the chunking disappeared when `-rate_limit_auto_wal_flush=0`
- crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0  --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal
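
The chunking observed with `strace` can be pictured with a small sketch. It is a simplified model, not the `WritableFileWriter` logic: `ChunkSizes` is a hypothetical helper, and it assumes the limiter grants at most `burst_bytes` per refill period.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Splits a write of `total_bytes` into pieces no larger than `burst_bytes`,
// the amount the limiter hands out per refill period in this sketch.
inline std::vector<std::size_t> ChunkSizes(std::size_t total_bytes,
                                           std::size_t burst_bytes) {
  std::vector<std::size_t> chunks;
  if (burst_bytes == 0) {
    return chunks;  // a zero budget is treated as "nothing can be written"
  }
  while (total_bytes > 0) {
    const std::size_t n = std::min(total_bytes, burst_bytes);
    chunks.push_back(n);
    total_bytes -= n;
  }
  return chunks;
}
```

Under this model a 40-byte flush with a 15-byte budget becomes three writes of 15, 15, and 10 bytes, each waiting for its own refill period.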

**Benchmarked on flush/compaction to ensure no performance regression:**
- compaction with rate-limiting  (see table 1, avg over 1280-run):  pre-change: **908316 micros/op**; post-change:
   **907350 micros/op (improved by 0.106%)**
```
#!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt
for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
    NUM_RUN=$(($N*(2**($i-1))))
    for j in $(eval echo "{$START..$NUM_RUN}")
    do
       ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact'
    done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt
done
```
- compaction w/o rate-limiting  (see table 2, avg over 640-run):  pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```
- flush with rate-limiting (see table 3, avg over 320-run, run on the [patch](ee5c6023a9) to augment current db_bench ): pre-change: **745752 micros/op**; post-change: **745331 micros/op (improved by 0.06%)**
```
#!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f flush_bmk_output.txt flush_bmk_output_2.txt

for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
    NUM_RUN=$(($N*(2**($i-1))))
    for j in $(eval echo "{$START..$NUM_RUN}")
    do
       ./db_bench -db=$TEST_TMPDIR -write_buffer_size=1048576000 -num=1000000 -rate_limiter_bytes_per_sec=100000000 -benchmarks=fillseq,flush | egrep 'flush'
    done > flush_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' flush_bmk_output.txt >> flush_bmk_output_2.txt
done

```
- flush w/o rate-limiting (see table 4, avg over 320-run, run on the [patch](ee5c6023a9) to augment current db_bench): pre-change: **487512 micros/op**, post-change: **485856 micros/op (improved by 0.34%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```
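
For reference, the `awk` one-liner in the scripts above reports a mean and a population standard deviation computed as `sqrt(sum_sq / n - mean^2)`. A minimal equivalent is sketched below; `Summary` and `Summarize` are hypothetical names, and unlike the scripts it divides by the number of samples actually passed in rather than by a separate `NUM_RUN` variable.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Summary {
  double mean;
  double stddev;  // population standard deviation
};

// Mirrors the awk one-liner: mean = sum / n, std = sqrt(sum_sq / n - mean^2).
// Requires at least one sample.
inline Summary Summarize(const std::vector<double>& samples) {
  assert(!samples.empty());
  double sum = 0.0;
  double sum_sq = 0.0;
  for (double x : samples) {
    sum += x;
    sum_sq += x * x;
  }
  const double n = static_cast<double>(samples.size());
  const double mean = sum / n;
  // Clamp at zero to guard against tiny negative values from rounding.
  const double variance = std::fmax(0.0, sum_sq / n - mean * mean);
  return Summary{mean, std::sqrt(variance)};
}
```

This single-pass formula is convenient but can lose precision when the spread is tiny relative to the mean; for the micros/op values reported here that is unlikely to matter.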

| table 1 - compact with rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
-- | -- | -- | -- | -- | --
10 | 896978 | 16046.9 | 901242 | 15670.9 | 0.475373978
20 | 893718 | 15813 | 886505 | 17544.7 | -0.8070778478
40 | 900426 | 23882.2 | 894958 | 15104.5 | -0.6072681153
80 | 906635 | 21761.5 | 903332 | 23948.3 | -0.3643141948
160 | 898632 | 21098.9 | 907583 | 21145 | 0.9960695813
320 | 905252 | 22785.5 | 908106 | 25325.5 | 0.3152713278
640 | 905213 | 23598.6 | 906741 | 21370.5 | 0.1688000504
**1280** | **908316** | **23533.1** | **907350** | **24626.8** | **-0.1063506533**
average over #-run | 901896.25 | 21064.9625 | 901977.125 | 20592.025 | 0.008967217682

| table 2 - compact w/o rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
-- | -- | -- | -- | -- | --
10 | 811211 | 26996.7 | 807586 | 28456.4 | -0.4468627768
20 | 815465 | 14803.7 | 814608 | 28719.7 | -0.105093413
40 | 809203 | 26187.1 | 797835 | 25492.1 | -1.404839082
80 | 822088 | 28765.3 | 822192 | 32840.4 | 0.01265071379
160 | 821719 | 36344.7 | 821664 | 29544.9 | -0.006693285661
320 | 820921 | 27756.4 | 821403 | 28347.7 | 0.05871454135
**640** | **822197** | **28960.6** | **823148** | **30055.1** | **0.1156657103**
average over #-run | 8.18E+05 | 2.71E+04 | 8.15E+05 | 2.91E+04 |  -0.25

| table 3 - flush with rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
-- | -- | -- | -- | -- | --
10 | 741721 | 11770.8 | 740345 | 5949.76 | -0.1855144994
20 | 735169 | 3561.83 | 743199 | 9755.77 | 1.09226586
40 | 743368 | 8891.03 | 742102 | 8683.22 | -0.1703059588
80 | 742129 | 8148.51 | 743417 | 9631.58 | 0.1735547324
160 | 749045 | 9757.21 | 746256 | 9191.86 | -0.3723407806
**320** | **745752** | **9819.65** | **745331** | **9840.62** | **-0.0564530836**
640 | 749006 | 11080.5 | 748173 | 10578.7 | -0.1112140624
average over #-run | 743741.4286 | 9004.218571 | 744117.5714 | 9090.215714 | 0.05057441238

| table 4 - flush w/o rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 477283 | 24719.6 | 473864 | 12379 | -0.7163464863
20 | 486743 | 20175.2 | 502296 | 23931.3 | 3.195320734
40 | 482846 | 15309.2 | 489820 | 22259.5 | 1.444352858
80 | 491490 | 21883.1 | 490071 | 23085.7 | -0.2887139108
160 | 493347 | 28074.3 | 483609 | 21211.7 | -1.973864238
**320** | **487512** | **21401.5** | **485856** | **22195.2** | **-0.3396839462**
640 | 490307 | 25418.6 | 485435 | 22405.2 | -0.9936631539
average over #-run | 4.87E+05 | 2.24E+04 | 4.87E+05 | 2.11E+04 | 0.00E+00
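
The last column of each table is consistent with the relative change of the post-change average against the pre-change average. A small sketch of that arithmetic, using a hypothetical helper `PercentChange`:

```cpp
#include <cassert>

// Relative change of `post` against `pre`, in percent.
// A negative value means the post-change run took fewer micros/op.
inline double PercentChange(double pre, double post) {
  assert(pre != 0.0);
  return (post - pre) / pre * 100.0;
}
```

For example, the 10-run row of table 4 corresponds to `PercentChange(477283, 473864)`.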

Reviewed By: ajkr

Differential Revision: D34442441

Pulled By: hx235

fbshipit-source-id: 4790f13e1e5c0a95ae1d1cc93ffcf69dc6e78bdd
2022-03-08 13:19:39 -08:00
Ezgi Çiçek
27d6ef8e60 Rename mutable_cf_options to signify explicit copy (#9666)
Summary:
Signify explicit copy with comment and better name for variable `mutable_cf_options`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9666

Reviewed By: riversand963

Differential Revision: D34680934

Pulled By: ezgicicek

fbshipit-source-id: b64ef18725fe523835d14ceb4b29bcdfe493f8ed
2022-03-08 11:26:40 -08:00
Jay Zhuang
36aec94d85 compression_per_level should be used for flush and changeable (#9658)
Summary:
- Make `compression_per_level` dynamically changeable with `SetOptions`;
- Fix a bug where `compression_per_level` was not used for flush;

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D34700749

Pulled By: jay-zhuang

fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c
2022-03-07 18:06:19 -08:00
Peter Dillinger
ce60d0cbe5 Test refactoring for Backups+Temperatures (#9655)
Summary:
In preparation for more support for file Temperatures in BackupEngine,
this change does some test refactoring:
* Move DBTest2::BackupFileTemperature test to
BackupEngineTest::FileTemperatures, with some updates to make it work
in the new home. This test will soon be expanded for deeper backup work.
* Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
support sharing because of above moved test, but split off the "no link"
part to the test needing it.
* Use custom FileSystems in backupable_db_test rather than custom Envs,
because going through Env file interfaces doesn't support temperatures.
* Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name
parameter to FsyncWithDirOptions, which was required because this
limitation caused a crash only after moving to higher fidelity of
FileSystem interface (vs. LegacyDirectoryWrapper throwing away some
parameter details)
* `backupable_options_` -> `engine_options_` as part of the ongoing
work to get rid of the obsolete "backupable" naming.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655

Test Plan: test code updates only

Reviewed By: jay-zhuang

Differential Revision: D34622183

Pulled By: pdillinger

fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
2022-03-04 12:32:30 -08:00
Hui Xiao
fc61e98ae6 Attempt to deflake DBLogicalBlockSizeCacheTest.CreateColumnFamilies (#9516)
Summary:
**Context:**
`DBLogicalBlockSizeCacheTest.CreateColumnFamilies` is flaky: on rare occasions it hits the assertion failure below
```
db/db_logical_block_size_cache_test.cc:210
Expected equality of these values:
  1
  cache_->GetRefCount(cf_path_0_)
    Which is: 2
```

Root cause: `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));` in the test may not decrease the ref count of `cf_path_0_`. The decrease only happens in the clean-up of `ColumnFamilyData`, which runs once nothing references that `ColumnFamilyData` any more. That may not yet be the case when `db->DestroyColumnFamilyHandle(cfs[0])` is called, since background work such as `DumpStats()` can hold a reference to that `ColumnFamilyData` (suggested and repro-d by ajkr). The same applies to `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));`.

See the following diff for a deterministic repro:
```
 diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc
index 196b428a3..4e7a834c4 100644
 --- a/db/db_impl/db_impl.cc
+++ b/db/db_impl/db_impl.cc
@@ -956,10 +956,16 @@ void DBImpl::DumpStats() {
         // near-atomically.
         // Get a ref before unlocking
         cfd->Ref();
+        if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
+          TEST_SYNC_POINT("DBImpl::DumpStats:PostCFDRef");
+        }
         {
           InstrumentedMutexUnlock u(&mutex_);
           cfd->internal_stats()->CollectCacheEntryStats(/*foreground=*/false);
         }
+        if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
+          TEST_SYNC_POINT("DBImpl::DumpStats::PreCFDUnrefAndTryDelete");
+        }
         cfd->UnrefAndTryDelete();
       }
     }
 diff --git a/db/db_logical_block_size_cache_test.cc b/db/db_logical_block_size_cache_test.cc
index 1057871c9..c3872c036 100644
 --- a/db/db_logical_block_size_cache_test.cc
+++ b/db/db_logical_block_size_cache_test.cc
@@ -9,6 +9,7 @@
 #include "env/io_posix.h"
 #include "rocksdb/db.h"
 #include "rocksdb/env.h"
+#include "test_util/sync_point.h"

 namespace ROCKSDB_NAMESPACE {
 class EnvWithCustomLogicalBlockSizeCache : public EnvWrapper {
@@ -183,6 +184,15 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
   ASSERT_EQ(1, cache_->GetRefCount(dbname_));

   std::vector<ColumnFamilyHandle*> cfs;
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
+      {{"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH",
+        "DBImpl::DumpStats:StartRunning"},
+       {"DBImpl::DumpStats:PostCFDRef",
+        "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH"},
+       {"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
+        "PostFinishCheckingRef",
+        "DBImpl::DumpStats::PreCFDUnrefAndTryDelete"}});
   ASSERT_OK(db->CreateColumnFamilies(cf_options, {"cf1", "cf2"}, &cfs));
   ASSERT_EQ(2, cache_->Size());
   ASSERT_TRUE(cache_->Contains(dbname_));
@@ -190,7 +200,7 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
   ASSERT_TRUE(cache_->Contains(cf_path_0_));
   ASSERT_EQ(2, cache_->GetRefCount(cf_path_0_));
   }

    // Delete one handle will not drop cache because another handle is still
   // referencing cf_path_0_.
+  TEST_SYNC_POINT(
+      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH");
+  TEST_SYNC_POINT(
+      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH");
   ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));
   ASSERT_EQ(2, cache_->Size());
   ASSERT_TRUE(cache_->Contains(dbname_));
@@ -209,16 +221,20 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
   ASSERT_TRUE(cache_->Contains(cf_path_0_));
    // Will fail
   ASSERT_EQ(1, cache_->GetRefCount(cf_path_0_));

   // Delete the last handle will drop cache.
   ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));
   ASSERT_EQ(1, cache_->Size());
   ASSERT_TRUE(cache_->Contains(dbname_));
   // Will fail
   ASSERT_EQ(1, cache_->GetRefCount(dbname_));

+  TEST_SYNC_POINT(
+      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
+      "PostFinishCheckingRef");
   delete db;
   ASSERT_EQ(0, cache_->Size());
   ASSERT_OK(DestroyDB(dbname_, options,
       {{"cf1", cf_options}, {"cf2", cf_options}}));
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
 }
```
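
The root cause described above can also be illustrated without RocksDB. The following is a single-threaded toy sketch, assuming a ref-counted object whose clean-up decrements a cache ref count only when the last reference is dropped. `ToyRefCounted` and `CacheRefCountAfterHandleDestroy` are hypothetical names, not RocksDB code, and the real race involves a background thread rather than a flag.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Toy stand-in for a ref-counted object such as ColumnFamilyData.
class ToyRefCounted {
 public:
  explicit ToyRefCounted(std::function<void()> on_cleanup)
      : on_cleanup_(std::move(on_cleanup)) {}

  void Ref() { ++refs_; }

  // Runs the clean-up only when the last reference goes away.
  // Returns true if this call triggered the clean-up.
  bool Unref() {
    assert(refs_ > 0);
    if (--refs_ == 0) {
      on_cleanup_();
      return true;
    }
    return false;
  }

 private:
  int refs_ = 0;
  std::function<void()> on_cleanup_;
};

// Returns the cache ref count observed right after the handle is destroyed,
// given whether background work still holds a reference at that moment.
inline int CacheRefCountAfterHandleDestroy(bool background_holds_ref) {
  int cache_ref_count = 2;  // two users of the same path, as in the test
  ToyRefCounted cfd([&cache_ref_count] { --cache_ref_count; });
  cfd.Ref();                               // reference held through the handle
  if (background_holds_ref) cfd.Ref();     // e.g. taken by a stats dump
  cfd.Unref();                             // handle destroyed
  const int observed = cache_ref_count;    // value the test would assert on
  if (background_holds_ref) cfd.Unref();   // background work finishes later
  return observed;
}
```

With no background reference the observed count is 1; with one it is still 2, which matches the shape of the flaky assertion failure.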

**Summary**
- Removed the flaky assertion
- Clarified the comments for the test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9516

Test Plan:
- CI
- Monitor for future flakiness

Reviewed By: ajkr

Differential Revision: D34055232

Pulled By: hx235

fbshipit-source-id: 9bf83ae5fa88bf6fc829876494d4692082e4c357
2022-03-04 11:35:28 -08:00
Hui Xiao
4a776d81cc Dynamic toggling of BlockBasedTableOptions::detect_filter_construct_corruption (#9654)
Summary:
**Context/Summary:**
As requested, `BlockBasedTableOptions::detect_filter_construct_corruption` can now be dynamically configured using `DB::SetOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9654

Test Plan: - New unit test

Reviewed By: pdillinger

Differential Revision: D34622609

Pulled By: hx235

fbshipit-source-id: c06773ef3d029e6bf1724d3a72dffd37a8ec66d9
2022-03-04 10:35:08 -08:00
Yanqin Jin
659a16d52b Fix bug causing incorrect data returned by snapshot read (#9648)
Summary:
This bug affects use cases that meet both of the following conditions:
- (has only the default column family or disables WAL) and
- has at least one event listener

Atomic flush is NOT affected.

If the above conditions are met, then RocksDB can release the db mutex before picking all the
existing memtables to flush. In the meantime, a snapshot can be created and the db's sequence
number can still be incremented. The upcoming flush will ignore this snapshot.
A later read using this snapshot can return an incorrect result.

To fix this issue, we call the listeners' callbacks after picking the memtables so that we avoid
creating snapshots during this interval.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9648

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D34555456

Pulled By: riversand963

fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee
2022-03-02 21:03:14 -08:00
Yuriy Chernyshov
73fd589b1a Do not rely on ADL when invoking std::max_element (#9608)
Summary:
Certain STL implementations use raw pointers as iterators, and ADL does not work for them.
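
A minimal sketch of the issue, using a hypothetical helper `LargestOf` that is not taken from the RocksDB sources: raw pointers have no associated namespace, so an unqualified call has nothing for argument-dependent lookup to search, while the qualified call always resolves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Returns the largest element of a non-empty array.
// The iterators here are raw pointers. An unqualified
// `max_element(data, data + n)` would not find std::max_element through
// ADL and would fail to compile unless another declaration were in scope.
inline int LargestOf(const int* data, std::size_t n) {
  assert(n > 0);
  return *std::max_element(data, data + n);
}
```

The unqualified form only compiles when the iterator type happens to live in namespace `std`, which is why it can work with one standard library and break with another.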

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9608

Reviewed By: ajkr

Differential Revision: D34583012

Pulled By: riversand963

fbshipit-source-id: 7de6bbc8a080c3e7243ce0d758fe83f1663168aa
2022-03-02 17:41:02 -08:00
Jay Zhuang
db8647969d Unschedule manual compaction from thread-pool queue (#9625)
Summary:
PR https://github.com/facebook/rocksdb/issues/9557 introduced a race condition between the manual compaction
foreground thread and the background compaction thread.
This PR adds the ability to really unschedule manual compaction from the
thread-pool queue by using a different tag name for manual compaction than for
other tasks.
It also fixes an issue where db `close()` didn't cancel the manual compaction thread.
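
The idea of unscheduling by tag can be sketched with a toy, single-threaded queue. This is an illustration under stated assumptions, not RocksDB's thread pool: `ToyTaggedQueue` and its methods are hypothetical, and a real pool would also need locking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Toy task queue in which every task carries an opaque tag.
class ToyTaggedQueue {
 public:
  void Schedule(std::function<void()> fn, const void* tag) {
    assert(fn != nullptr);
    queue_.push_back(Item{std::move(fn), tag});
  }

  // Removes every queued (not yet started) task carrying `tag`.
  // Returns the number of tasks removed.
  std::size_t UnSchedule(const void* tag) {
    const std::size_t before = queue_.size();
    queue_.erase(
        std::remove_if(queue_.begin(), queue_.end(),
                       [tag](const Item& item) { return item.tag == tag; }),
        queue_.end());
    return before - queue_.size();
  }

  // Runs all remaining tasks in FIFO order.
  void RunAll() {
    while (!queue_.empty()) {
      Item item = std::move(queue_.front());
      queue_.pop_front();
      item.fn();
    }
  }

 private:
  struct Item {
    std::function<void()> fn;
    const void* tag;
  };
  std::deque<Item> queue_;
};
```

If manual compaction and other tasks shared one tag, removing by that tag would drop both kinds of task; giving manual compaction its own tag lets it be removed selectively.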

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9625

Test Plan: unit tests do not hang

Reviewed By: ajkr

Differential Revision: D34410811

Pulled By: jay-zhuang

fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c
2022-03-02 13:43:00 -08:00
sdong
33742c2a9f Remove BlockBasedTableOptions.hash_index_allow_collision (#9454)
Summary:
BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it in preparation for the 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454

Test Plan: Run all existing tests.

Reviewed By: ajkr

Differential Revision: D33805827

fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d
2022-03-01 13:58:02 -08:00
Andrew Kryczka
9983eecdfb Dedicate cacheline for DB mutex (#9637)
Summary:
We found a case of cacheline bouncing due to writers locking/unlocking `mutex_` and readers accessing `block_cache_tracer_`. We discovered it only after the issue was fixed by https://github.com/facebook/rocksdb/issues/9462 shifting the `DBImpl` members such that `mutex_` and `block_cache_tracer_` were naturally placed in separate cachelines in our regression testing setup. This PR forces the cacheline alignment of `mutex_` so we don't accidentally reintroduce the problem.
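
A minimal sketch of forcing such a layout with standard C++ `alignas`. The struct, its members and the 64-byte line size are assumptions for illustration; this is not the `DBImpl` layout or the mechanism used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Assumed cache line size; 64 bytes is common on x86-64 but not universal.
constexpr std::size_t kAssumedCacheLineSize = 64;

// Hypothetical layout: the mutex starts on its own cache line boundary, and
// the read-mostly member starts on the next boundary after it, so writers
// locking and unlocking the mutex do not share a line with readers.
struct ToyDbState {
  alignas(kAssumedCacheLineSize) std::mutex mutex;
  alignas(kAssumedCacheLineSize) int read_mostly = 0;
};

static_assert(alignof(ToyDbState) >= kAssumedCacheLineSize,
              "ToyDbState must be cache-line aligned");

// Returns true if the two members start in different cache lines.
inline bool OnSeparateCacheLines(const ToyDbState& s) {
  const auto mutex_line =
      reinterpret_cast<std::uintptr_t>(&s.mutex) / kAssumedCacheLineSize;
  const auto data_line =
      reinterpret_cast<std::uintptr_t>(&s.read_mostly) / kAssumedCacheLineSize;
  return mutex_line != data_line;
}
```

Making the alignment explicit means a later reordering of members cannot silently put the two back on one line.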

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9637

Reviewed By: riversand963

Differential Revision: D34502233

Pulled By: ajkr

fbshipit-source-id: 46aa313b7fe83e80c3de254e332b6fb242434c07
2022-02-27 11:36:54 -08:00
Hui Xiao
87a8b3c8af Deflake DBErrorHandlingFSTest.MultiCFWALWriteError (#9496)
Summary:
**Context:**
As part of https://github.com/facebook/rocksdb/pull/6949, file deletion is disabled for faulty database on the IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed](e66199d848 (diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R396)). Before re-enabling file deletion, it `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `version_` in the recovery process.

However, this is not necessarily true due to `s = error_handler_.ClearBGError();` happening before that assertion can unblock some foreground thread by [`EventHelpers::NotifyOnErrorRecoveryEnd()`](3122cb4358/db/error_handler.cc (L552-L553)) as part of the `ClearBGError()`. That foreground thread can do whatever it wants including closing/reopening the db and clean up that same `versions_`.

As a consequence, `assert(versions_->io_status().ok());` will access `io_status()` of a nullptr and a test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to [reopen the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494), where [`versions_` gets reset to nullptr](https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl.cc?fbclid=IwAR2uRhwBiPKgmE9q_6CM2mzbfwjoRgsGpXOrHruSJUDcAKc9rYZtVSvKdOY#L678) as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok());` gets executed in the background thread, then we can see an error like
```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```

**Summary:**
- I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up foreground, which I think is right before we LOG `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
   - For context, the original https://github.com/facebook/rocksdb/pull/3997 introducing `DBImpl::Resume()` calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`, while the later https://github.com/facebook/rocksdb/pull/6949 distances these two calls a bit.
   - And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);` at least syntax-wise since these two functions are orthogonal. And it also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which basically is resetting some state variables.
- In addition, to preserve the previous behavior of https://github.com/facebook/rocksdb/pull/6949, where the status of re-enabling file deletion is not taken into account in the general status of resuming the db, I separated `enable_file_deletion_s` from the general `s`
- In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` more clear, I separated it into its own if-block.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9496

Test Plan:
- Manually reproduce the assertion failure in `DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting sleep like below so that it's more likely for `assert(versions_->io_status().ok());` to execute after [reopening the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494) in the foreground (i.e., testing) thread
```
sleep(1);
assert(versions_->io_status().ok());
```
   `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
   ```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
#0   rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext)  /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421
#1   rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
#2   rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError()       /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
#3   rocksdb/error_handler_fs_test()
   ```
- The assertion failure does not happen with this PR
`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
`[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms)  `

Reviewed By: riversand963, anand1976

Differential Revision: D33990099

Pulled By: hx235

fbshipit-source-id: 2e0259a471fa8892ff177da91b3e1c0792dd7bab
2022-02-25 14:44:46 -08:00
Siddhartha Roychowdhury
21345d2823 Streaming Compression API for WAL compression. (#9619)
Summary:
Implement a streaming compression API (compress/uncompress) to use for WAL compression. The log_writer would use the compress class/API to compress a record before writing it out in chunks. The log_reader would use the uncompress class/API to uncompress the chunks and combine them into a single record.

Added unit test to verify the API for different sizes/compression types.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9619

Test Plan: make -j24 check

Reviewed By: anand1976

Differential Revision: D34437346

Pulled By: sidroyc

fbshipit-source-id: b180569ad2ddcf3106380f8758b556cc0ad18382
2022-02-23 23:45:04 -08:00
Yanqin Jin
6f12599863 Support WBWI for keys having timestamps (#9603)
Summary:
This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps
and reading the keys back. **The index does not have timestamps.**

Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it.
When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before
comparison.

When the user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be
provided to query the db. The assumption is that data in the batch must be newer than data from the db.
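
The lookup order described above can be sketched with plain standard containers. This is a toy model under stated assumptions, not the `WriteBatchWithIndex` implementation: `ToyBatch`, `ToyDb` and `ToyGetFromBatchAndDB` are hypothetical, timestamps are plain integers, and deletions and merges are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Batch entries carry no timestamp: key -> value.
using ToyBatch = std::map<std::string, std::string>;
// DB entries are versioned: key -> (timestamp -> value).
using ToyDb = std::map<std::string, std::map<std::uint64_t, std::string>>;

// Looks in the batch first (no timestamp needed), then falls back to the
// newest DB version whose timestamp is <= read_ts.
// Returns true and fills *value if the key is found.
inline bool ToyGetFromBatchAndDB(const ToyBatch& batch, const ToyDb& db,
                                 const std::string& key,
                                 std::uint64_t read_ts, std::string* value) {
  assert(value != nullptr);
  const auto in_batch = batch.find(key);
  if (in_batch != batch.end()) {
    *value = in_batch->second;  // batch data is assumed newer than the DB
    return true;
  }
  const auto in_db = db.find(key);
  if (in_db == db.end()) return false;
  const auto& versions = in_db->second;
  const auto first_newer = versions.upper_bound(read_ts);
  if (first_newer == versions.begin()) return false;  // all versions too new
  *value = std::prev(first_newer)->second;
  return true;
}
```

Because the batch is consulted first and returned unconditionally, the sketch only makes sense under the stated assumption that batch data is newer than anything in the db.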

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D34354849

Pulled By: riversand963

fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0
2022-02-22 14:23:01 -08:00