Commit Graph

4852 Commits

Author SHA1 Message Date
Peter Dillinger
91687d70ea Fix a major performance bug in 7.0 re: filter compatibility (#9736)
Summary:
Bloom filters generated by pre-7.0 releases are not read by
7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name()
in https://github.com/facebook/rocksdb/issues/9590. This can severely impact read performance and read I/O on
upgrade or downgrade with existing DB, but not data correctness.

To fix, we go back using the old, unified name in SST metadata but (for
a while anyway) recognize the aliases that could be generated by early
7.0.x releases. This unfortunately requires a public API change to avoid
interfering with all the good changes from https://github.com/facebook/rocksdb/issues/9590, but the API change
only affects users with custom FilterPolicy, which should be very few.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9736

Test Plan:
manual

Generate DBs with
```
./db_bench.7.0 -db=/dev/shm/rocksdb.7.0 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0
```
and similar. Compare with
```
for IMPL in 6.29 7.0 fixed; do for DB in 6.29 7.0 fixed; do echo "Testing $IMPL on $DB:"; ./db_bench.$IMPL -db=/dev/shm/rocksdb.$DB -use_existing_db -readonly -bloom_bits=10 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=10 2>&1 | grep micros/op; done; done
```

Results:
```
Testing 6.29 on 6.29:
readrandom   :      34.381 micros/op 29085 ops/sec;    3.2 MB/s (291999 of 291999 found)
Testing 6.29 on 7.0:
readrandom   :     190.443 micros/op 5249 ops/sec;    0.6 MB/s (52999 of 52999 found)
Testing 6.29 on fixed:
readrandom   :      40.148 micros/op 24907 ops/sec;    2.8 MB/s (249999 of 249999 found)
Testing 7.0 on 6.29:
readrandom   :     229.430 micros/op 4357 ops/sec;    0.5 MB/s (43999 of 43999 found)
Testing 7.0 on 7.0:
readrandom   :      33.348 micros/op 29986 ops/sec;    3.3 MB/s (299999 of 299999 found)
Testing 7.0 on fixed:
readrandom   :     152.734 micros/op 6546 ops/sec;    0.7 MB/s (65999 of 65999 found)
Testing fixed on 6.29:
readrandom   :      32.024 micros/op 31224 ops/sec;    3.5 MB/s (312999 of 312999 found)
Testing fixed on 7.0:
readrandom   :      33.990 micros/op 29390 ops/sec;    3.3 MB/s (294999 of 294999 found)
Testing fixed on fixed:
readrandom   :      28.714 micros/op 34825 ops/sec;    3.9 MB/s (348999 of 348999 found)
```

Just paying attention to order of magnitude of ops/sec (short test
durations, lots of noise), it's clear that with the fix we can read <= 6.29
& >= 7.0 at full speed, where neither 6.29 nor 7.0 can on both. And 6.29
release can properly read fixed DB at full speed.

Reviewed By: siying, ajkr

Differential Revision: D35057844

Pulled By: pdillinger

fbshipit-source-id: a46893a6af4bf084375ebe4728066d00eb08f050
2022-03-23 10:00:54 -07:00
Yanqin Jin
3bd150c442 Print information about all column families when using ldb (#9719)
Summary:
Before this PR, the following command prints only the default column
family's information in the end:
```
ldb --db=. --hex manifest_dump --verbose
```

We should print all column families instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9719

Test Plan:
`make check` makes sure nothing breaks.

Generate a DB, use the above command to verify all column families are
printed.

Reviewed By: akankshamahajan15

Differential Revision: D34992453

Pulled By: riversand963

fbshipit-source-id: de1d38c4539cd89f74e1a6240ad7a6e2416bf198
2022-03-22 20:29:01 -07:00
gitbw95
8102690a52 Update Cache::Release param from force_erase to erase_if_last_ref (#9728)
Summary:
The param name force_erase may be misleading, since the handle is erased only if it has last reference even if the param is set true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9728

Reviewed By: pdillinger

Differential Revision: D35038673

Pulled By: gitbw95

fbshipit-source-id: 0d16d1e8fed17b97eba7fb53207119332f659a5f
2022-03-22 10:22:18 -07:00
KNOEEE
cb4d188a34 Fix a bug in PosixClock (#9695)
Summary:
Multiplier here should be 1e6 to get microseconds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9695

Reviewed By: ajkr

Differential Revision: D34897086

Pulled By: jay-zhuang

fbshipit-source-id: 9c1d0811ea740ba0a007edc2da199edbd000b88b
2022-03-21 16:11:02 -07:00
duyuqi
cbe303c19b fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471)
Summary:
c api release snapshot will core dump when enable inplace_update_support and create snapshot

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9471

Reviewed By: akankshamahajan15

Differential Revision: D34965103

Pulled By: riversand963

fbshipit-source-id: c3aeeb9ea7126c2eda1466102794fecf57b6ab77
2022-03-21 12:04:33 -07:00
Peter Dillinger
a8a422e962 Add manifest fix-up utility for file temperatures (#9683)
Summary:
The goal of this change is to allow changes to the "current" (in
FileSystem) file temperatures to feed back into DB metadata, so that
they can inform decisions and stats reporting. In part because of
modular code factoring, it doesn't seem easy to do this automagically,
where opening an SST file and observing current Temperature different
from expected would trigger a change in metadata and DB manifest write
(essentially giving the deep read path access to the write path). It is also
difficult to do this while the DB is open because of the limitations of
LogAndApply.

This change allows updating file temperature metadata on a closed DB
using an experimental utility function UpdateManifestForFilesState()
or `ldb update_manifest --update_temperatures`. This should suffice for
"migration" scenarios where outside tooling has placed or re-arranged DB
files into a (different) tiered configuration without going through
RocksDB itself (currently, only compaction can change temperature
metadata).

Some details:
* Refactored and added unit test for `ldb unsafe_remove_sst_file` because
of shared functionality
* Pulled in autovector.h changes from https://github.com/facebook/rocksdb/issues/9546 to fix SuperVersionContext
move constructor (related to an older draft of this change)

Possible follow-up work:
* Support updating manifest with file checksums, such as when a
new checksum function is used and want existing DB metadata updated
for it.
* It's possible that for some repair scenarios, lighter weight than
full repair, we might want to support UpdateManifestForFilesState() to
modify critical file details like size or checksum using same
algorithm. But let's make sure these are differentiated from modifying
file details in ways that don't suspect corruption (or require extreme
trust).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9683

Test Plan: unit tests added

Reviewed By: jay-zhuang

Differential Revision: D34798828

Pulled By: pdillinger

fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
2022-03-18 16:35:51 -07:00
Yanqin Jin
b2aacaf923 Fix assertion error by doing comparison with mutex (#9717)
Summary:
On CircleCI MacOS instances, we have been seeing the following assertion error:
```
Assertion failed: (alive_log_files_tail_ == alive_log_files_.rbegin()), function WriteToWAL, file /Users/distiller/project/db/db_impl/db_impl_write.cc, line 1213.
Received signal 6 (Abort trap: 6)
#0   0x1
https://github.com/facebook/rocksdb/issues/1   abort (in libsystem_c.dylib) + 120
https://github.com/facebook/rocksdb/issues/2   err (in libsystem_c.dylib) + 0
https://github.com/facebook/rocksdb/issues/3   rocksdb::DBImpl::WriteToWAL(rocksdb::WriteBatch const&, rocksdb::log::Writer*, unsigned long long*, unsigned long long*, rocksdb::Env::IOPriority, bool, bool) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1213)
https://github.com/facebook/rocksdb/issues/4   rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long long*, bool, bool, unsigned long long) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1251)
https://github.com/facebook/rocksdb/issues/5   rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long long*, unsigned long long, bool, unsigned long long*, unsigned long, rocksdb::PreReleaseCallback*) (in librocksdb.7.0.0.dylib) (db_impl_	rite.cc:421)
https://github.com/facebook/rocksdb/issues/6   rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:109)
https://github.com/facebook/rocksdb/issues/7   rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:2159)
https://github.com/facebook/rocksdb/issues/8   rocksdb::DBImpl::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:37)
https://github.com/facebook/rocksdb/issues/9   rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db.h:382)
https://github.com/facebook/rocksdb/issues/10  rocksdb::DBBasicTestWithTimestampPrefixSeek_IterateWithPrefix_Test::TestBody() (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:2926)
https://github.com/facebook/rocksdb/issues/11  void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3899)
https://github.com/facebook/rocksdb/issues/12  void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3935)
https://github.com/facebook/rocksdb/issues/13  testing::Test::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:3980)
https://github.com/facebook/rocksdb/issues/14  testing::TestInfo::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4153)
https://github.com/facebook/rocksdb/issues/15  testing::TestCase::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4266)
https://github.com/facebook/rocksdb/issues/16  testing::internal::UnitTestImpl::RunAllTests() (in db_with_timestamp_basic_test) (gtest-all.cc:6632)
https://github.com/facebook/rocksdb/issues/17  bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3899)
https://github.com/facebook/rocksdb/issues/18  bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3935)
https://github.com/facebook/rocksdb/issues/19  testing::UnitTest::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:6242)
https://github.com/facebook/rocksdb/issues/20  RUN_ALL_TESTS() (in db_with_timestamp_basic_test) (gtest.h:22110)
https://github.com/facebook/rocksdb/issues/21  main (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:3150)
https://github.com/facebook/rocksdb/issues/22  start (in libdyld.dylib) + 1
```

It's likely caused by concurrent, unprotected access to the deque, even though `back()` is never popped,
and we are comparing `rbegin()` with a cached `riterator`. To be safe, do the comparison only if we have mutex.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9717

Test Plan:
One example
Ssh to one CircleCI MacOS instance.
```
gtest-parallel -r 1000 -w 8 ./db_test --gtest_filter=DBTest.FlushesInParallelWithCompactRange
```

Reviewed By: pdillinger

Differential Revision: D34990696

Pulled By: riversand963

fbshipit-source-id: 62dd48ae6fedbda53d0a64d73de9b948b4c26eee
2022-03-18 13:11:57 -07:00
Peter Dillinger
cff0d1e8e6 New backup meta schema, with file temperatures (#9660)
Summary:
The primary goal of this change is to add support for backing up and
restoring (applying on restore) file temperature metadata, without
committing to either the DB manifest or the FS reported "current"
temperatures being exclusive "source of truth".

To achieve this goal, we need to add temperature information to backup
metadata, which requires updated backup meta schema. Fortunately I
prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
6.19.0 for this kind of schema update. (Previously, backup meta schema
was not extensible! Making this schema update public will allow some
other "nice to have" features like taking backups with hard links, and
avoiding crc32c checksum computation when another checksum is already
available.) While schema version 2 is newly public, the default schema
version is still 1. Until we change the default, users will need to set
to 2 to enable features like temperature data backup+restore. New
metadata like temperature information will be ignored with a warning
in versions before this change and since 6.19.0. The metadata is
considered ignorable because a functioning DB can be restored without
it.

Some detail:
* Some renaming because "future schema" is now just public schema 2.
* Initialize some atomics in TestFs (linter reported)
* Add temperature hint support to SstFileDumper (used by BackupEngine)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660

Test Plan:
related unit test majorly updated for the new functionality,
including some shared testing support for tracking temperatures in a FS.

Some other tests and testing hooks into production code also updated for
making the backup meta schema change public.

Reviewed By: ajkr

Differential Revision: D34686968

Pulled By: pdillinger

fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
2022-03-18 11:06:17 -07:00
Yanqin Jin
5894761056 Improve stress test for transactions (#9568)
Summary:
Test only, no change to functionality.
Extremely low risk of library regression.

Update test key generation by maintaining existing and non-existing keys.
Update db_crashtest.py to drive multiops_txn stress test for both write-committed and write-prepared.
Add a make target 'blackbox_crash_test_with_multiops_txn'.

Running the following commands caught the bug exposed in https://github.com/facebook/rocksdb/issues/9571.
```
$rm -rf /tmp/rocksdbtest/*
$./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 \
    -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 \
   -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 \
   -ub_c=100 -destroy_db_initially=0 -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
$./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0
   -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 \
   -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 -ub_c=100 -destroy_db_initially=0 \
   -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
```

Running the following command caught a bug which will be fixed in https://github.com/facebook/rocksdb/issues/9648 .
```
$TEST_TMPDIR=/dev/shm make blackbox_crash_test_with_multiops_wc_txn
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9568

Reviewed By: jay-zhuang

Differential Revision: D34308154

Pulled By: riversand963

fbshipit-source-id: 99ff1b65c19b46c471d2f2d3b47adcd342a1b9e7
2022-03-16 19:00:04 -07:00
anand76
a88d8795ec Expand auto recovery to background read errors (#9679)
Summary:
Fix and enhance the background error recovery logic to handle the
following situations -
1. Background read errors during flush/compaction (previously was
resulting in unrecoverable state)
2. Fix auto recovery failure on read/write errors during atomic flush.
It was failing due to a bug in setting the resuming_from_bg_err variable
in AtomicFlushMemTablesToOutputFiles.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9679

Test Plan: Add new unit tests in error_handler_fs_test

Reviewed By: riversand963

Differential Revision: D34770097

Pulled By: anand1976

fbshipit-source-id: 136da973a28d684b9c74bdf668519b0cbbbe1742
2022-03-15 14:45:34 -07:00
Jay Zhuang
2c8100e60e Fix a race condition when disable and enable manual compaction (#9694)
Summary:
In https://github.com/facebook/rocksdb/issues/9659, when `DisableManualCompaction()` is issued, the foreground
manual compaction thread does not have to wait background compaction
thread to finish. Which could be a problem that the user re-enable
manual compaction with `EnableManualCompaction()`, it may re-enable the
BG compaction which supposed be cancelled.
This patch makes the FG compaction wait on
`manual_compaction_state.done`, which either be set by BG compaction or
Unschedule callback. Then when FG manual compaction thread returns, it
should not have BG compaction running. So shared_ptr is no longer needed
for `manual_compaction_state`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9694

Test Plan: a StressTest and unittest

Reviewed By: ajkr

Differential Revision: D34885472

Pulled By: jay-zhuang

fbshipit-source-id: e6476175b43e8c59cd49f5c09241036a0716c274
2022-03-15 12:31:14 -07:00
Yanqin Jin
6a76008369 Fix TSAN caused by calling rend() and pop_front(). (#9698)
Summary:
PR9686 makes `WriteToWAL()` call `assert(...!=rend())` while not holding
db mutex or log mutex. Another thread may concurrently call
`pop_front()`, causing race condition.
To fix, assert only if mutex is held.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9698

Test Plan: COMPILE_WITH_TSAN=1 make check

Reviewed By: jay-zhuang

Differential Revision: D34898535

Pulled By: riversand963

fbshipit-source-id: 1ddfa5bf1b6ae8d409cab6ff6e1b5321c6803da9
2022-03-15 12:16:40 -07:00
gukaifeng
89429a9081 fix a bug of the ticker NO_FILE_OPENS (#9677)
Summary:
In the original code, the value of `NO_FILE_OPENS` corresponding to the Ticker item will be increased regardless of whether the file is successfully opened or not. Even counts are repeated, which can lead to skewed counts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9677

Reviewed By: jay-zhuang

Differential Revision: D34725733

Pulled By: ajkr

fbshipit-source-id: 841234ed03802c0105fd2107d82a740265ead576
2022-03-15 09:55:49 -07:00
Jermy Li
3da8236837 fix: Reusing-Iterator reads stale keys after DeleteRange() performed (#9258)
Summary:
fix https://github.com/facebook/rocksdb/issues/9255

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9258

Reviewed By: pdillinger

Differential Revision: D34879684

Pulled By: ajkr

fbshipit-source-id: 5934f4b7524dc27ecdf1430e0456a0fc02958fc7
2022-03-15 09:50:21 -07:00
Yanqin Jin
bbdaf63d0f Fix a TSAN-reported bug caused by concurrent accesss to std::deque (#9686)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9686

According to https://www.cplusplus.com/reference/deque/deque/back/,
"
The container is accessed (neither the const nor the non-const versions modify the container).
The last element is potentially accessed or modified by the caller. Concurrently accessing or modifying other elements is safe.
"

Also according to https://www.cplusplus.com/reference/deque/deque/pop_front/,
"
The container is modified.
The first element is modified. Concurrently accessing or modifying other elements is safe (although see iterator validity above).
"
In RocksDB, we never pop the last element of `DBImpl::alive_log_files_`. We have been
exploiting this fact and the above two properties when ensuring correctness when
`DBImpl::alive_log_files_` may be accessed concurrently. Specifically, it can be accessed
in the write path when db mutex is released. Sometimes, the log_mute_ is held. It can also be accessed in `FindObsoleteFiles()`
when db mutex is always held. It can also be accessed
during recovery when db mutex is also held.
Given the fact that we never pop the last element of alive_log_files_, we currently do not
acquire additional locks when accessing it in `WriteToWAL()` as follows
```
alive_log_files_.back().AddSize(log_entry.size());
```

This is problematic.

Check source code of deque.h
```
  back() _GLIBCXX_NOEXCEPT
  {
__glibcxx_requires_nonempty();
...
  }

  pop_front() _GLIBCXX_NOEXCEPT
  {
...
  if (this->_M_impl._M_start._M_cur
      != this->_M_impl._M_start._M_last - 1)
    {
      ...
      ++this->_M_impl._M_start._M_cur;
    }
  ...
  }
```

`back()` will actually call `__glibcxx_requires_nonempty()` first.
If `__glibcxx_requires_nonempty()` is enabled and not an empty macro,
it will call `empty()`
```
bool empty() {
return this->_M_impl._M_finish == this->_M_impl._M_start;
}
```
You can see that it will access `this->_M_impl._M_start`, racing with `pop_front()`.
Therefore, TSAN will actually catch the bug in this case.

To be able to use TSAN on our library and unit tests, we should always coordinate
concurrent accesses to STL containers properly.

We need to pass information about db mutex and log mutex into `WriteToWAL()`, otherwise
it's impossible to know which mutex to acquire inside the function.

To fix this, we can catch the tail of `alive_log_files_` by reference, so that we do not have to call `back()` in `WriteToWAL()`.

Reviewed By: pdillinger

Differential Revision: D34780309

fbshipit-source-id: 1def9821f0c437f2736c6a26445d75890377889b
2022-03-14 18:49:55 -07:00
Jay Zhuang
4dff279b19 DisableManualCompaction may fail to cancel an unscheduled task (#9659)
Summary:
https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition which was waiting for the background thread to clean-up the compaction.
make sure we only unschedule the task when it's scheduled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659

Reviewed By: ajkr

Differential Revision: D34651820

Pulled By: jay-zhuang

fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f
2022-03-12 20:07:04 -08:00
Jay Zhuang
09b0e8f2c7 Fix a timer crash caused by invalid memory management (#9656)
Summary:
Timer crash when multiple DB instances doing heavy DB open and close
operations concurrently. Which is caused by adding a timer task with
smaller timestamp than the current running task. Fix it by moving the
getting new task timestamp part within timer mutex protection.
And other fixes:
- Disallow adding duplicated function name to timer
- Fix a minor memory leak in timer when a running task is cancelled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656

Reviewed By: ajkr

Differential Revision: D34626296

Pulled By: jay-zhuang

fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b
2022-03-12 11:45:56 -08:00
slk
95305c44a1 Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410)
Summary:
As disscussed in (https://github.com/facebook/rocksdb/issues/9223), Here added a new API  named DB::OpenAndTrimHistory, this API will open DB and trim data to the timestamp specofied by **trim_ts** (The data with newer timestamp than specified trim bound will be removed). This API should only be used at a timestamp-enabled db instance recovery.

And this PR implemented a new iterator named HistoryTrimmingIterator to support trimming history with a new API named DB::OpenAndTrimHistory. HistoryTrimmingIterator wrapped around the underlying InternalITerator such that keys whose timestamps newer than **trim_ts** should not be returned to the compaction iterator while **trim_ts** is not null.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410

Reviewed By: ltamasi

Differential Revision: D34410207

Pulled By: riversand963

fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236
2022-03-11 16:13:23 -08:00
Baptiste Lemaire
7bed6595f3 Fix mempurge crash reported in #8958 (#9671)
Summary:
Change the `MemPurge` code to address a failure during a crash test reported in https://github.com/facebook/rocksdb/issues/8958.

### Details and results of the crash investigation:
These failures happened in a specific scenario where the list of immutable tables was composed of 2 or more memtables, and the last memtable was the output of a previous `Mempurge` operation. Because the `PickMemtablesToFlush` function included a sorting of the memtables (previous PR related to the Mempurge project), and because the `VersionEdit` of the flush class is piggybacked onto a single one of these memtables, the `VersionEdit` was not properly selected and applied to the `VersionSet` of the DB. Since the `VersionSet` was not edited properly, the database was losing track of the SST file created during the flush process, which was subsequently deleted (and as you can expect, caused the tests to crash).
The following command consistently failed, which was quite convenient to investigate the issue:
`$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done`

### Solution proposed
The memtables are no longer sorted based on their `memtableID` in the `PickMemtablesToFlush` function. Additionally, the `next_log_number` of the memtable created as an output of the `Mempurge` function now takes in the correct value (the log number of the first memtable being mempurged). Finally, the VersionEdit object of the flush class now takes the maximum `next_log_number` of the stack of memtables being flushed, which doesnt change anything when Mempurge is `off` but becomes necessary when Mempurge is `on`.

### Testing of the solution
The following command no longer fails:
``$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done``
Additionally, I ran `db_crashtest` (`whitebox` and `blackbox`) for 2.5 hours with MemPurge on and did not observe any crash.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9671

Reviewed By: pdillinger

Differential Revision: D34697424

Pulled By: bjlemaire

fbshipit-source-id: d1ab675b361904351ac81a35c184030e52222874
2022-03-10 15:16:55 -08:00
Siddhartha Roychowdhury
fec4403ff1 Integrate WAL compression into log reader/writer. (#9642)
Summary:
Integrate the streaming compress/uncompress API into WAL compression.
The streaming compression object is stored in the log_writer along with a reusable output buffer to store the compressed buffer(s).
The streaming uncompress object is stored in the log_reader along with a reusable output buffer to store the uncompressed buffer(s).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9642

Test Plan:
Added unit tests to verify different scenarios - large buffers, split compressed buffers, etc.

Future optimizations:
The overhead for small records is quite high, so it makes sense to compress only buffers above a certain threshold and use a separate record type to indicate that those records are compressed.

Reviewed By: anand1976

Differential Revision: D34709167

Pulled By: sidroyc

fbshipit-source-id: a37a3cd1301adff6152fb3fcd23726106af07dd4
2022-03-09 15:49:53 -08:00
Yanqin Jin
3b6dc049f7 Support user-defined timestamps in write-committed txns (#9629)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629

Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are
locked upon first operation that writes the key or has the intention of writing. For example,
`PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`,
`PessimisticTransaction::SingleDelete()` will write to or delete a key, while
`PessimisticTransaction::GetForUpdate()` is used by application to indicate
to RocksDB that the transaction has the intention of performing write operation later
in the same transaction.
Pessimistic transactions support two-phase commit (2PC). A transaction can be
`Prepared()`'ed and then `Commit()`. The prepare phase is similar to a promise: once
`Prepare()` succeeds, the transaction has acquired the necessary resources to commit.
The resources include locks, persistence of WAL, etc.
Write-committed transaction is the default pessimistic transaction implementation. In
RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare
section. `Commit()` will write a commit marker to the WAL and then write data to the
memtables. While writing to the memtables, different keys in the transaction's write batch
will be assigned different sequence numbers in ascending order.
Until commit/rollback, the transaction holds locks on the keys so that no other transaction
can write to the same keys. Furthermore, the keys' sequence numbers represent the order
in which they are committed and should be made visible. This is convenient for us to
implement support for user-defined timestamps.
Since column families with and without timestamps can co-exist in the same database,
a transaction may or may not involve timestamps. Based on this observation, we add two
optional members to each `PessimisticTransaction`, `read_timestamp_` and
`commit_timestamp_`. If no key in the transaction's write batch has timestamp, then
setting these two variables do not have any effect. For the rest of this commit, we discuss
only the cases when these two variables are meaningful.

read_timestamp_ is used mainly for validation, and should be set before first call to
`GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls
`TryLock()` that can verify if another transaction has written the same key since
`read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed
written the same key, then validation fails, and RocksDB allows this transaction to
refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()`
with a different timestamp to read, but the result of the read should not be used to
determine data that will be written later.

commit_timestamp_ must be set after finishing writing and before transaction commit.
This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after
prepare phase succeeds.

We currently require that the commit timestamp be chosen after all keys are locked. This
means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used
by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`,
`PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`,
etc. will return non-ok status because they specify timestamps before locking the keys.
Users are also prompted to use the `Transaction` APIs when they receive the non-ok status.

Reviewed By: ltamasi

Differential Revision: D31822445

fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43
2022-03-08 16:20:59 -08:00
Hui Xiao
ca0ef54f16 Rate-limit automatic WAL flush after each user write (#9607)
Summary:
**Context:**
WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR is to provide rate-limiting to auto WAL flush, the one that automatically happen after each user write operation (i.e, `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`.

Note that we are NOT rate-limiting WAL flush that do NOT automatically happen after each user write, such as  `Options::manual_wal_flush == true + manual FlushWAL()` (rate-limiting multiple WAL flushes),  for the benefits of:
- being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515)
- being able to turn off some WAL flush's rate-limiting but not all (e.g, turn off specific the WAL flush of a critical user write like a service's heartbeat)

`WriteOptions::rate_limiter_options` only accept `Env::IO_USER` and `Env::IO_TOTAL` currently due to an implementation constraint.
- The constraint is that we currently queue parallel writes (including WAL writes) based on FIFO policy which does not factor rate limiter priority into this layer's scheduling. If we allow lower priorities such as `Env::IO_HIGH/MID/LOW` and such writes specified with lower priorities occurs before ones specified with higher priorities (even just by a tiny bit in arrival time), the former would have blocked the latter, leading to a "priority inversion" issue and contradictory to what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL`  right now before improving that scheduling.

A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR.

**Summary:**
- Renamed test suite `DBRateLimiterTest to DBRateLimiterOnReadTest` for adding a new test suite
- Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions
- Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607

Test Plan:
- Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`).
- Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush`
- `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100`
   - verified that WAL flush(i.e, system-call _write_) were chunked into 15 bytes and each _write_ was roughly 1 second apart
   - verified the chunking disappeared when `-rate_limit_auto_wal_flush=0`
- crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0  --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal

**Benchmarked on flush/compaction to ensure no performance regression:**
- compaction with rate-limiting  (see table 1, avg over 1280-run):  pre-change: **915635 micros/op**; post-change:
   **907350 micros/op (improved by 0.106%)**
```
#!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt
for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
    NUM_RUN=$(($N*(2**($i-1))))
    for j in $(eval echo "{$START..$NUM_RUN}")
    do
       ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact'
    done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt
done
```
- compaction w/o rate-limiting  (see table 2, avg over 640-run):  pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```
- flush with rate-limiting (see table 3, avg over 320-run, run on the [patch](ee5c6023a9) to augment current db_bench ): pre-change: **745752 micros/op**; post-change: **745331 micros/op (regressed by 0.06 %)**
```
 #!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f flush_bmk_output.txt flush_bmk_output_2.txt

for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
    NUM_RUN=$(($N*(2**($i-1))))
    for j in $(eval echo "{$START..$NUM_RUN}")
    do
       ./db_bench -db=$TEST_TMPDIR -write_buffer_size=1048576000 -num=1000000 -rate_limiter_bytes_per_sec=100000000 -benchmarks=fillseq,flush | egrep 'flush'
    done > flush_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' flush_bmk_output.txt >> flush_bmk_output_2.txt
done

```
- flush w/o rate-limiting (see table 4, avg over 320-run, run on the [patch](ee5c6023a9) to augment current db_bench): pre-change: **487512 micros/op**, post-change: **485856 micors/ops (improved by 0.34%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```

| table 1 - compact with rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
-- | -- | -- | -- | -- | --
10 | 896978 | 16046.9 | 901242 | 15670.9 | 0.475373978
20 | 893718 | 15813 | 886505 | 17544.7 | -0.8070778478
40 | 900426 | 23882.2 | 894958 | 15104.5 | -0.6072681153
80 | 906635 | 21761.5 | 903332 | 23948.3 | -0.3643141948
160 | 898632 | 21098.9 | 907583 | 21145 | 0.9960695813
3.20E+02 | 905252 | 22785.5 | 908106 | 25325.5 | 0.3152713278
6.40E+02 | 905213 | 23598.6 | 906741 | 21370.5 | 0.1688000504
**1.28E+03** | **908316** | **23533.1** | **907350** | **24626.8** | **-0.1063506533**
average over #-run | 901896.25 | 21064.9625 | 901977.125 | 20592.025 | 0.008967217682

| table 2 - compact w/o rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
-- | -- | -- | -- | -- | --
10 | 811211 | 26996.7 | 807586 | 28456.4 | -0.4468627768
20 | 815465 | 14803.7 | 814608 | 28719.7 | -0.105093413
40 | 809203 | 26187.1 | 797835 | 25492.1 | -1.404839082
80 | 822088 | 28765.3 | 822192 | 32840.4 | 0.01265071379
160 | 821719 | 36344.7 | 821664 | 29544.9 | -0.006693285661
3.20E+02 | 820921 | 27756.4 | 821403 | 28347.7 | 0.05871454135
**6.40E+02** | **822197** | **28960.6** | **823148** | **30055.1** | **0.1156657103**
average over #-run | 8.18E+05 | 2.71E+04 | 8.15E+05 | 2.91E+04 |  -0.25

| table 3 - flush with rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
-- | -- | -- | -- | -- | --
10 | 741721 | 11770.8 | 740345 | 5949.76 | -0.1855144994
20 | 735169 | 3561.83 | 743199 | 9755.77 | 1.09226586
40 | 743368 | 8891.03 | 742102 | 8683.22 | -0.1703059588
80 | 742129 | 8148.51 | 743417 | 9631.58| 0.1735547324
160 | 749045 | 9757.21 | 746256 | 9191.86 | -0.3723407806
**3.20E+02** | **745752** | **9819.65** | **745331** | **9840.62** | **-0.0564530836**
6.40E+02 | 749006 | 11080.5 | 748173 | 10578.7 | -0.1112140624
average over #-run | 743741.4286 | 9004.218571 | 744117.5714 | 9090.215714 | 0.05057441238

| table 4 - flush w/o rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 477283 | 24719.6 | 473864 | 12379 | -0.7163464863
20 | 486743 | 20175.2 | 502296 | 23931.3 | 3.195320734
40 | 482846 | 15309.2 | 489820 | 22259.5 | 1.444352858
80 | 491490 | 21883.1 | 490071 | 23085.7 | -0.2887139108
160 | 493347 | 28074.3 | 483609 | 21211.7 | -1.973864238
**3.20E+02** | **487512** | **21401.5** | **485856** | **22195.2** | **-0.3396839462**
6.40E+02 | 490307 | 25418.6 | 485435 | 22405.2 | -0.9936631539
average over #-run | 4.87E+05 | 2.24E+04 | 4.87E+05 | 2.11E+04 | 0.00E+00

Reviewed By: ajkr

Differential Revision: D34442441

Pulled By: hx235

fbshipit-source-id: 4790f13e1e5c0a95ae1d1cc93ffcf69dc6e78bdd
2022-03-08 13:19:39 -08:00
Ezgi Çiçek
27d6ef8e60 Rename mutable_cf_options to signify explicity copy (#9666)
Summary:
Signify explicit copy with comment and better name for variable `mutable_cf_options`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9666

Reviewed By: riversand963

Differential Revision: D34680934

Pulled By: ezgicicek

fbshipit-source-id: b64ef18725fe523835d14ceb4b29bcdfe493f8ed
2022-03-08 11:26:40 -08:00
Jay Zhuang
36aec94d85 compression_per_level should be used for flush and changeable (#9658)
Summary:
- Make `compression_per_level` dynamical changeable with `SetOptions`;
- Fix a bug that `compression_per_level` is not used for flush;

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D34700749

Pulled By: jay-zhuang

fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c
2022-03-07 18:06:19 -08:00
Peter Dillinger
ce60d0cbe5 Test refactoring for Backups+Temperatures (#9655)
Summary:
In preparation for more support for file Temperatures in BackupEngine,
this change does some test refactoring:
* Move DBTest2::BackupFileTemperature test to
BackupEngineTest::FileTemperatures, with some updates to make it work
in the new home. This test will soon be expanded for deeper backup work.
* Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
support sharing because of above moved test, but split off the "no link"
part to the test needing it.
* Use custom FileSystems in backupable_db_test rather than custom Envs,
because going through Env file interfaces doesn't support temperatures.
* Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name
parameter to FsyncWithDirOptions, which was required because this
limitation caused a crash only after moving to higher fidelity of
FileSystem interface (vs. LegacyDirectoryWrapper throwing away some
parameter details)
* `backupable_options_` -> `engine_options_` as part of the ongoing
work to get rid of the obsolete "backupable" naming.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655

Test Plan: test code updates only

Reviewed By: jay-zhuang

Differential Revision: D34622183

Pulled By: pdillinger

fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
2022-03-04 12:32:30 -08:00
Hui Xiao
fc61e98ae6 Attempt to deflake DBLogicalBlockSizeCacheTest.CreateColumnFamilies (#9516)
Summary:
**Context:**
`DBLogicalBlockSizeCacheTest.CreateColumnFamilies` is flaky on a rare occurrence of assertion failure below
```
db/db_logical_block_size_cache_test.cc:210
Expected equality of these values:
  1
  cache_->GetRefCount(cf_path_0_)
    Which is: 2
```

Root-cause: `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));` in the test may not successfully decrease the ref count of `cf_path_0_` since the decreasing only happens in the clean-up of `ColumnFamilyData` when `ColumnFamilyData` has no referencing to it, which may not be true when `db->DestroyColumnFamilyHandle(cfs[0])` is called since background work such as `DumpStats()` can hold reference to that `ColumnFamilyData` (suggested and repro-d by ajkr ). Similar case `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));`.

See following for a deterministic repro:
```
 diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc
index 196b428a3..4e7a834c4 100644
 --- a/db/db_impl/db_impl.cc
+++ b/db/db_impl/db_impl.cc
@@ -956,10 +956,16 @@ void DBImpl::DumpStats() {
         // near-atomically.
         // Get a ref before unlocking
         cfd->Ref();
+        if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
+          TEST_SYNC_POINT("DBImpl::DumpStats:PostCFDRef");
+        }
         {
           InstrumentedMutexUnlock u(&mutex_);
           cfd->internal_stats()->CollectCacheEntryStats(/*foreground=*/false);
         }
+        if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
+          TEST_SYNC_POINT("DBImpl::DumpStats::PreCFDUnrefAndTryDelete");
+        }
         cfd->UnrefAndTryDelete();
       }
     }
 diff --git a/db/db_logical_block_size_cache_test.cc b/db/db_logical_block_size_cache_test.cc
index 1057871c9..c3872c036 100644
 --- a/db/db_logical_block_size_cache_test.cc
+++ b/db/db_logical_block_size_cache_test.cc
@@ -9,6 +9,7 @@
 #include "env/io_posix.h"
 #include "rocksdb/db.h"
 #include "rocksdb/env.h"
+#include "test_util/sync_point.h"

 namespace ROCKSDB_NAMESPACE {
 class EnvWithCustomLogicalBlockSizeCache : public EnvWrapper {
@@ -183,6 +184,15 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
   ASSERT_EQ(1, cache_->GetRefCount(dbname_));

   std::vector<ColumnFamilyHandle*> cfs;
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
+      {{"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH",
+        "DBImpl::DumpStats:StartRunning"},
+       {"DBImpl::DumpStats:PostCFDRef",
+        "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH"},
+       {"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
+        "PostFinishCheckingRef",
+        "DBImpl::DumpStats::PreCFDUnrefAndTryDelete"}});
   ASSERT_OK(db->CreateColumnFamilies(cf_options, {"cf1", "cf2"}, &cfs));
   ASSERT_EQ(2, cache_->Size());
   ASSERT_TRUE(cache_->Contains(dbname_));
@@ -190,7 +200,7 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
   ASSERT_TRUE(cache_->Contains(cf_path_0_));
   ASSERT_EQ(2, cache_->GetRefCount(cf_path_0_));
   }

    // Delete one handle will not drop cache because another handle is still
   // referencing cf_path_0_.
+  TEST_SYNC_POINT(
+      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH");
+  TEST_SYNC_POINT(
+      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH");
   ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));
   ASSERT_EQ(2, cache_->Size());
   ASSERT_TRUE(cache_->Contains(dbname_));
@@ -209,16 +221,20 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
   ASSERT_TRUE(cache_->Contains(cf_path_0_));
    // Will fail
   ASSERT_EQ(1, cache_->GetRefCount(cf_path_0_));

   // Delete the last handle will drop cache.
   ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));
   ASSERT_EQ(1, cache_->Size());
   ASSERT_TRUE(cache_->Contains(dbname_));
   // Will fail
   ASSERT_EQ(1, cache_->GetRefCount(dbname_));

+  TEST_SYNC_POINT(
+      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
+      "PostFinishCheckingRef");
   delete db;
   ASSERT_EQ(0, cache_->Size());
   ASSERT_OK(DestroyDB(dbname_, options,
       {{"cf1", cf_options}, {"cf2", cf_options}}));
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
 }
```

**Summary**
- Removed the flaky assertion
- Clarified the comments for the test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9516

Test Plan:
- CI
- Monitor for future flakiness

Reviewed By: ajkr

Differential Revision: D34055232

Pulled By: hx235

fbshipit-source-id: 9bf83ae5fa88bf6fc829876494d4692082e4c357
2022-03-04 11:35:28 -08:00
Hui Xiao
4a776d81cc Dynamic toggling of BlockBasedTableOptions::detect_filter_construct_corruption (#9654)
Summary:
**Context/Summary:**
As requested, `BlockBasedTableOptions::detect_filter_construct_corruption` can now be dynamically configured using `DB::SetOptions` after this PR

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9654

Test Plan: - New unit test

Reviewed By: pdillinger

Differential Revision: D34622609

Pulled By: hx235

fbshipit-source-id: c06773ef3d029e6bf1724d3a72dffd37a8ec66d9
2022-03-04 10:35:08 -08:00
Yanqin Jin
659a16d52b Fix bug causing incorrect data returned by snapshot read (#9648)
Summary:
This bug affects use cases that meet the following conditions
- (has only the default column family or disables WAL) and
- has at least one event listener
- atomic flush is NOT affected.

If the above conditions meet, then RocksDB can release the db mutex before picking all the
existing memtables to flush. In the meantime, a snapshot can be created and db's sequence
number can still be incremented. The upcoming flush will ignore this snapshot.
A later read using this snapshot can return incorrect result.

To fix this issue, we call the listeners callbacks after picking the memtables so that we avoid
creating snapshots during this interval.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9648

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D34555456

Pulled By: riversand963

fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee
2022-03-02 21:03:14 -08:00
Yuriy Chernyshov
73fd589b1a Do not rely on ADL when invoking std::max_element (#9608)
Summary:
Certain STLs use raw pointers and ADL does not work for them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9608

Reviewed By: ajkr

Differential Revision: D34583012

Pulled By: riversand963

fbshipit-source-id: 7de6bbc8a080c3e7243ce0d758fe83f1663168aa
2022-03-02 17:41:02 -08:00
Jay Zhuang
db8647969d Unschedule manual compaction from thread-pool queue (#9625)
Summary:
PR https://github.com/facebook/rocksdb/issues/9557 introduced a race condition between manual compaction
foreground thread and background compaction thread.
This PR adds the ability to really unschedule manual compaction from
thread-pool queue by differentiate tag name for manual compaction and
other tasks.
Also fix an issue that db `close()` didn't cancel the manual compaction thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9625

Test Plan: unittest not hang

Reviewed By: ajkr

Differential Revision: D34410811

Pulled By: jay-zhuang

fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c
2022-03-02 13:43:00 -08:00
sdong
33742c2a9f Remove BlockBasedTableOptions.hash_index_allow_collision (#9454)
Summary:
BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it for preparing 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454

Test Plan: Run all existing tests.

Reviewed By: ajkr

Differential Revision: D33805827

fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d
2022-03-01 13:58:02 -08:00
Andrew Kryczka
9983eecdfb Dedicate cacheline for DB mutex (#9637)
Summary:
We found a case of cacheline bouncing due to writers locking/unlocking `mutex_` and readers accessing `block_cache_tracer_`. We discovered it only after the issue was fixed by https://github.com/facebook/rocksdb/issues/9462 shifting the `DBImpl` members such that `mutex_` and `block_cache_tracer_` were naturally placed in separate cachelines in our regression testing setup. This PR forces the cacheline alignment of `mutex_` so we don't accidentally reintroduce the problem.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9637

Reviewed By: riversand963

Differential Revision: D34502233

Pulled By: ajkr

fbshipit-source-id: 46aa313b7fe83e80c3de254e332b6fb242434c07
2022-02-27 11:36:54 -08:00
Hui Xiao
87a8b3c8af Deflake DBErrorHandlingFSTest.MultiCFWALWriteError (#9496)
Summary:
**Context:**
As part of https://github.com/facebook/rocksdb/pull/6949, file deletion is disabled for faulty database on the IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed](e66199d848 (diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R396)). Before re-enabling file deletion, it `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `version_` in the recovery process.

However, this is not necessarily true due to `s = error_handler_.ClearBGError();` happening before that assertion can unblock some foreground thread by [`EventHelpers::NotifyOnErrorRecoveryEnd()`](3122cb4358/db/error_handler.cc (L552-L553)) as part of the `ClearBGError()`. That foreground thread can do whatever it wants including closing/reopening the db and clean up that same `versions_`.

As a consequence,  `assert(versions_->io_status().ok());`, will access `io_status()` of a nullptr and test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to [reopen the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494), where [`versions_` gets reset to nullptr](https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl.cc?fbclid=IwAR2uRhwBiPKgmE9q_6CM2mzbfwjoRgsGpXOrHruSJUDcAKc9rYZtVSvKdOY#L678) as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok()); ` gets excuted in the background thread, then we can see error like
```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```

**Summary:**
- I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up foreground, which I think is right before we LOG `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
   - As the context,  the orignal https://github.com/facebook/rocksdb/pull/3997  introducing `DBImpl::Resume()` calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` while the later https://github.com/facebook/rocksdb/pull/6949 distances these two calls a bit.
   - And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);` at least syntax-wise since these two functions are orthogonal. And it also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which basically is resetting some state variables.
- In addition, to preserve the previous behavior of  https://github.com/facebook/rocksdb/pull/6949 where status of re-enabling file deletion is not taken account into the general status of resuming the db, I separated `enable_file_deletion_s` from the general `s`
- In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` more clear, I separated it into its own if-block.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9496

Test Plan:
- Manually reproduce the assertion failure in`DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting sleep like below so that it's more likely for `assert(versions_->io_status().ok());` to execute after [reopening the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494) in the foreground (i.e, testing) thread
```
sleep(1);
assert(versions_->io_status().ok());
```
   `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
   ```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
#0   rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext)  /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421
https://github.com/facebook/rocksdb/issues/1   rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
https://github.com/facebook/rocksdb/issues/2   rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError()       /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
https://github.com/facebook/rocksdb/issues/3   rocksdb/error_handler_fs_test()
   ```
- The assertion failure does not happen with PR
`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
`[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms)  `

Reviewed By: riversand963, anand1976

Differential Revision: D33990099

Pulled By: hx235

fbshipit-source-id: 2e0259a471fa8892ff177da91b3e1c0792dd7bab
2022-02-25 14:44:46 -08:00
Siddhartha Roychowdhury
21345d2823 Streaming Compression API for WAL compression. (#9619)
Summary:
Implement a streaming compression API (compress/uncompress) to use for WAL compression. The log_writer would use the compress class/API to compress a record before writing it out in chunks. The log_reader would use the uncompress class/API to uncompress the chunks and combine into a single record.

Added unit test to verify the API for different sizes/compression types.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9619

Test Plan: make -j24 check

Reviewed By: anand1976

Differential Revision: D34437346

Pulled By: sidroyc

fbshipit-source-id: b180569ad2ddcf3106380f8758b556cc0ad18382
2022-02-23 23:45:04 -08:00
Yanqin Jin
6f12599863 Support WBWI for keys having timestamps (#9603)
Summary:
This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps
and reading the keys back. **The index does not have timestamps.**

Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it.
When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before
comparison.

When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be
provided to query the db. The assumption is that data in the batch must be newer than data from the db.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D34354849

Pulled By: riversand963

fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0
2022-02-22 14:23:01 -08:00
Andrew Kryczka
8ca433f912 Fix test race conditions with OnFlushCompleted() (#9617)
Summary:
We often see flaky tests due to `DB::Flush()` or `DBImpl::TEST_WaitForFlushMemTable()` not waiting until event listeners complete. For example, https://github.com/facebook/rocksdb/issues/9084, https://github.com/facebook/rocksdb/issues/9400, https://github.com/facebook/rocksdb/issues/9528, plus two new ones this week: "EventListenerTest.OnSingleDBFlushTest" and "DBFlushTest.FireOnFlushCompletedAfterCommittedResult". I ran a `make check` with the below race condition-coercing patch and fixed  issues it found besides old BlobDB.

```
 diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 0e1864788..aaba68c4a 100644
 --- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -861,6 +861,8 @@ void DBImpl::NotifyOnFlushCompleted(
        mutable_cf_options.level0_stop_writes_trigger);
   // release lock while notifying events
   mutex_.Unlock();
+  bg_cv_.SignalAll();
+  sleep(1);
   {
     for (auto& info : *flush_jobs_info) {
       info->triggered_writes_slowdown = triggered_writes_slowdown;
```

The reason I did not fix old BlobDB issues is because it appears to have a fundamental (non-test) issue. In particular, it uses an EventListener to keep track of the files. OnFlushCompleted() could be delayed until even after a compaction involving that flushed file completes, causing the compaction to unexpectedly delete an untracked file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9617

Test Plan: `make check` including the race condition coercing patch

Reviewed By: hx235

Differential Revision: D34384022

Pulled By: ajkr

fbshipit-source-id: 2652ded39b415277c5d6a628414345223930514e
2022-02-22 12:23:00 -08:00
Andrew Kryczka
3379d1466f Fix DBTest2.BackupFileTemperature memory leak (#9610)
Summary:
Valgrind was failing with the below error because we forgot to destroy
the `BackupEngine` object:

```
==421173== Command: ./db_test2 --gtest_filter=DBTest2.BackupFileTemperature
==421173==
Note: Google Test filter = DBTest2.BackupFileTemperature
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBTest2
[ RUN      ] DBTest2.BackupFileTemperature
--421173-- WARNING: unhandled amd64-linux syscall: 425
--421173-- You may be able to write your own handler.
--421173-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--421173-- Nevertheless we consider this a bug.  Please report
--421173-- it at http://valgrind.org/support/bug_reports.html.
[       OK ] DBTest2.BackupFileTemperature (3366 ms)
[----------] 1 test from DBTest2 (3371 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3413 ms total)
[  PASSED  ] 1 test.
==421173==
==421173== HEAP SUMMARY:
==421173==     in use at exit: 13,042 bytes in 195 blocks
==421173==   total heap usage: 26,022 allocs, 25,827 frees, 27,555,265 bytes allocated
==421173==
==421173== 8 bytes in 1 blocks are possibly lost in loss record 6 of 167
==421173==    at 0x4838DBF: operator new(unsigned long) (vg_replace_malloc.c:344)
==421173==    by 0x8D4606: allocate (new_allocator.h:114)
==421173==    by 0x8D4606: allocate (alloc_traits.h:445)
==421173==    by 0x8D4606: _M_allocate (stl_vector.h:343)
==421173==    by 0x8D4606: reserve (vector.tcc:78)
==421173==    by 0x8D4606: rocksdb::BackupEngineImpl::Initialize() (backupable_db.cc:1174)
==421173==    by 0x8D5473: Initialize (backupable_db.cc:918)
==421173==    by 0x8D5473: rocksdb::BackupEngine::Open(rocksdb::BackupEngineOptions const&, rocksdb::Env*, rocksdb::BackupEngine**) (backupable_db.cc:937)
==421173==    by 0x50AC8F: Open (backup_engine.h:585)
==421173==    by 0x50AC8F: rocksdb::DBTest2_BackupFileTemperature_Test::TestBody() (db_test2.cc:6996)
...
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9610

Test Plan:
```
$ make -j24 ROCKSDBTESTS_SUBSET=db_test2 valgrind_check_some
```

Reviewed By: akankshamahajan15

Differential Revision: D34371210

Pulled By: ajkr

fbshipit-source-id: 68154fcb0c51b28222efa23fa4ee02df8d925a18
2022-02-21 19:23:19 -08:00
Jay Zhuang
d3a2f284d9 Add Temperature info in NewSequentialFile() (#9499)
Summary:
Add Temperature hints information from RocksDB in API
`NewSequentialFile()`. backup and checkpoint operations need to open the
source files with `NewSequentialFile()`, which will have the temperature
hints. Other operations are not covered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499

Test Plan: Added unittest

Reviewed By: pdillinger

Differential Revision: D34006115

Pulled By: jay-zhuang

fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f
2022-02-18 18:23:07 -08:00
Bo Wang
67f071fade Fixes #9565 (#9586)
Summary:
[Compaction::IsTrivialMove](a2b9be42b6/db/compaction/compaction.cc (L318)) checks whether allow_trivial_move is set, and if so it returns the value of is_trivial_move_. The allow_trivial_move option is there for universal compaction. So when this is set and leveled compaction is enabled, then useful code that follows this block never gets a chance to run.

A check that [compaction_style == kCompactionStyleUniversal](320d9a8e8a/db/db_impl/db_impl_compaction_flush.cc (L1030)) should be added to avoid doing the wrong thing for leveled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9586

Test Plan:
To reproduce this:
First edit db/compaction/compaction.cc with
```
 diff --git a/db/compaction/compaction.cc b/db/compaction/compaction.cc
index 7ae50b91e..52dd489b1 100644
 --- a/db/compaction/compaction.cc
+++ b/db/compaction/compaction.cc
@@ -319,6 +319,8 @@ bool Compaction::IsTrivialMove() const {
   // input files are non overlapping
   if ((mutable_cf_options_.compaction_options_universal.allow_trivial_move) &&
       (output_level_ != 0)) {
+    printf("IsTrivialMove:: return %d because universal allow_trivial_move\n", (int) is_trivial_move_);
+    // abort();
     return is_trivial_move_;
   }
```

And then run
```
./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1641328309 --universal_allow_trivial_move=1
```
Example output with the debug code added

```
IsTrivialMove:: return 0 because universal allow_trivial_move
IsTrivialMove:: return 0 because universal allow_trivial_move
```

After this PR, the bug is fixed.

Reviewed By: ajkr

Differential Revision: D34350451

Pulled By: gitbw95

fbshipit-source-id: 3232005cc47c40a7e75d316cfc7960beb5bdff3a
2022-02-18 14:23:07 -08:00
Jay Zhuang
f4b2500e12 Add last level and non-last level read statistics (#9519)
Summary:
Add last level and non-last level read statistics:
```
LAST_LEVEL_READ_BYTES,
LAST_LEVEL_READ_COUNT,
NON_LAST_LEVEL_READ_BYTES,
NON_LAST_LEVEL_READ_COUNT,
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9519

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D34062539

Pulled By: jay-zhuang

fbshipit-source-id: 908644c3050878b4234febdc72e3e19d89af38cd
2022-02-18 14:23:07 -08:00
mrambacher
30b08878d8 Make FilterPolicy Customizable (#9590)
Summary:
Make FilterPolicy into a Customizable class.  Allow new FilterPolicy to be discovered through the ObjectRegistry

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590

Reviewed By: pdillinger

Differential Revision: D34327367

Pulled By: mrambacher

fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190
2022-02-18 13:22:31 -08:00
Jay Zhuang
2fbc672732 Add temperature information to the event listener callbacks (#9591)
Summary:
RocksDB try to provide temperature information in the event
listener callbacks. The information is not guaranteed, as some operation
like backup won't have these information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9591

Test Plan: Added unittest

Reviewed By: siying, pdillinger

Differential Revision: D34309339

Pulled By: jay-zhuang

fbshipit-source-id: 4aca4f270f99fa49186d85d300da42594663d6d7
2022-02-18 11:23:18 -08:00
anand76
627deb7ceb Fix some MultiGet batching stats (#9583)
Summary:
The NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats were being recorded only when the last file in a level happened to have hits. They are supposed to be updated for every level. Also, there was some overcounting of GetContextStats. This PR fixes both the problems.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9583

Test Plan: Update the unit test in db_basic_test

Reviewed By: akankshamahajan15

Differential Revision: D34308044

Pulled By: anand1976

fbshipit-source-id: b3b36020fda26ba91bc6e0e47d52d58f4d7f656e
2022-02-17 16:31:41 -08:00
Siddhartha Roychowdhury
39b0d92153 Add record to set WAL compression type if enabled (#9556)
Summary:
When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs.
Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556

Reviewed By: anand1976

Differential Revision: D34308216

Pulled By: sidroyc

fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
2022-02-17 16:19:31 -08:00
Jay Zhuang
f092f0fa5d Add subcompaction event API (#9311)
Summary:
Add event callback for subcompaction and adds a sub_job_id to identify it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9311

Reviewed By: ajkr

Differential Revision: D33892707

Pulled By: jay-zhuang

fbshipit-source-id: 57b5e5e594d61b2112d480c18a79a36751f65a4e
2022-02-17 15:47:10 -08:00
Andrew Kryczka
babe56ddba Add rate limiter priority to ReadOptions (#9424)
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.

`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.

There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).

The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424

Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
  - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
  - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`

Reviewed By: hx235

Differential Revision: D33747386

Pulled By: ajkr

fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
2022-02-16 23:18:14 -08:00
Yanqin Jin
1cda273dc3 Fix a silent data loss for write-committed txn (#9571)
Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time    thread 1                                       bg flush
 |   db->Put("a")
 |   txn = NewTxn()
 |   txn->Put("b", "v")
 |   txn->Prepare()       // writes only to 5.log
 |   db->SwitchMemtable() // memtable 1 has "a"
 |                        // close 5.log,
 |                        // creates 8.log
 |   trigger flush
 |                                                  pick memtable 1
 |                                                  unlock db mutex
 |                                                  write new sst
 |   txn->ctwb->Put("gtid", "1") // writes 8.log
 |   txn->Commit() // writes to 8.log
 |                 // writes to memtable 2
 |                                               compute min_log_number_to_keep_2pc, this
 |                                               will be 8 (incorrect).
 |
 |                                             Purge obsolete wals, including 5.log
 |
 V
```

At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in
memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"},
only the commit marker of txn is in 8.log

The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.

The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571

Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```

Reviewed By: siying

Differential Revision: D34235236

Pulled By: riversand963

fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
2022-02-16 23:08:58 -08:00
Peter Dillinger
1e403a0c6c Fix assertion failure in FastLocalBloomBitsBuilder (#9585)
Summary:
As in

```
db_stress: table/block_based/filter_policy.cc:316: rocksdb::{anonymous}::FastLocalBloomBitsBuilder::FastLocalBloomBitsBuilder(int, std::atomic<long int>*, std::shared_ptr<rocksdb::CacheReservationManager>, bool): Assertion `millibits_per_key >= 1000' failed.
```

This assertion failure was actually happening with our RibbonFilterPolicy
which falls back to Bloom for some cases, often for flush, but was
missing new special logic to skip generating filter for 0 bits per key
case. Fixed by adding the logic in other builtin FilterPolicy
implementations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9585

Test Plan:
Updated db_bloom_filter_test to do more integration testing
of the RibbonFilterPolicy ("auto Ribbon") class, incl regression test
this with SkipFilterOnEssentiallyZeroBpk

Reviewed By: ajkr

Differential Revision: D34295101

Pulled By: pdillinger

fbshipit-source-id: 3488eb207fc1d67bbbd1301313714aa1b6406e6e
2022-02-16 22:43:34 -08:00
Jay Zhuang
31031c0210 Remove deprecated RemoteCompaction API (#9570)
Summary:
Remove deprecated remote compaction APIs
`CompactionService::Start()` and `CompactionService::WaitForComplete()`.
Please use `CompactionService::StartV2()`,
`CompactionService::WaitForCompleteV2()` instead, which provides the
same information plus extra data like priority, db_id, etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9570

Test Plan: CI

Reviewed By: riversand963

Differential Revision: D34255969

Pulled By: jay-zhuang

fbshipit-source-id: c6376eccdd1123f1c42ab53771b5f65f8160c325
2022-02-16 13:25:28 -08:00
Peter Dillinger
8c681087c7 Refactor FilterPolicies toward Customizable (#9567)
Summary:
Some changes to make it easier to make FilterPolicy
customizable. Especially, create distinct classes for the different
testing-only and user-facing built-in FilterPolicy modes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9567

Test Plan:
tests updated, with no intended difference in functionality
tested. No difference in test performance seen as a result of moving to
string-based filter type configuration.

Reviewed By: mrambacher

Differential Revision: D34234694

Pulled By: pdillinger

fbshipit-source-id: 8a94931a9e04c3bcca863a4f524cfd064aaf0122
2022-02-16 08:30:03 -08:00
Jay Zhuang
a0c569ee1d Cancel manual compaction in thread-pool queue (#9557)
Summary:
Fix `DisableManualCompaction()` has to wait scheduled manual compaction to start the execution to cancel the job.
When a manual compaction in thread-pool queue is cancel, set the job is_canceled to true and clean the resource.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9557

Test Plan: added unittest that will hang without the change

Reviewed By: ajkr

Differential Revision: D34214910

Pulled By: jay-zhuang

fbshipit-source-id: 89dbaee78ddf26eb13ce862c2b15f4a098b36a78
2022-02-15 19:23:01 -08:00
Hui Xiao
443d8ef094 Fix PinSelf() read-after-free in DB::GetMergeOperands() (#9507)
Summary:
**Context:**
Running the new test `DBMergeOperandTest.MergeOperandReadAfterFreeBug` prior to this fix surfaces the read-after-free bug of PinSef() as below:
```
READ of size 8 at 0x60400002529d thread T0
    https://github.com/facebook/rocksdb/issues/5 0x7f199a in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171
    https://github.com/facebook/rocksdb/issues/6 0x7f199a in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1919
    https://github.com/facebook/rocksdb/issues/7 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203

freed by thread T0 here:
    https://github.com/facebook/rocksdb/issues/3 0x1191399 in rocksdb::cache_entry_roles_detail::RegisteredDeleter<rocksdb::Block, (rocksdb::CacheEntryRole)0>::Delete(rocksdb::Slice const&, void*) cache/cache_entry_roles.h:99
    https://github.com/facebook/rocksdb/issues/4 0x719348 in rocksdb::LRUHandle::Free() cache/lru_cache.h:205
    https://github.com/facebook/rocksdb/issues/5 0x71047f in rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*, bool) cache/lru_cache.cc:547
    https://github.com/facebook/rocksdb/issues/6 0xa78f0a in rocksdb::Cleanable::DoCleanup() include/rocksdb/cleanable.h:60
    https://github.com/facebook/rocksdb/issues/7 0xa78f0a in rocksdb::Cleanable::Reset() include/rocksdb/cleanable.h:38
    https://github.com/facebook/rocksdb/issues/8 0xa78f0a in rocksdb::PinnedIteratorsManager::ReleasePinnedData() db/pinned_iterators_manager.h:71
    https://github.com/facebook/rocksdb/issues/9 0xd0c21b in rocksdb::PinnedIteratorsManager::~PinnedIteratorsManager() db/pinned_iterators_manager.h:24
    https://github.com/facebook/rocksdb/issues/10 0xd0c21b in rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) db/pinned_iterators_manager.h:22
    https://github.com/facebook/rocksdb/issues/11 0x7f0fdf in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1886
    https://github.com/facebook/rocksdb/issues/12 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203

previously allocated by thread T0 here:
    https://github.com/facebook/rocksdb/issues/1 0x1239896 in rocksdb::AllocateBlock(unsigned long, **rocksdb::MemoryAllocator*)** memory/memory_allocator.h:35
    https://github.com/facebook/rocksdb/issues/2 0x1239896 in rocksdb::BlockFetcher::CopyBufferToHeapBuf() table/block_fetcher.cc:171
    https://github.com/facebook/rocksdb/issues/3 0x1239896 in rocksdb::BlockFetcher::GetBlockContents() table/block_fetcher.cc:206
    https://github.com/facebook/rocksdb/issues/4 0x122eae5 in rocksdb::BlockFetcher::ReadBlockContents() table/block_fetcher.cc:325
    https://github.com/facebook/rocksdb/issues/5 0x11b1f45 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const table/block_based/block_based_table_reader.cc:1503
```
Here is the analysis:
- We have [PinnedIteratorsManager](https://github.com/facebook/rocksdb/blob/6.28.fb/db/version_set.cc#L1980) with `Cleanable` capability in our `Version::Get()` path. It's responsible for managing the life-time of pinned iterator and invoking registered cleanup functions during its own destruction.
  - For example in case above, the merge operands's clean-up gets associated with this manger in [GetContext::push_operand](https://github.com/facebook/rocksdb/blob/6.28.fb/table/get_context.cc#L405). During PinnedIteratorsManager's [destruction](https://github.com/facebook/rocksdb/blob/6.28.fb/db/pinned_iterators_manager.h#L67), the release function associated with those merge operand data is invoked.
**And that's what we see in "freed by thread T955 here" in ASAN.**
- Bug 🐛: `PinnedIteratorsManager` is local to `Version::Get()`  while the data of merge operands need to outlive `Version::Get` and stay till they get [PinSelf()](https://github.com/facebook/rocksdb/blob/6.28.fb/db/db_impl/db_impl.cc#L1905), **which is the read-after-free in ASAN.**
  - This bug is likely to be an overlook of `PinnedIteratorsManager` when developing the API `DB::GetMergeOperands` cuz the current logic works fine with the existing case of getting the *merged value* where the operands do not need to live that long.
- This bug was not surfaced much (even in its unit test) due to the release function associated with the merge operands (which are actually blocks put in cache as you can see in `BlockBasedTable::MaybeReadBlockAndLoadToCache` **in "previously allocated by" in ASAN report**) is a cache entry deleter.
The deleter will call `Cache::Release()` which, for LRU cache, won't immediately deallocate the block based on LRU policy [unless the cache is full or being instructed to force erase](https://github.com/facebook/rocksdb/blob/6.28.fb/cache/lru_cache.cc#L521-L531)
  - `DBMergeOperandTest.MergeOperandReadAfterFreeBug` makes the cache extremely small to force cache full.

**Summary:**
- Fix the bug by align `PinnedIteratorsManager`'s lifetime with the merge operands

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9507

Test Plan:
- New test `DBMergeOperandTest.MergeOperandReadAfterFreeBug`
- db bench on read path
  - Setup (LSM tree with several levels, cache the whole db to avoid read IO, warm cache with readseq to avoid read IO): `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq  -num=1000000 -cache_size=100000000  -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1``TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="readrandom" -num=1000000 -cache_size=100000000 `
  - Actual command run (run 20-run for 20 times and then average the 20-run's average micros/op)
     - `for j in {1..20}; do (for i in {1..20}; do rm -rf /dev/shm/rocksdb/ && TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq,readrandom" -num=1000000 -cache_size=100000000  -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1 | egrep 'readrandom'; done > rr_output_pre.txt && (awk '{sum+=$3; sum_sqrt+=$3^2}END{print sum/20, sqrt(sum_sqrt/20-(sum/20)^2)}' rr_output_pre.txt) >> rr_output_pre_2.txt); done`
  - **Result: Pre-change: 3.79193 micros/op;   Post-change: 3.79528 micros/op (+0.09%)**

(pre-change)sorted avg micros/op of each 20-run | std of micros/op of each 20-run | (post-change) sorted avg micros/op of each 20-run | std of micros/op of each 20-run
-- | -- | -- | --
3.58355 | 0.265209 | 3.48715 | 0.382076
3.58845 | 0.519927 | 3.5832 | 0.382726
3.66415 | 0.452097 | 3.677 | 0.563831
3.68495 | 0.430897 | 3.68405 | 0.495355
3.70295 | 0.482893 | 3.68465 | 0.431438
3.719 | 0.463806 | 3.71945 | 0.457157
3.7393 | 0.453423 | 3.72795 | 0.538604
3.7806 | 0.527613 | 3.75075 | 0.444509
3.7817 | 0.426704 | 3.7683 | 0.468065
3.809 | 0.381033 | 3.8086 | 0.557378
3.80985 | 0.466011 | 3.81805 | 0.524833
3.8165 | 0.500351 | 3.83405 | 0.529339
3.8479 | 0.430326 | 3.86285 | 0.44831
3.85125 | 0.434108 | 3.8717 | 0.544098
3.8556 | 0.524602 | 3.895 | 0.411679
3.8656 | 0.476383 | 3.90965 | 0.566636
3.8911 | 0.488477 | 3.92735 | 0.608038
3.898 | 0.493978 | 3.9439 | 0.524511
3.97235 | 0.515008 | 3.9623 | 0.477416
3.9768 | 0.519993 | 3.98965 | 0.521481

- CI

Reviewed By: ajkr

Differential Revision: D34030519

Pulled By: hx235

fbshipit-source-id: a99ac585c11704c5ed93af033cb29ba0a7b16ae8
2022-02-15 12:25:18 -08:00
Peter Dillinger
e24734f843 Use -Wno-invalid-offsetof instead of dangerous offset_of hack (#9563)
Summary:
After https://github.com/facebook/rocksdb/issues/9515 added a unique_ptr to Status, we see some
warnings-as-error in some internal builds like this:

```
stderr: rocksdb/src/db/compaction/compaction_job.cc:2839:7: error:
offset of on non-standard-layout type 'struct CompactionServiceResult'
[-Werror,-Winvalid-offsetof]
     {offsetof(struct CompactionServiceResult, status),
      ^                                        ~~~~~~
```

I see three potential solutions to resolving this:

* Expand our use of an idiom that works around the warning (see offset_of
functions removed in this change, inspired by
https://gist.github.com/graphitemaster/494f21190bb2c63c5516)  However,
this construction is invoking undefined behavior that assumes consistent
layout with no compiler-introduced indirection. A compiler incompatible
with our assumptions will likely compile the code and exhibit undefined
behavior.
* Migrate to something in place of offset, like a function mapping
CompactionServiceResult* to Status* (for the `status` field). This might
be required in the long term.
* **Selected:** Use our new C++17 dependency to use offsetof in a well-defined way
when the compiler allows it. From a comment on
https://gist.github.com/graphitemaster/494f21190bb2c63c5516:

> A final note: in C++17, offsetof is conditionally supported, which
> means that you can use it on any type (not just standard layout
> types) and the compiler will error if it can't compile it correctly.
> That appears to be the best option if you can live with C++17 and
> don't need constexpr support.

The C++17 semantics are confirmed on
https://en.cppreference.com/w/cpp/types/offsetof, so we can suppress the
warning as long as we accept that we might run into a compiler that
rejects the code, and at that point we will find a solution, such as
the more intrusive "migrate" solution above.

Although this is currently only showing in our buck build, it will
surely show up also with make and cmake, so I have updated those
configurations as well.

Also in the buck build, -Wno-expansion-to-defined does not appear to be
needed anymore (both current compiler configurations) so I
removed it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9563

Test Plan: Tried out buck builds with both current compiler configurations

Reviewed By: riversand963

Differential Revision: D34220931

Pulled By: pdillinger

fbshipit-source-id: d39436008259bd1eaaa87c77be69fb2a5b559e1f
2022-02-15 09:19:19 -08:00
Peter Dillinger
479eb1aad6 Hide deprecated, inefficient block-based filter from public API (#9535)
Summary:
This change removes the ability to configure the deprecated,
inefficient block-based filter in the public API. Options that would
have enabled it now use "full" (and optionally partitioned) filters.
Existing block-based filters can still be read and used, and a "back
door" way to build them still exists, for testing and in case of trouble.

About the only way this removal would cause an issue for users is if
temporary memory for filter construction greatly increases. In
HISTORY.md we suggest a few possible mitigations: partitioned filters,
smaller SST files, or setting reserve_table_builder_memory=true.

Or users who have customized a FilterPolicy using the
CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade
their code. (It's long past time for people to move to the new
builder/reader customization interface.)

This change also introduces some internal-use-only configuration strings
for testing specific filter implementations while bypassing some
compatibility / intelligence logic. This is intended to hint at a path
toward making FilterPolicy Customizable, but it also gives us a "back
door" way to configure block-based filter.

Aside: updated db_bench so that -readonly implies -use_existing_db

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535

Test Plan:
Unit tests updated. Specifically,

* BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back
door configuration interface and ignoring of `use_block_based_builder`.
* BlockBasedTableTest.TracingGetTest is migrated from testing
block-based filter access pattern to full filter access patter, by
re-ordering some things.
* Options test (pretty self-explanatory)

Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30`

Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB
With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB
No consistent difference with fillrandom

Reviewed By: jay-zhuang

Differential Revision: D34153871

Pulled By: pdillinger

fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738
2022-02-12 07:05:57 -08:00
Akanksha Mahajan
5c53b9008f Fix failure in c_test (#9547)
Summary:
When tests are run with TMPD, c_test may fail because TMPD
is not created by the test. It results in IO error: No such file
or directory: While mkdir if missing:
/tmp/rocksdb_test_tmp/rocksdb_c_test-0: No such file or directory

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9547

Test Plan:
make -j32 c_test;
 TEST_TMPDIR=/tmp/rocksdb_test  ./c_test

Reviewed By: riversand963

Differential Revision: D34173298

Pulled By: akankshamahajan15

fbshipit-source-id: 5b5a01f5b842c2487b05b0708c8e9532241db7f8
2022-02-11 10:31:41 -08:00
Ezgi Çiçek
95d9cb8357 Avoid unnecessary copy of sample_slice map (#9551)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9551

Reviewed By: riversand963

Differential Revision: D34169574

Pulled By: ezgicicek

fbshipit-source-id: 2e88db59b65bda269917a9b0bed17181a4afd281
2022-02-11 09:15:27 -08:00
Levi Tamasi
a1203edca4 Rework VersionStorageInfo::ComputeFilesMarkedForForcedBlobGC a bit (#9548)
Summary:
We had a bug in `VersionStorageInfo::ComputeFilesMarkedForForcedBlobGC`
related to the edge case where all blob files are part of the "oldest batch",
i.e. where only the very oldest file has any linked SSTs. (See https://github.com/facebook/rocksdb/issues/9542)
This PR tries to make the logic in this method clearer and also adds a unit test
for the problematic case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9548

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D34158959

Pulled By: ltamasi

fbshipit-source-id: fbab6d749c569728382aa04f7b7c60c92cca7650
2022-02-11 08:41:41 -08:00
Levi Tamasi
073ac54739 Log blob file space amp and expose it via the rocksdb.blob-stats DB property (#9538)
Summary:
Extend the periodic statistics in the info log with the total amount of garbage
in blob files and the space amplification pertaining to blob files, where the
latter is defined as `total_blob_file_size / (total_blob_file_size - total_blob_garbage_size)`.
Also expose the space amp via the `rocksdb.blob-stats` DB property.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9538

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D34126855

Pulled By: ltamasi

fbshipit-source-id: 3153e7a0fe0eca440322db273f4deaabaccc51b2
2022-02-10 12:42:11 -08:00
Levi Tamasi
b2423f8dde Fix off-by-one bug in VersionStorageInfo::ComputeFilesMarkedForForcedBlobGC (#9542)
Summary:
Fixes a bug introduced in https://github.com/facebook/rocksdb/issues/9526 where we index one position past the
end of a `vector`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9542

Test Plan:
`make asan_check`

Will add a unit test in a separate PR.

Reviewed By: akankshamahajan15

Differential Revision: D34145825

Pulled By: ltamasi

fbshipit-source-id: 4e87c948407dee489d669a3e41f59e2fcc1228d8
2022-02-10 11:13:25 -08:00
Hui Xiao
c5cd31c12b Fix TSAN data race in EventListenerTest.MultiCF (#9528)
Summary:
**Context:**
`EventListenerTest.MultiCF` occasionally failed on TSAN data race as below:
```
WARNING: ThreadSanitizer: data race (pid=2047633)
  Read of size 8 at 0x7b6000001440 by main thread:
    #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:916:40 (listener_test+0x52337c)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:384:7 (listener_test+0x52337c)

  Previous write of size 8 at 0x7b6000001440 by thread T2:
    #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB**, std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> > >, rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/vector.tcc:503:31 (listener_test+0x550654)
    https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:1195:4 (listener_test+0x550654)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18 (listener_test+0x550654)
```

After investigation, it is due to the following:
(1) `ASSERT_OK(Flush(i));` before the read `std::vector::size()` is supposed to be [blocked on `DB::Impl::bg_cv_` for memtable flush to finish](320d9a8e8a/db/db_impl/db_impl_compaction_flush.cc (L2319)) and get signaled [at the end of background flush ](320d9a8e8a/db/db_impl/db_impl_compaction_flush.cc (L2830)), which happens after the write `std::vector::push_back()` . So the sequence of execution should have been synchronized as `call flush() -> write -> return from flush() -> read` and would not cause any TSAN data race.
- The subsequent `ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());` serves a similar purpose based on [the previous attempt to deflake the test.](https://github.com/facebook/rocksdb/pull/9084)

(2) However, there are multiple places in the code can signal this `DB::Impl::bg_cv_` and mistakenly wake up `ASSERT_OK(Flush(i));`  (or `ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());`) too early (and with the lock available to them), resulting in non-synchronized read and write thus a TSAN data race.
- Reproduced by the following, suggested by ajkr:
```
 diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 4ff87c1e4..52492e9cf 100644
 --- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -22,7 +22,7 @@
 #include "test_util/sync_point.h"
 #include "util/cast_util.h"
 #include "util/concurrent_task_limiter_impl.h"
 namespace ROCKSDB_NAMESPACE {

 bool DBImpl::EnoughRoomForCompaction(
@@ -855,6 +855,7 @@ void DBImpl::NotifyOnFlushCompleted(
        mutable_cf_options.level0_stop_writes_trigger);
   // release lock while notifying events
   mutex_.Unlock();
+  bg_cv_.SignalAll();
```

**Summary:**
- Added synchornization between read and write by ` ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency()` mechanism

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9528

Test Plan:
`./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10`
- pre-fix:
```
Repeating all tests (iteration 3)
Note: Google Test filter = EventListenerTest.MultiCF
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from EventListenerTest
[ RUN      ] EventListenerTest.MultiCF
==================
WARNING: ThreadSanitizer: data race (pid=3377137)
  Read of size 8 at 0x7b6000000840 by main thread:
    #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size()
    https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() db/listener_test.cc:384 (listener_test+0x4bb300)

  Previous write of size 8 at 0x7b6000000840 by thread T2:
    #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB**, std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> > >, rocksdb::DB* const&)
    https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) db/listener_test.cc:255 (listener_test+0x4e820f)
```
- post-fix: `All passed`

Reviewed By: ajkr

Differential Revision: D34085791

Pulled By: hx235

fbshipit-source-id: f877aa687ea1d5cb6f31ef8c4772625d22868e8b
2022-02-10 10:21:25 -08:00
Levi Tamasi
320d9a8e8a Use a sorted vector instead of a map to store blob file metadata (#9526)
Summary:
The patch replaces `std::map` with a sorted `std::vector` for
`VersionStorageInfo::blob_files_` and preallocates the space
for the `vector` before saving the `BlobFileMetaData` into the
new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`.
These changes reduce the time the DB mutex is held while
saving new `Version`s, and using a sorted `vector` also makes
lookups faster thanks to better memory locality.

In addition, the patch introduces helper methods
`VersionStorageInfo::GetBlobFileMetaData` and
`VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by
clients to perform lookups in the `vector`, and does some general
cleanup in the parts of code where blob file metadata are used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526

Test Plan:
Ran `make check` and the crash test script for a while.

Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced:

```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value>
```

Final statistics before the patch:

```
Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s
Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s
```

With the patch:

```
Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s
Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s
```

Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs.

Reviewed By: riversand963

Differential Revision: D34082728

Pulled By: ltamasi

fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 12:36:43 -08:00
Yanqin Jin
685044dff2 Remove timestamp from key in expected state (#9525)
Summary:
The keys as part of write batch read from trace file can contain trailing timestamps.
This PR removes them before calling `ExpectedState`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9525

Test Plan:
make check
make crash_test_with_ts

Reviewed By: ajkr

Differential Revision: D34082358

Pulled By: riversand963

fbshipit-source-id: 78c925659e2a19e4a8278fb4a8ddf5070e265c04
2022-02-09 09:50:54 -08:00
Akanksha Mahajan
9745c68eb1 Remove deprecated option new_table_reader_for_compaction_inputs (#9443)
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443

Test Plan: CircleCI

Reviewed By: ajkr

Differential Revision: D33788508

Pulled By: akankshamahajan15

fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
2022-02-08 19:31:28 -08:00
Peter Dillinger
68a9c186d0 FilterPolicy API changes for 7.0 (#9501)
Summary:
* Inefficient block-based filter is no longer customizable in the public
API, though (for now) can still be enabled.
  * Removed deprecated FilterPolicy::CreateFilter() and
  FilterPolicy::KeyMayMatch()
  * Removed `rocksdb_filterpolicy_create()` from C API
* Change meaning of nullptr return from GetBuilderWithContext() from "use
block-based filter" to "generate no filter in this case." This is a
cleaner solution to the proposal in https://github.com/facebook/rocksdb/issues/8250.
  * Also, when user specifies bits_per_key < 0.5, we now round this down
  to "no filter" because we expect a filter with >= 80% FP rate is
  unlikely to be worth the CPU cost of accessing it (esp with
  cache_index_and_filter_blocks=1 or partition_filters=1).
  * bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP
  rate)
  * This also gives us some support for configuring filters from OPTIONS
  file as currently saved: `filter_policy=rocksdb.BuiltinBloomFilter`.
  Opening from such an options file will enable reading filters (an
  improvement) but not writing new ones. (See Customizable follow-up
  below.)
* Also removed deprecated functions
  * FilterBitsBuilder::CalculateNumEntry()
  * FilterPolicy::GetFilterBitsBuilder()
  * NewExperimentalRibbonFilterPolicy()
* Remove default implementations of
  * FilterBitsBuilder::EstimateEntriesAdded()
  * FilterBitsBuilder::ApproximateNumEntries()
  * FilterPolicy::GetBuilderWithContext()
* Remove support for "filter_policy=experimental_ribbon" configuration
string.
* Allow "filter_policy=bloomfilter:n" without bool to discourage use of
block-based filter.

Some pieces for https://github.com/facebook/rocksdb/issues/9389

Likely follow-up (later PRs):
* Refactoring toward FilterPolicy Customizable, so that we can generate
filters with same configuration as before when configuring from options
file.
* Remove support for user enabling block-based filter (ignore `bool
use_block_based_builder`)
  * Some months after this change, we could even remove read support for
  block-based filter, because it is not critical to DB data
  preservation.
* Make FilterBitsBuilder::FinishV2 to avoid `using
FilterBitsBuilder::Finish` mess and add support for specifying a
MemoryAllocator (for cache warming)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9501

Test Plan:
A number of obsolete tests deleted and new tests or test
cases added or updated.

Reviewed By: hx235

Differential Revision: D34008011

Pulled By: pdillinger

fbshipit-source-id: a39a720457c354e00d5b59166b686f7f59e392aa
2022-02-08 13:56:46 -08:00
satyajanga
036bbab6f7 Use the comparator from the sst file table properties in sst_dump_tool (#9491)
Summary:
We introduced a new Comparator for timestamp in user keys. In the sst_dump_tool by default we use BytewiseComparator to read sst files. This change allows us to read comparator_name from table properties in meta data block and use it to read.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9491

Test Plan:
added unittests for new functionality.
make check
![image](https://user-images.githubusercontent.com/4923556/152915444-28b88a1f-7b4e-47d0-815f-7011552bd9a2.png)
![image](https://user-images.githubusercontent.com/4923556/152916196-bea3d2a1-a3d5-4362-b911-036131b83e8d.png)

Reviewed By: riversand963

Differential Revision: D33993614

Pulled By: satyajanga

fbshipit-source-id: 4b5cf938e6d2cb3931d763bef5baccc900b8c536
2022-02-08 12:15:35 -08:00
Peter Dillinger
5cb137a860 Work around some new clang-analyze failures (#9515)
Summary:
... seen only in internal clang-analyze runs after https://github.com/facebook/rocksdb/issues/9481

* Mostly, this works around falsely reported leaks by using
std::unique_ptr in some places where clang-analyze was getting
confused. (I didn't see any changes in C++17 that could make our Status
implementation leak memory.)
* Also fixed SetBGError returning address of a stack variable.
* Also fixed another false null deref report by adding an assert.

Also, use SKIP_LINK=1 to speed up `make analyze`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9515

Test Plan:
Was able to reproduce the reported errors locally and verify
they're fixed (except SetBGError). Otherwise, existing tests

Reviewed By: hx235

Differential Revision: D34054630

Pulled By: pdillinger

fbshipit-source-id: 38600ef3da75ddca307dff96b7a1a523c2885c2e
2022-02-07 18:24:36 -08:00
Levi Tamasi
0cc0543893 Mitigate the overhead of building the hash of file locations (#9504)
Summary:
The patch builds on the refactoring done in https://github.com/facebook/rocksdb/issues/9494
and improves the performance of building the hash of file
locations in `VersionStorageInfo` in two ways. First, the hash
building is moved from `AddFile` (which is called under the DB mutex)
to a separate post-processing step done as part of `PrepareForVersionAppend`
(during which the mutex is *not* held). Second, the space necessary
for the hash is preallocated to prevent costly reallocation/rehashing
operations. These changes mitigate the overhead of the file location hash,
which can be significant with certain workloads where the baseline CPU usage
is low (see https://github.com/facebook/rocksdb/issues/9351,
which is a workload where keys are sorted, WAL is turned
off, the vector memtable implementation is used, and there are lots of small
SST files).

Fixes https://github.com/facebook/rocksdb/issues/9351

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9504

Test Plan:
`make check`

```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --disable_wal=1 --seed=<some_seed>
```

Final statistics before this patch:
```
Cumulative writes: 0 writes, 697M keys, 0 commit groups, 0.0 writes per commit group, ingest: 283.25 GB, 241.08 MB/s
Interval writes: 0 writes, 1264K keys, 0 commit groups, 0.0 writes per commit group, ingest: 525.69 MB, 176.67 MB/s
```

With the patch:
```
Cumulative writes: 0 writes, 759M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.57 GB, 262.63 MB/s
Interval writes: 0 writes, 1555K keys, 0 commit groups, 0.0 writes per commit group, ingest: 646.61 MB, 215.11 MB/s
```

Reviewed By: riversand963

Differential Revision: D34014734

Pulled By: ltamasi

fbshipit-source-id: acb2703677451d5ccaa7e9d950844b33d240695b
2022-02-07 09:17:14 -08:00
Peter Dillinger
afc280fdfd Enhance new cache key testing & comments (#9329)
Summary:
Follow-up to https://github.com/facebook/rocksdb/issues/9126

Added new unit tests to validate some of the claims of guaranteed uniqueness
within certain large bounds.

Also cleaned up the cache_bench -stress-cache-key tool with better comments
and description.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9329

Test Plan: no changes to production code

Reviewed By: mrambacher

Differential Revision: D33269328

Pulled By: pdillinger

fbshipit-source-id: 3a2b684a6b2b15f79dc872e563e3d16563be26de
2022-02-04 14:15:58 -08:00
Levi Tamasi
42e0751b3a Clean up VersionStorageInfo a bit (#9494)
Summary:
The patch does some cleanup in and around `VersionStorageInfo`:
* Renames the method `PrepareApply` to `PrepareAppend` in `Version`
to make it clear that it is to be called before appending the `Version` to
`VersionSet` (via `AppendVersion`), not before applying any `VersionEdit`s.
* Introduces a helper method `VersionStorageInfo::PrepareForVersionAppend`
(called by `Version::PrepareAppend`) that encapsulates the population of the
various derived data structures in `VersionStorageInfo`, and turns the
methods computing the derived structures (`UpdateNumNonEmptyLevels`,
`CalculateBaseBytes` etc.) into private helpers.
* Changes `Version::PrepareAppend` so it only calls `UpdateAccumulatedStats`
if the `update_stats` flag is set. (Earlier, this was checked by the callee.)
Related to this, it also moves the call to `ComputeCompensatedSizes` to
`VersionStorageInfo::PrepareForVersionAppend`.
* Updates and cleans up `version_builder_test`, `version_set_test`, and
`compaction_picker_test` so `PrepareForVersionAppend` is called anytime
a new `VersionStorageInfo` is set up or saved. This cleanup also involves
splitting `VersionStorageInfoTest.MaxBytesForLevelDynamic`
into multiple smaller test cases.
* Fixes up a bunch of comments that were outdated or just plain incorrect.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9494

Test Plan: Ran `make check` and the crash test script for a while.

Reviewed By: riversand963

Differential Revision: D33971666

Pulled By: ltamasi

fbshipit-source-id: fda52faac7783041126e4f8dec0fe01bdcadf65a
2022-02-04 08:19:20 -08:00
Baptiste Lemaire
bec9ab4316 Remove deprecated option DBOptions::max_mem_compaction_level (#9446)
Summary:
In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9446

Reviewed By: ajkr

Differential Revision: D33793048

Pulled By: bjlemaire

fbshipit-source-id: 73316efdb194e90225005246673dae99e65577ae
2022-02-04 05:32:28 -08:00
Yanqin Jin
629e3e1d77 Fix spelling in public API (#9490)
Summary:
I feel it would be nice if we can fix this spelling error.

In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490

Test Plan: make check

Reviewed By: hx235

Differential Revision: D33949862

Pulled By: riversand963

fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466
2022-02-03 15:15:23 -08:00
Yanqin Jin
3122cb4358 Revise APIs related to user-defined timestamp (#8946)
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".

According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).

For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.

Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.

The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.

Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946

Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0

Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```

Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.831 micros/op 546235 ops/sec;   60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.820 micros/op 549404 ops/sec;   60.8 MB/s
```

Reviewed By: ltamasi

Differential Revision: D33721359

Pulled By: riversand963

fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
2022-02-01 22:19:01 -08:00
Hui Xiao
920386f2b7 Detect (new) Bloom/Ribbon Filter construction corruption (#9342)
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum

This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.

Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.

**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
   - See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
   -  Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342

Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
   - For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true  -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
   -  FastLocalBloom
       - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
       - After change:
          -  `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
    -  Standard128Ribbon
       - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
       - After change:
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.

Reviewed By: pdillinger

Differential Revision: D33746928

Pulled By: hx235

fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
2022-02-01 17:42:35 -08:00
Peter Dillinger
a495448eea Revisit #9118 for compaction outputs (#9480)
Summary:
Crash test recently started showing failures as in https://github.com/facebook/rocksdb/issues/9118 but
for files created by compaction. This change applies a similar fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9480

Test Plan:
Updated / extended unit test. (Some re-arranging to do the
simpler compaction testing before this special case.)

Reviewed By: ltamasi

Differential Revision: D33909835

Pulled By: pdillinger

fbshipit-source-id: 58e4b44e4ecc2d21e4df2c2d8440ec0633aa1f6c
2022-02-01 11:08:34 -08:00
Jay Zhuang
980b9ff385 Add more micro-benchmark tests (#9436)
Summary:
* Add more micro-benchmark tests
* Expose an API in DBImpl for waiting for compactions (still not visible to the user)
* Add argument name for ribbon_bench
* remove benchmark run from CI, as it runs too long.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9436

Test Plan: CI

Reviewed By: riversand963

Differential Revision: D33777836

Pulled By: jay-zhuang

fbshipit-source-id: c05de3bc082cc05b5d019f00b324e774bf4bbd96
2022-02-01 09:01:55 -08:00
Peter Dillinger
f6d7ec1d02 Ignore total_order_seek in DB::Get (#9427)
Summary:
Apparently setting total_order_seek=true for DB::Get was
intended to allow accurate read semantics if the current prefix
extractor doesn't match what was used to generate SST files on
disk. But since prefix_extractor was made a mutable option in 5.14.0, we
have been able to detect this case and provide the correct semantics
regardless of the total_order_seek option. Since that time, the option
has only made Get() slower in a reasonably common case: prefix_extractor
unchanged and whole_key_filtering=false.

So this change primarily removes unnecessary effect of
total_order_seek on Get. Also cleans up some related comments.

Also adds a -total_order_seek option to db_bench and canonicalizes
handling of ReadOptions in db_bench so that command line options have
the expected association with library features. (There is potential
for change in regression test behavior, but the old behavior is likely
indefensible, or some other inconsistency would need to be fixed.)

TODO in follow-up work: there should be no reason for Get() to depend on
current prefix extractor at all.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427

Test Plan:
Unit tests updated.

Performance (using db_bench update)

Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0`

Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`

Before this change, total_order_seek=false: 25188 ops/sec
Before this change, total_order_seek=true:   1222 ops/sec (~20x slower)

After this change, total_order_seek=false: 24570 ops/sec
After this change, total_order_seek=true:  25012 ops/sec (indistinguishable)

Reviewed By: siying

Differential Revision: D33753458

Pulled By: pdillinger

fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
2022-01-31 19:46:42 -08:00
Hui Xiao
42cca28ebb Remove deprecated API AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds (#9455)
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455

Test Plan: Rely on my eyeball and CI

Reviewed By: ajkr

Differential Revision: D33811664

Pulled By: hx235

fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
2022-01-28 16:47:08 -08:00
Yanqin Jin
d10c5c08d3 Remove iter_start_seqnum and preserve_deletes (#9430)
Summary:
According to https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl.cc#L2896:L2911 and https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl_open.cc#L203:L208,
we are going to remove `iter_start_seqnum` and `preserve_deletes` starting from RocksDB 7.0

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9430

Test Plan: make check and CI

Reviewed By: ajkr

Differential Revision: D33753639

Pulled By: riversand963

fbshipit-source-id: c80aab8e8d8fc33e52472fed524ed703d0ffc8b6
2022-01-28 13:28:38 -08:00
Akanksha Mahajan
74ccd1931e Remove deprecated option DBOptions::skip_log_error_on_recovery (#9434)
Summary:
In  RocksDB DBOptions::skip_log_error_on_recovery is marked as
"NOT SUPPORTED" for a long time, and setting this option does not have
any effect on the behavior of RocksDB library. Therefore, we are removing it
in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9434

Test Plan: CircleCI

Reviewed By: ajkr

Differential Revision: D33763015

Pulled By: akankshamahajan15

fbshipit-source-id: 11f09643298da6c02d3dcdb090b996f4c3cfdd76
2022-01-28 01:46:04 -08:00
Jay Zhuang
22321e1027 Remove unused API base_background_compactions (#9462)
Summary:
The API is deprecated long time ago. Clean up the codebase by
removing it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462

Test Plan: CI, fake release: D33835220

Reviewed By: riversand963

Differential Revision: D33835103

Pulled By: jay-zhuang

fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184
2022-01-27 21:05:18 -08:00
Yanqin Jin
dd203ed604 Disallow a combination of options (#9348)
Summary:
Disallow `immutable_db_opts.use_direct_io_for_flush_and_compaction == true` and
`mutable_db_opts.writable_file_max_buffer_size == 0`, since it causes `WritableFileWriter::Append()`
to loop forever and does not make much sense in direct IO.

This combination of options itself does not make much sense: asking RocksDB to do direct IO but not allowing
RocksDB to allocate a buffer. We should detect this false combination and warn user early, no matter whether
the application is running on a platform that supports direct IO or not. In the case of platform **not** supporting
direct IO, it's ok if the user learns about this and then finds that direct IO is not supported.

One tricky thing: the constructor of `WritableFileWriter` is being used in our unit tests, and it's impossible
to return status code from constructor. Since we do not throw, I put an assertion for now. Fortunately,
the constructor is not exposed to external applications.

Closing https://github.com/facebook/rocksdb/issues/7109

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9348

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D33371924

Pulled By: riversand963

fbshipit-source-id: 2a3701ab541cee23bffda8a36cdf37b2d235edfa
2022-01-27 19:30:24 -08:00
Peter Dillinger
78aee6fedc Remove obsolete backupable_db.h, utility_db.h (#9438)
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.

In reference to https://github.com/facebook/rocksdb/issues/9389

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438

Test Plan: CI

Reviewed By: mrambacher

Differential Revision: D33780269

Pulled By: pdillinger

fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
2022-01-27 15:45:30 -08:00
Peter Dillinger
ea89c77f27 Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453)
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.

This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)

Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453

Test Plan:
new unit test added. Added memtable Bloom to crash test.

Performance testing
--------------------

Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```

Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)

Before & after run simultaneously, average over 10 iterations per data point, ops/sec.

MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066

MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990

MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before:  12081715
After: 12117603

MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085

MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029

MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591

Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.

Reviewed By: ajkr

Differential Revision: D33805025

Pulled By: pdillinger

fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
2022-01-27 14:55:04 -08:00
Hui Xiao
1e0e883ca5 Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452)
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452

Test Plan: Rely on my eyeball and CI

Reviewed By: ajkr

Differential Revision: D33804938

Pulled By: hx235

fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
2022-01-27 13:01:09 -08:00
Baptiste Lemaire
92822655fd Remove deprecated table_cache_remove_scan_count_limit option. (#9450)
Summary:
In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9450

Reviewed By: ajkr

Differential Revision: D33802466

Pulled By: bjlemaire

fbshipit-source-id: 97570985f1400525304053476450f7ef504c0cd5
2022-01-27 09:33:31 -08:00
Siddhartha Roychowdhury
c27ca23644 Add option for WAL compression algorithm (#9432)
Summary:
Add an option to set the WAL compression algorithm - wal_compression.

TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432

Reviewed By: pdillinger

Differential Revision: D33797275

Pulled By: sidroyc

fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7
2022-01-26 14:23:00 -08:00
Jay Zhuang
961d8dacf2 Remove unused option purge_redundant_kvs_while_flush (#9429)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9429

Test Plan: fake release for test: D33754513

Reviewed By: riversand963

Differential Revision: D33753637

Pulled By: jay-zhuang

fbshipit-source-id: 18db4701e8f28dda8f1ab660c2be9890a8312c12
2022-01-26 10:24:16 -08:00
Jay Zhuang
022b400cba Make bottommost_temperature dynamically changeable (#9402)
Summary:
Make `AdvancedColumnFamilyOptions.bottommost_temperature`
dynamically changeable with `SetOptions` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9402

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D33674487

Pulled By: jay-zhuang

fbshipit-source-id: 8943768156aa6197c63850a64238a8092527d517
2022-01-25 15:23:04 -08:00
Peter Dillinger
fc9d4071f0 Fast path for detecting unchanged prefix_extractor (#9407)
Summary:
Fixes a major performance regression in 6.26, where
extra CPU is spent in SliceTransform::AsString when reads involve
a prefix_extractor (Get, MultiGet, Seek). Common case performance
is now better than 6.25.

This change creates a "fast path" for verifying that the current prefix
extractor is unchanged and compatible with what was used to
generate a table file. This fast path detects the common case by
pointer comparison on the current prefix_extractor and a "known
good" prefix extractor (if applicable) that is saved at the time the
table reader is opened. The "known good" prefix extractor is saved
as another shared_ptr copy (in an existing field, however) to ensure
the pointer is not recycled.

When the prefix_extractor has changed to a different instance but
same compatible configuration (rare, odd), performance is still a
regression compared to 6.25, but this is likely acceptable because
of the oddity of such a case. The performance of incompatible
prefix_extractor is essentially unchanged.

Also fixed a minor case (ForwardIterator) where a prefix_extractor
could be used via a raw pointer after being freed as a shared_ptr,
if replaced via SetOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407

Test Plan:
## Performance
Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`

Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`

Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load)

v6.26: 4833 vs. 6698 (<- major regression!)
v6.27: 4737 vs. 6397 (still)
New: 6704 vs. 6461 (better than baseline in common case)
Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible)
Changed prefix size (no usable filter) in new: 787 vs. 5927
Changed prefix size (no usable filter) in new & baseline: 773 vs. 784

Reviewed By: mrambacher

Differential Revision: D33677812

Pulled By: pdillinger

fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76
2022-01-21 11:37:46 -08:00
Peter Dillinger
8064a3ac31 Fix flaky EventListenerTest.DisableBGCompaction (#9400)
Summary:
Wasn't able to easily reproduce error, but easy to see a race
condition between TestFlushListener::OnFlushCompleted and
DBTestBase::Close(), which frees CF handles before closing DB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9400

Test Plan: CI etc.

Reviewed By: riversand963

Differential Revision: D33645134

Pulled By: pdillinger

fbshipit-source-id: d0ec914cc43c9e14f53da633876b95b61995138d
2022-01-21 08:25:09 -08:00
Peter Dillinger
5576ded762 Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363)
Summary:
In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out"
of extra checks that can impact peak write performance, which
currently only includes force_consistency_checks. I considered including
some other options but did not see a db_bench performance difference.

Also clarify in comment for force_consistency_checks that it can "slow
down saturated writing."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363

Test Plan:
basic coverage in unit tests

Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see

force_consistency_checks=true -> 725360 ops/s
force_consistency_checks=false -> 783072 ops/s

Reviewed By: mrambacher

Differential Revision: D33636559

Pulled By: pdillinger

fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84
2022-01-18 17:31:03 -08:00
Si Ke
93b1de4f45 Enable db_test running in Centos 32 bit OS and Alpine 32 bit OS (#9294)
Summary:
Closes https://github.com/facebook/rocksdb/issues/9271

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9294

Reviewed By: riversand963, hx235

Differential Revision: D33586002

Pulled By: pdillinger

fbshipit-source-id: 3d1a2fa71023e108613ff03dbd37a5f954fc4920
2022-01-14 11:58:18 -08:00
zhuchong0329
5f2b661f54 FlushMemTable return ok but memtable does not synchronize flush (#8173)
Summary:
Fix https://github.com/facebook/rocksdb/issues/8046 : FlushMemTable return ok but memtable does not synchronize flush. The way to fix it is to expose RecoveryError.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8173

Reviewed By: ajkr

Differential Revision: D31674552

Pulled By: jay-zhuang

fbshipit-source-id: 9d16b69ba12a196bb429332ec8224754de97773d
2022-01-12 13:21:49 -08:00
Yanqin Jin
0376869f05 Remove using namespace (#9369)
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369

Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all

Reviewed By: pdillinger

Differential Revision: D33517260

Pulled By: riversand963

fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
2022-01-12 09:31:12 -08:00
Niklas Fiekas
f8bdd5797f Take compression level_values as const pointer (#9376)
Summary:
Compatible change, more natural (especially in generated Rust bindings), no risk that the API will ever need mutable access because it has to make a copy anyway.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9376

Reviewed By: ajkr

Differential Revision: D33541435

Pulled By: pdillinger

fbshipit-source-id: 15c512a0d70b6e8694fa99d598b7d022751c1e59
2022-01-12 08:34:53 -08:00
Jay Zhuang
9c6fb26033 Fix clang13 build error (#9374)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9374

Test Plan: Add CI for clang13 build

Reviewed By: riversand963

Differential Revision: D33522867

Pulled By: jay-zhuang

fbshipit-source-id: 642756825cf0b51e35861fb847ebaee4611b76ca
2022-01-11 10:36:22 -08:00
mrambacher
1973fcba11 Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362)
Summary:
In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions.  The PatternEntry methods were left in place but renamed to AddFactory.  The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.

Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362

Reviewed By: pdillinger

Differential Revision: D33432562

Pulled By: mrambacher

fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
2022-01-11 06:33:48 -08:00
jsteemann
255aefb628 Add filename to several Corruption messages (#9239)
Summary:
This change adds the filename of the offending filen to several place that produce Status objects with code `kCorruption`.
This is not an attempt to have every Corruption message in the codebase extended with the filename, but it is a start.
The motivation for the change was to quickly diagnose which file is corrupted when a large database is openend and there is not option to copy it offsite for analysis, run strace or install the ldb tool.
In the particular case in question, the error message improved from a mere
```
Corruption: checksum mismatch
```
to
```
Corruption: checksum mismatch in file /path/to/db/engine-rocksdb/MANIFEST-000171
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9239

Reviewed By: jay-zhuang

Differential Revision: D33237742

Pulled By: riversand963

fbshipit-source-id: bd42559cfbf786a0a674d091671d1a2bf07bdd31
2022-01-07 18:09:48 -08:00
Youngjae Lee
3dfee770c6 Remove obsolete function declaration (#8724)
Summary:
Function `Version::UpdateFilesByCompactionPri()` is never called and not implemented.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8724

Reviewed By: ajkr

Differential Revision: D30643943

Pulled By: riversand963

fbshipit-source-id: 174b2d9a2a42e286222909a035cc74a7b5602335
2022-01-07 18:06:10 -08:00
Yanqin Jin
b2e53ab2d8 Add checking for DB::DestroyColumnFamilyHandle() (#9347)
Summary:
Closing https://github.com/facebook/rocksdb/issues/5006

Calling `DB::DestroyColumnFamilyHandle(column_family)` with `column_family` being the return value of
`DB::DefaultColumnFamily()` will return `Status::InvalidArgument()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9347

Test Plan: make check

Reviewed By: akankshamahajan15

Differential Revision: D33369675

Pulled By: riversand963

fbshipit-source-id: a8266a4daddf2b7a773c2dc7f3eb9a4adfb6b6dd
2022-01-05 20:26:53 -08:00