Commit Graph

4725 Commits

Author SHA1 Message Date
gukaifeng
4f37eb4db2 fix a bug of the ticker NO_FILE_OPENS (#9677)
Summary:
In the original code, the value of `NO_FILE_OPENS` corresponding to the Ticker item will be increased regardless of whether the file is successfully opened or not. Even counts are repeated, which can lead to skewed counts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9677

Reviewed By: jay-zhuang

Differential Revision: D34725733

Pulled By: ajkr

fbshipit-source-id: 841234ed03802c0105fd2107d82a740265ead576
2022-03-22 18:08:57 -07:00
Jermy Li
446a152aa6 fix: Reusing-Iterator reads stale keys after DeleteRange() performed (#9258)
Summary:
fix https://github.com/facebook/rocksdb/issues/9255

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9258

Reviewed By: pdillinger

Differential Revision: D34879684

Pulled By: ajkr

fbshipit-source-id: 5934f4b7524dc27ecdf1430e0456a0fc02958fc7
2022-03-22 18:08:51 -07:00
Jay Zhuang
f94fce8623 DisableManualCompaction may fail to cancel an unscheduled task (#9659)
Summary:
https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition which was waiting for the background thread to clean-up the compaction.
make sure we only unschedule the task when it's scheduled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659

Reviewed By: ajkr

Differential Revision: D34651820

Pulled By: jay-zhuang

fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f
2022-03-22 18:08:09 -07:00
anand76
2e9a9f04d7 Fix some MultiGet batching stats (#9583)
Summary:
The NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats were being recorded only when the last file in a level happened to have hits. They are supposed to be updated for every level. Also, there was some overcounting of GetContextStats. This PR fixes both the problems.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9583

Test Plan: Update the unit test in db_basic_test

Reviewed By: akankshamahajan15

Differential Revision: D34308044

Pulled By: anand1976

fbshipit-source-id: b3b36020fda26ba91bc6e0e47d52d58f4d7f656e
2022-03-07 15:41:00 -08:00
Hui Xiao
e40bbc57db Fix PinSelf() read-after-free in DB::GetMergeOperands() (#9507)
Summary:
**Context:**
Running the new test `DBMergeOperandTest.MergeOperandReadAfterFreeBug` prior to this fix surfaces the read-after-free bug of PinSef() as below:
```
READ of size 8 at 0x60400002529d thread T0
    https://github.com/facebook/rocksdb/issues/5 0x7f199a in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171
    https://github.com/facebook/rocksdb/issues/6 0x7f199a in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1919
    https://github.com/facebook/rocksdb/issues/7 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203

freed by thread T0 here:
    https://github.com/facebook/rocksdb/issues/3 0x1191399 in rocksdb::cache_entry_roles_detail::RegisteredDeleter<rocksdb::Block, (rocksdb::CacheEntryRole)0>::Delete(rocksdb::Slice const&, void*) cache/cache_entry_roles.h:99
    https://github.com/facebook/rocksdb/issues/4 0x719348 in rocksdb::LRUHandle::Free() cache/lru_cache.h:205
    https://github.com/facebook/rocksdb/issues/5 0x71047f in rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*, bool) cache/lru_cache.cc:547
    https://github.com/facebook/rocksdb/issues/6 0xa78f0a in rocksdb::Cleanable::DoCleanup() include/rocksdb/cleanable.h:60
    https://github.com/facebook/rocksdb/issues/7 0xa78f0a in rocksdb::Cleanable::Reset() include/rocksdb/cleanable.h:38
    https://github.com/facebook/rocksdb/issues/8 0xa78f0a in rocksdb::PinnedIteratorsManager::ReleasePinnedData() db/pinned_iterators_manager.h:71
    https://github.com/facebook/rocksdb/issues/9 0xd0c21b in rocksdb::PinnedIteratorsManager::~PinnedIteratorsManager() db/pinned_iterators_manager.h:24
    https://github.com/facebook/rocksdb/issues/10 0xd0c21b in rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) db/pinned_iterators_manager.h:22
    https://github.com/facebook/rocksdb/issues/11 0x7f0fdf in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1886
    https://github.com/facebook/rocksdb/issues/12 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203

previously allocated by thread T0 here:
    https://github.com/facebook/rocksdb/issues/1 0x1239896 in rocksdb::AllocateBlock(unsigned long, **rocksdb::MemoryAllocator*)** memory/memory_allocator.h:35
    https://github.com/facebook/rocksdb/issues/2 0x1239896 in rocksdb::BlockFetcher::CopyBufferToHeapBuf() table/block_fetcher.cc:171
    https://github.com/facebook/rocksdb/issues/3 0x1239896 in rocksdb::BlockFetcher::GetBlockContents() table/block_fetcher.cc:206
    https://github.com/facebook/rocksdb/issues/4 0x122eae5 in rocksdb::BlockFetcher::ReadBlockContents() table/block_fetcher.cc:325
    https://github.com/facebook/rocksdb/issues/5 0x11b1f45 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const table/block_based/block_based_table_reader.cc:1503
```
Here is the analysis:
- We have [PinnedIteratorsManager](https://github.com/facebook/rocksdb/blob/6.28.fb/db/version_set.cc#L1980) with `Cleanable` capability in our `Version::Get()` path. It's responsible for managing the life-time of pinned iterator and invoking registered cleanup functions during its own destruction.
  - For example in case above, the merge operands's clean-up gets associated with this manger in [GetContext::push_operand](https://github.com/facebook/rocksdb/blob/6.28.fb/table/get_context.cc#L405). During PinnedIteratorsManager's [destruction](https://github.com/facebook/rocksdb/blob/6.28.fb/db/pinned_iterators_manager.h#L67), the release function associated with those merge operand data is invoked.
**And that's what we see in "freed by thread T955 here" in ASAN.**
- Bug 🐛: `PinnedIteratorsManager` is local to `Version::Get()`  while the data of merge operands need to outlive `Version::Get` and stay till they get [PinSelf()](https://github.com/facebook/rocksdb/blob/6.28.fb/db/db_impl/db_impl.cc#L1905), **which is the read-after-free in ASAN.**
  - This bug is likely to be an overlook of `PinnedIteratorsManager` when developing the API `DB::GetMergeOperands` cuz the current logic works fine with the existing case of getting the *merged value* where the operands do not need to live that long.
- This bug was not surfaced much (even in its unit test) due to the release function associated with the merge operands (which are actually blocks put in cache as you can see in `BlockBasedTable::MaybeReadBlockAndLoadToCache` **in "previously allocated by" in ASAN report**) is a cache entry deleter.
The deleter will call `Cache::Release()` which, for LRU cache, won't immediately deallocate the block based on LRU policy [unless the cache is full or being instructed to force erase](https://github.com/facebook/rocksdb/blob/6.28.fb/cache/lru_cache.cc#L521-L531)
  - `DBMergeOperandTest.MergeOperandReadAfterFreeBug` makes the cache extremely small to force cache full.

**Summary:**
- Fix the bug by align `PinnedIteratorsManager`'s lifetime with the merge operands

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9507

Test Plan:
- New test `DBMergeOperandTest.MergeOperandReadAfterFreeBug`
- db bench on read path
  - Setup (LSM tree with several levels, cache the whole db to avoid read IO, warm cache with readseq to avoid read IO): `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq  -num=1000000 -cache_size=100000000  -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1``TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="readrandom" -num=1000000 -cache_size=100000000 `
  - Actual command run (run 20-run for 20 times and then average the 20-run's average micros/op)
     - `for j in {1..20}; do (for i in {1..20}; do rm -rf /dev/shm/rocksdb/ && TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq,readrandom" -num=1000000 -cache_size=100000000  -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1 | egrep 'readrandom'; done > rr_output_pre.txt && (awk '{sum+=$3; sum_sqrt+=$3^2}END{print sum/20, sqrt(sum_sqrt/20-(sum/20)^2)}' rr_output_pre.txt) >> rr_output_pre_2.txt); done`
  - **Result: Pre-change: 3.79193 micros/op;   Post-change: 3.79528 micros/op (+0.09%)**

(pre-change)sorted avg micros/op of each 20-run | std of micros/op of each 20-run | (post-change) sorted avg micros/op of each 20-run | std of micros/op of each 20-run
-- | -- | -- | --
3.58355 | 0.265209 | 3.48715 | 0.382076
3.58845 | 0.519927 | 3.5832 | 0.382726
3.66415 | 0.452097 | 3.677 | 0.563831
3.68495 | 0.430897 | 3.68405 | 0.495355
3.70295 | 0.482893 | 3.68465 | 0.431438
3.719 | 0.463806 | 3.71945 | 0.457157
3.7393 | 0.453423 | 3.72795 | 0.538604
3.7806 | 0.527613 | 3.75075 | 0.444509
3.7817 | 0.426704 | 3.7683 | 0.468065
3.809 | 0.381033 | 3.8086 | 0.557378
3.80985 | 0.466011 | 3.81805 | 0.524833
3.8165 | 0.500351 | 3.83405 | 0.529339
3.8479 | 0.430326 | 3.86285 | 0.44831
3.85125 | 0.434108 | 3.8717 | 0.544098
3.8556 | 0.524602 | 3.895 | 0.411679
3.8656 | 0.476383 | 3.90965 | 0.566636
3.8911 | 0.488477 | 3.92735 | 0.608038
3.898 | 0.493978 | 3.9439 | 0.524511
3.97235 | 0.515008 | 3.9623 | 0.477416
3.9768 | 0.519993 | 3.98965 | 0.521481

- CI

Reviewed By: ajkr

Differential Revision: D34030519

Pulled By: hx235

fbshipit-source-id: a99ac585c11704c5ed93af033cb29ba0a7b16ae8
2022-03-07 15:40:07 -08:00
Hui Xiao
39af4e9565 Deflake DBErrorHandlingFSTest.MultiCFWALWriteError (#9496)
Summary:
**Context:**
As part of https://github.com/facebook/rocksdb/pull/6949, file deletion is disabled for faulty database on the IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed](e66199d848 (diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R396)). Before re-enabling file deletion, it `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `version_` in the recovery process.

However, this is not necessarily true due to `s = error_handler_.ClearBGError();` happening before that assertion can unblock some foreground thread by [`EventHelpers::NotifyOnErrorRecoveryEnd()`](3122cb4358/db/error_handler.cc (L552-L553)) as part of the `ClearBGError()`. That foreground thread can do whatever it wants including closing/reopening the db and clean up that same `versions_`.

As a consequence,  `assert(versions_->io_status().ok());`, will access `io_status()` of a nullptr and test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to [reopen the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494), where [`versions_` gets reset to nullptr](https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl.cc?fbclid=IwAR2uRhwBiPKgmE9q_6CM2mzbfwjoRgsGpXOrHruSJUDcAKc9rYZtVSvKdOY#L678) as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok()); ` gets excuted in the background thread, then we can see error like
```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```

**Summary:**
- I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up foreground, which I think is right before we LOG `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
   - As the context,  the orignal https://github.com/facebook/rocksdb/pull/3997  introducing `DBImpl::Resume()` calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` while the later https://github.com/facebook/rocksdb/pull/6949 distances these two calls a bit.
   - And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);` at least syntax-wise since these two functions are orthogonal. And it also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which basically is resetting some state variables.
- In addition, to preserve the previous behavior of  https://github.com/facebook/rocksdb/pull/6949 where status of re-enabling file deletion is not taken account into the general status of resuming the db, I separated `enable_file_deletion_s` from the general `s`
- In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` more clear, I separated it into its own if-block.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9496

Test Plan:
- Manually reproduce the assertion failure in`DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting sleep like below so that it's more likely for `assert(versions_->io_status().ok());` to execute after [reopening the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494) in the foreground (i.e, testing) thread
```
sleep(1);
assert(versions_->io_status().ok());
```
   `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
   ```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
https://github.com/facebook/rocksdb/issues/1   rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
https://github.com/facebook/rocksdb/issues/2   rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError()       /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
https://github.com/facebook/rocksdb/issues/3   rocksdb/error_handler_fs_test()
   ```
- The assertion failure does not happen with PR
`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
`[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms)  `

Reviewed By: riversand963, anand1976

Differential Revision: D33990099

Pulled By: hx235

fbshipit-source-id: 2e0259a471fa8892ff177da91b3e1c0792dd7bab
2022-03-07 15:35:15 -08:00
Jay Zhuang
529efcc5b2 Unschedule manual compaction from thread-pool queue (#9625)
Summary:
PR https://github.com/facebook/rocksdb/issues/9557 introduced a race condition between manual compaction
foreground thread and background compaction thread.
This PR adds the ability to really unschedule manual compaction from
thread-pool queue by differentiate tag name for manual compaction and
other tasks.
Also fix an issue that db `close()` didn't cancel the manual compaction thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9625

Test Plan: unittest not hang

Reviewed By: ajkr

Differential Revision: D34410811

Pulled By: jay-zhuang

fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c
2022-03-07 15:34:09 -08:00
Yanqin Jin
d6bb43202e Fix bug causing incorrect data returned by snapshot read (#9648)
Summary:
This bug affects use cases that meet the following conditions
- (has only the default column family or disables WAL) and
- has at least one event listener
- atomic flush is NOT affected.

If the above conditions meet, then RocksDB can release the db mutex before picking all the
existing memtables to flush. In the meantime, a snapshot can be created and db's sequence
number can still be incremented. The upcoming flush will ignore this snapshot.
A later read using this snapshot can return incorrect result.

To fix this issue, we call the listeners callbacks after picking the memtables so that we avoid
creating snapshots during this interval.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9648

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D34555456

Pulled By: riversand963

fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee
2022-03-07 15:32:50 -08:00
Yanqin Jin
e5451b30db Fix a silent data loss for write-committed txn (#9571)
Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time    thread 1                                       bg flush
 |   db->Put("a")
 |   txn = NewTxn()
 |   txn->Put("b", "v")
 |   txn->Prepare()       // writes only to 5.log
 |   db->SwitchMemtable() // memtable 1 has "a"
 |                        // close 5.log,
 |                        // creates 8.log
 |   trigger flush
 |                                                  pick memtable 1
 |                                                  unlock db mutex
 |                                                  write new sst
 |   txn->ctwb->Put("gtid", "1") // writes 8.log
 |   txn->Commit() // writes to 8.log
 |                 // writes to memtable 2
 |                                               compute min_log_number_to_keep_2pc, this
 |                                               will be 8 (incorrect).
 |
 |                                             Purge obsolete wals, including 5.log
 |
 V
```

At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in
memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"},
only the commit marker of txn is in 8.log

The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.

The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571

Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```

Reviewed By: siying

Differential Revision: D34235236

Pulled By: riversand963

fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
2022-02-17 15:37:59 -08:00
Jay Zhuang
05769ea7fb Cancel manual compaction in thread-pool queue (#9557)
Summary:
Fix `DisableManualCompaction()` has to wait scheduled manual compaction to start the execution to cancel the job.
When a manual compaction in thread-pool queue is cancel, set the job is_canceled to true and clean the resource.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9557

Test Plan: added unittest that will hang without the change

Reviewed By: ajkr

Differential Revision: D34214910

Pulled By: jay-zhuang

fbshipit-source-id: 89dbaee78ddf26eb13ce862c2b15f4a098b36a78
2022-02-15 19:50:25 -08:00
Peter Dillinger
da478f3eae Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453)
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.

This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)

Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453

Test Plan:
new unit test added. Added memtable Bloom to crash test.

Performance testing
--------------------

Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```

Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)

Before & after run simultaneously, average over 10 iterations per data point, ops/sec.

MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066

MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990

MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before:  12081715
After: 12117603

MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085

MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029

MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591

Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.

Reviewed By: ajkr

Differential Revision: D33805025

Pulled By: pdillinger

fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
2022-01-31 11:32:04 -08:00
Peter Dillinger
fc9d4071f0 Fast path for detecting unchanged prefix_extractor (#9407)
Summary:
Fixes a major performance regression in 6.26, where
extra CPU is spent in SliceTransform::AsString when reads involve
a prefix_extractor (Get, MultiGet, Seek). Common case performance
is now better than 6.25.

This change creates a "fast path" for verifying that the current prefix
extractor is unchanged and compatible with what was used to
generate a table file. This fast path detects the common case by
pointer comparison on the current prefix_extractor and a "known
good" prefix extractor (if applicable) that is saved at the time the
table reader is opened. The "known good" prefix extractor is saved
as another shared_ptr copy (in an existing field, however) to ensure
the pointer is not recycled.

When the prefix_extractor has changed to a different instance but
same compatible configuration (rare, odd), performance is still a
regression compared to 6.25, but this is likely acceptable because
of the oddity of such a case. The performance of incompatible
prefix_extractor is essentially unchanged.

Also fixed a minor case (ForwardIterator) where a prefix_extractor
could be used via a raw pointer after being freed as a shared_ptr,
if replaced via SetOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407

Test Plan:
## Performance
Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`

Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`

Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load)

v6.26: 4833 vs. 6698 (<- major regression!)
v6.27: 4737 vs. 6397 (still)
New: 6704 vs. 6461 (better than baseline in common case)
Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible)
Changed prefix size (no usable filter) in new: 787 vs. 5927
Changed prefix size (no usable filter) in new & baseline: 773 vs. 784

Reviewed By: mrambacher

Differential Revision: D33677812

Pulled By: pdillinger

fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76
2022-01-21 11:37:46 -08:00
Peter Dillinger
8064a3ac31 Fix flaky EventListenerTest.DisableBGCompaction (#9400)
Summary:
Wasn't able to easily reproduce error, but easy to see a race
condition between TestFlushListener::OnFlushCompleted and
DBTestBase::Close(), which frees CF handles before closing DB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9400

Test Plan: CI etc.

Reviewed By: riversand963

Differential Revision: D33645134

Pulled By: pdillinger

fbshipit-source-id: d0ec914cc43c9e14f53da633876b95b61995138d
2022-01-21 08:25:09 -08:00
Peter Dillinger
5576ded762 Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363)
Summary:
In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out"
of extra checks that can impact peak write performance, which
currently only includes force_consistency_checks. I considered including
some other options but did not see a db_bench performance difference.

Also clarify in comment for force_consistency_checks that it can "slow
down saturated writing."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363

Test Plan:
basic coverage in unit tests

Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see

force_consistency_checks=true -> 725360 ops/s
force_consistency_checks=false -> 783072 ops/s

Reviewed By: mrambacher

Differential Revision: D33636559

Pulled By: pdillinger

fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84
2022-01-18 17:31:03 -08:00
Si Ke
93b1de4f45 Enable db_test running in Centos 32 bit OS and Alpine 32 bit OS (#9294)
Summary:
Closes https://github.com/facebook/rocksdb/issues/9271

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9294

Reviewed By: riversand963, hx235

Differential Revision: D33586002

Pulled By: pdillinger

fbshipit-source-id: 3d1a2fa71023e108613ff03dbd37a5f954fc4920
2022-01-14 11:58:18 -08:00
zhuchong0329
5f2b661f54 FlushMemTable return ok but memtable does not synchronize flush (#8173)
Summary:
Fix https://github.com/facebook/rocksdb/issues/8046 : FlushMemTable return ok but memtable does not synchronize flush. The way to fix it is to expose RecoveryError.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8173

Reviewed By: ajkr

Differential Revision: D31674552

Pulled By: jay-zhuang

fbshipit-source-id: 9d16b69ba12a196bb429332ec8224754de97773d
2022-01-12 13:21:49 -08:00
Yanqin Jin
0376869f05 Remove using namespace (#9369)
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369

Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all

Reviewed By: pdillinger

Differential Revision: D33517260

Pulled By: riversand963

fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
2022-01-12 09:31:12 -08:00
Niklas Fiekas
f8bdd5797f Take compression level_values as const pointer (#9376)
Summary:
Compatible change, more natural (especially in generated Rust bindings), no risk that the API will ever need mutable access because it has to make a copy anyway.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9376

Reviewed By: ajkr

Differential Revision: D33541435

Pulled By: pdillinger

fbshipit-source-id: 15c512a0d70b6e8694fa99d598b7d022751c1e59
2022-01-12 08:34:53 -08:00
Jay Zhuang
9c6fb26033 Fix clang13 build error (#9374)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9374

Test Plan: Add CI for clang13 build

Reviewed By: riversand963

Differential Revision: D33522867

Pulled By: jay-zhuang

fbshipit-source-id: 642756825cf0b51e35861fb847ebaee4611b76ca
2022-01-11 10:36:22 -08:00
mrambacher
1973fcba11 Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362)
Summary:
In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions.  The PatternEntry methods were left in place but renamed to AddFactory.  The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.

Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362

Reviewed By: pdillinger

Differential Revision: D33432562

Pulled By: mrambacher

fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
2022-01-11 06:33:48 -08:00
jsteemann
255aefb628 Add filename to several Corruption messages (#9239)
Summary:
This change adds the filename of the offending filen to several place that produce Status objects with code `kCorruption`.
This is not an attempt to have every Corruption message in the codebase extended with the filename, but it is a start.
The motivation for the change was to quickly diagnose which file is corrupted when a large database is openend and there is not option to copy it offsite for analysis, run strace or install the ldb tool.
In the particular case in question, the error message improved from a mere
```
Corruption: checksum mismatch
```
to
```
Corruption: checksum mismatch in file /path/to/db/engine-rocksdb/MANIFEST-000171
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9239

Reviewed By: jay-zhuang

Differential Revision: D33237742

Pulled By: riversand963

fbshipit-source-id: bd42559cfbf786a0a674d091671d1a2bf07bdd31
2022-01-07 18:09:48 -08:00
Youngjae Lee
3dfee770c6 Remove obsolete function declaration (#8724)
Summary:
Function `Version::UpdateFilesByCompactionPri()` is never called and not implemented.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8724

Reviewed By: ajkr

Differential Revision: D30643943

Pulled By: riversand963

fbshipit-source-id: 174b2d9a2a42e286222909a035cc74a7b5602335
2022-01-07 18:06:10 -08:00
Yanqin Jin
b2e53ab2d8 Add checking for DB::DestroyColumnFamilyHandle() (#9347)
Summary:
Closing https://github.com/facebook/rocksdb/issues/5006

Calling `DB::DestroyColumnFamilyHandle(column_family)` with `column_family` being the return value of
`DB::DefaultColumnFamily()` will return `Status::InvalidArgument()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9347

Test Plan: make check

Reviewed By: akankshamahajan15

Differential Revision: D33369675

Pulled By: riversand963

fbshipit-source-id: a8266a4daddf2b7a773c2dc7f3eb9a4adfb6b6dd
2022-01-05 20:26:53 -08:00
Andrew Kryczka
b860a42158 Recover to exact latest seqno of data committed to MANIFEST (#9305)
Summary:
The LastSequence field in the MANIFEST file is the baseline seqno for a recovered DB. Recovering WAL entries might cause the recovered DB's seqno to advance above this baseline, but the recovered DB will never use a smaller seqno.

Before this PR, we were writing the DB's seqno at the time of LogAndApply() as the LastSequence value. This works in the sense that it is a large enough baseline for the recovered DB that it'll never overwrite any records in existing SST files. At the same time, it's arbitrarily larger than what's needed. This behavior comes from LevelDB, where there was no tracking of largest seqno in an SST file.

Now we know the largest seqno of newly written SST files, so we can write an exact value in LastSequence that actually reflects the largest seqno in any file referred to by the MANIFEST. This is primarily useful for correctness testing with unsynced data loss, where the recovered DB's seqno needs to indicate what records were recovered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9305

Test Plan:
- https://github.com/facebook/rocksdb/issues/9338 adds crash-recovery correctness testing coverage for WAL disabled use cases
- https://github.com/facebook/rocksdb/issues/9357 will extend that testing to cover file ingestion
- Added assertion at end of LogAndApply() for `VersionSet::descriptor_last_sequence_` consistency with files
- Manually tested upgrade/downgrade compatibility with a custom crash test that randomly picks between a `db_stress` built with and without this PR (for old code it must run with `-disable_wal=0`)

Reviewed By: riversand963

Differential Revision: D33182770

Pulled By: ajkr

fbshipit-source-id: 0bfafaf685f347cc8cb0e1d62e0186340a738f7d
2022-01-05 16:02:21 -08:00
mrambacher
fe31dc53ca Make the Env class Customizable (#9293)
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293

Reviewed By: pdillinger, zhichao-cao

Differential Revision: D33181591

Pulled By: mrambacher

fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
2022-01-04 16:45:49 -08:00
Yanqin Jin
677d2b4a8f Fix a bug in C-binding causing iterator to return incorrect result (#9343)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/9339

When writing SST file, the name, computed as `prefix_extractor->GetId()` will be written to the properties block.
When the SST is opened again in the future, `CreateFromString()` will take the name as argument and try
to create a prefix extractor object. Without this fix, the C API will pass a `Wrapper` pointer to the underlying
DB's `prefix_extractor`. `Wrapper::GetId()`, in this case, will be missing the prefix length component, causing a
prefix extractor of length 0 to be silently created and used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9343

Test Plan:
```
make c_test
./c_test
```

Reviewed By: mrambacher

Differential Revision: D33355549

Pulled By: riversand963

fbshipit-source-id: c92c3acd8be262c3bff8794b4229e42b9ee31203
2021-12-30 12:48:07 -08:00
Andrew Kryczka
aa2b3bf675 Added TraceOptions::preserve_write_order (#9334)
Summary:
This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB.

By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334

Test Plan:
- See if below unsynced data loss crash test can run  for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering.

```
DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400
```

Reviewed By: zhichao-cao

Differential Revision: D33301990

Pulled By: ajkr

fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3
2021-12-28 15:04:26 -08:00
slk
2e5f764294 Make IncreaseFullHistoryTsLow to a public API (#9221)
Summary:
As (https://github.com/facebook/rocksdb/issues/9210) discussed, the **full_history_ts_low** is a member of CompactRangeOptions currently, which means a CF's fullHistoryTsLow is advanced only when users submit a CompactRange request.
However, users may want to advance the fllHistoryTsLow without an immediate compact.
This merge make IncreaseFullHistoryTsLow to a public API so users can advance each CF's fullHistoryTsLow seperately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9221

Reviewed By: akankshamahajan15

Differential Revision: D33201106

Pulled By: riversand963

fbshipit-source-id: 9cb1d013ba93260f72e16353e693ffee167b47ee
2021-12-23 11:03:51 -08:00
Andrew Kryczka
538d2365e9 Fix race condition in BackupEngineTest.ChangeManifestDuringBackupCreation (#9327)
Summary:
The failure looked like this:

```
utilities/backupable/backupable_db_test.cc:3161: Failure
Value of: db_chroot_env_->FileExists(prev_manifest_path).IsNotFound()
  Actual: false
Expected: true
```

The failure could be coerced consistently with the following patch:

```
 diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 80410f671..637636791 100644
 --- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -2772,6 +2772,8 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) {
     if (job_context.HaveSomethingToClean() ||
         job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
       mutex_.Unlock();
+      bg_cv_.SignalAll();
+      sleep(1);
       TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FilesFound");
       // Have to flush the info logs before bg_flush_scheduled_--
       // because if bg_flush_scheduled_ becomes 0 and the lock is
```

The cause was a familiar problem, which is manual flush/compaction may
return before files they obsoleted are removed. The solution is just to
wait for "scheduled" work to complete, which includes all phases
including cleanup.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9327

Test Plan:
after this PR, even the above patch to coerce the bug cannot
cause the test to fail.

Reviewed By: riversand963

Differential Revision: D33252208

Pulled By: ajkr

fbshipit-source-id: 720a7eaca58c7247d221911fffe3d5e1dbf581e9
2021-12-22 21:59:53 -08:00
Andrew Kryczka
393fc231af More asserts in listener_test for debuggability (#9320)
Summary:
We ran into a flake I could not debug so instead added assertions in
case it happens again.

Command was:

```
TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 USE_CLANG=1 OPT=-g SKIP_FORMAT_BUCK_CHECKS=1 make J=80 -j80 ubsan_check
```

Failure output was:

```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from EventListenerTest
[ RUN      ] EventListenerTest.DisableBGCompaction
UndefinedBehaviorSanitizer:DEADLYSIGNAL
==1558126==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000031 (pc 0x7fd9c04dda22 bp 0x7fd9bf8aa580 sp 0x7fd9bf8aa540 T1558147)
==1558126==The signal is caused by a READ memory access.
==1558126==Hint: address points to the zero page.
    #0 0x7fd9c04dda21 in __dynamic_cast /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/libsupc++/../../.././libstdc++-v3/libsupc++/dyncast.cc:49:3
    https://github.com/facebook/rocksdb/issues/1 0x510c53 in __ubsan::checkDynamicType(void*, void*, unsigned long) (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/listener_test+0x510c53)
    https://github.com/facebook/rocksdb/issues/2 0x50fb32 in HandleDynamicTypeCacheMiss(__ubsan::DynamicTypeCacheMissData*, unsigned long, unsigned long, __ubsan::ReportOptions) (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/listener_test+0x50fb32)
    https://github.com/facebook/rocksdb/issues/3 0x510230 in __ubsan_handle_dynamic_type_cache_miss_abort (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/listener_test+0x510230)
    https://github.com/facebook/rocksdb/issues/4 0x63221a in rocksdb::ColumnFamilyHandleImpl* rocksdb::static_cast_with_check<rocksdb::ColumnFamilyHandleImpl, rocksdb::ColumnFamilyHandle>(rocksdb::ColumnFamilyHandle*) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/./util/cast_util.h:19:20
    https://github.com/facebook/rocksdb/issues/5 0x71cafa in rocksdb::DBImpl::TEST_GetFilesMetaData(rocksdb::ColumnFamilyHandle*, std::vector<std::vector<rocksdb::FileMetaData, std::allocator<rocksdb::FileMetaData> >, std::allocator<std::vector<rocksdb::FileMetaData, std::allocator<rocksdb::FileMetaData> > > >*, std::vector<std::shared_ptr<rocksdb::BlobFileMetaData>, std::allocator<std::shared_ptr<rocksdb::BlobFileMetaData> > >*) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_debug.cc:63:14
    https://github.com/facebook/rocksdb/issues/6 0x53f6b4 in rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/listener_test.cc:277:24
    https://github.com/facebook/rocksdb/issues/7 0x6e2f7d in rocksdb::DBImpl::NotifyOnFlushCompleted(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, std::__cxx11::list<std::unique_ptr<rocksdb::FlushJobInfo, std::default_delete<rocksdb::FlushJobInfo> >, std::allocator<std::unique_ptr<rocksdb::FlushJobInfo, std::default_delete<rocksdb::FlushJobInfo> > > >*) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:863:19
    https://github.com/facebook/rocksdb/issues/8 0x6e1074 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::vector<unsigned long, std::allocator<unsigned long> >&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:314:5
    https://github.com/facebook/rocksdb/issues/9 0x6e3412 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:359:14
    https://github.com/facebook/rocksdb/issues/10 0x700df6 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2703:14
    https://github.com/facebook/rocksdb/issues/11 0x6fe1f0 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2742:16
    https://github.com/facebook/rocksdb/issues/12 0x6fc732 in rocksdb::DBImpl::BGWorkFlush(void*) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2569:44
    https://github.com/facebook/rocksdb/issues/13 0xb3a820 in void std::_Bind<void (* (void*))(void*)>::operator()<void>() /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.x/functional:482:17
    https://github.com/facebook/rocksdb/issues/14 0xb3a820 in std::_Function_handler<void (), std::_Bind<void (* (void*))(void*)> >::_M_invoke(std::_Any_data const&) /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.x/bits/std_function.h:300:2
    https://github.com/facebook/rocksdb/issues/15 0xb347cc in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/util/threadpool_imp.cc:266:5
    https://github.com/facebook/rocksdb/issues/16 0xb34a2f in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/util/threadpool_imp.cc:307:7
    https://github.com/facebook/rocksdb/issues/17 0x7fd9c051a660 in execute_native_thread_routine /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:80:18
    https://github.com/facebook/rocksdb/issues/18 0x7fd9c041e20b in start_thread /home/engshare/third-party2/glibc/2.30/src/glibc-2.30/nptl/pthread_create.c:479:8
    https://github.com/facebook/rocksdb/issues/19 0x7fd9c01dd16e in clone /home/engshare/third-party2/glibc/2.30/src/glibc-2.30/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9320

Reviewed By: jay-zhuang

Differential Revision: D33242185

Pulled By: ajkr

fbshipit-source-id: 741984b10a610e0509e0d4e54c42cdbac03f5285
2021-12-21 12:27:54 -08:00
mrambacher
9a116ab4b4 Add NewMetaDataIterator method (#8692)
Summary:
Fixes a problem where the iterator for metadata was being treated as a non-user key when in fact it was a user key.  This led to a problem where the property keys could not be searched for correctly.

The main exposure of this problem was that the HashIndexReader could not get the "prefixes" property correctly, resulting in the failure of retrieval/creation of the BlockPrefixIndex.

Added BlockBasedTableTest.SeekMetaBlocks test to validate this condition.

Fixing this condition exposed two other tests (SeekWithPrefixLongerThanKey, MultiGetPrefixFilter) that passed incorrectly previously and now failed.  Updated those two tests to pass.  Not sure if the tests are functionally correct/still appropriate, but made them pass...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8692

Reviewed By: riversand963

Differential Revision: D33119539

Pulled By: mrambacher

fbshipit-source-id: 658969fe9265f73dc184dab97cc3f4eaed2d881a
2021-12-21 11:32:49 -08:00
Andrew Kryczka
782fcc44e1 Fix race condition in error_handler_fs_test (#9325)
Summary:
We saw the below assertion failure in `error_handler_fs_test`:

```
db/error_handler_fs_test.cc:2471: Failure
Expected equality of these values:
  listener->new_bg_error()
    Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
  Status::Aborted()
    Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00>
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/error_handler_fs_test.cc:2471: Failure
Expected equality of these values:
  listener->new_bg_error()
    Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
  Status::Aborted()
    Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00>
Received signal 6 (Aborted)
```

The problem was completing `OnErrorRecoveryCompleted()` would
wake up the main thread and allow it to proceed to that assertion. But
that assertion assumes `OnErrorRecoveryEnd()` has completed since
only `OnErrorRecoveryEnd()` affects `new_bg_error()`.

The fix is just to make `OnErrorRecoveryCompleted()` not wake up the
main thread, by means of not implementing it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9325

Test Plan:
- ran `while TEST_TMPDIR=/dev/shm ./error_handler_fs_test ; do : ; done` for a while
- injected sleep between `OnErrorRecovery{Completed,End}()` callbacks, which guaranteed repro before this PR

Reviewed By: anand1976

Differential Revision: D33249200

Pulled By: ajkr

fbshipit-source-id: 1659ee183cd09f90d4dbd898f65103473fcf84a8
2021-12-20 23:16:52 -08:00
mrambacher
423538a816 Make MemoryAllocator into a Customizable class (#8980)
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980

Reviewed By: zhichao-cao

Differential Revision: D32990403

Pulled By: mrambacher

fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
2021-12-17 04:20:47 -08:00
Peter Dillinger
0050a73a4f New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).

The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.

This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)

The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.

Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.

Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126

Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.

### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)

### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime

We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:

```
Total cache or DBs size: 32TiB  Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```

These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key.  With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.

More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day

After enough data, we get a result at the end:

```
(keep 40 bits)  17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```

If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:

```
(keep 41 bits)  16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits)  19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```

The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:

```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```

I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.

Reviewed By: zhichao-cao

Differential Revision: D33171746

Pulled By: pdillinger

fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-16 17:15:13 -08:00
Akanksha Mahajan
96d0773a11 Update prepopulate_block_cache logic to support block-based filter (#9300)
Summary:
Update prepopulate_block_cache logic to support block-based
filter during insertion in block cache

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9300

Test Plan:
CircleCI tests,
make crash_test -j64

Reviewed By: pdillinger

Differential Revision: D33132018

Pulled By: akankshamahajan15

fbshipit-source-id: 241deabab8645bda704728e572d6de6354df18b2
2021-12-15 13:20:27 -08:00
Yanqin Jin
08721293ea Fix a bug causing duplicate trailing entries in WritableFile (buffered IO) (#9236)
Summary:
`db_stress` is a user of `FaultInjectionTestFS`. After injecting a write error, `db_stress` probabilistically determins
data drop (https://github.com/facebook/rocksdb/blob/6.27.fb/db_stress_tool/db_stress_test_base.cc#L2615:L2619).

In some of our recent runs of `db_stress`, we found duplicate trailing entries corresponding to file trivial move in
the MANIFEST, causing the recovery to fail, because the file move operation is not idempotent: you cannot delete a
file from a given level twice.

Investigation suggests that data buffering in both `WritableFileWriter` and `FaultInjectionTestFS` may be the root cause.

WritableFileWriter buffers data to write in a memory buffer, `WritableFileWriter::buf_`. After each
`WriteBuffered()`/`WriteBufferedWithChecksum()` succeeds, the `buf_` is cleared.

If the underlying file `WritableFileWriter::writable_file_` is opened in buffered IO mode, then `FaultInjectionTestFS`
buffers data written for each file until next file sync. After an injected error, user of `FaultInjectionFS` can
choose to drop some or none of previously buffered data. If `db_stress` does not drop any unsynced data, then
such data will still exist in the `FaultInjectionTestFS`'s buffer.

Existing implementation of `WritableileWriter::WriteBuffered()` does not clear `buf_` if there is an error. This may lead
to the data being buffered two copies: one in `WritableFileWriter`, and another in `FaultInjectionTestFS`.
We also know that the `WritableFileWriter` of MANIFEST file will close upon an error.  During `Close()`, it will flush the
content in `buf_`. If no write error is injected to `FaultInjectionTestFS` this time, then we end up with two copies of the
data appended to the file.

To fix, we clear the `WritableFileWriter::buf_` upon failure as well. We focus this PR on files opened in non-direct mode.

This PR includes a unit test to reproduce a case when write error injection
to `WritableFile` can cause duplicate trailing entries.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9236

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D33033984

Pulled By: riversand963

fbshipit-source-id: ebfa5a0db8cbf1ed73100528b34fcba543c5db31
2021-12-13 09:00:36 -08:00
Akanksha Mahajan
eca85cdb66 Fix flaky tests related to Blob file deletions (#9287)
Summary:
CompactRange() only waits for manual.done to be set
which happens as soon as new version is installed. Added TEST_WaitForCompact() which
waits for compaction thread to actually finish which is after
PurgeObsoleteFiles().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9287

Test Plan: Reproducible by adding  `bg_cv_.SignalAll();`  inside if condition 297d913275/db/db_impl/db_impl_compaction_flush.cc (L2876)

Reviewed By: ajkr

Differential Revision: D33051122

Pulled By: akankshamahajan15

fbshipit-source-id: cd793c79efb8cf8587faaf89f7c51f5d8e5bb71d
2021-12-12 15:31:38 -08:00
Yanqin Jin
5455cacd18 Fix link error reported in issue 9272 (#9278)
Summary:
As title, Closes https://github.com/facebook/rocksdb/issues/9272
Since TimestampAssigner-related classes needs to access
`WriteBatch::ProtectionInfo` objects which is for internal use only,
it's difficult to make `AssignTimestamp` methods a template and put them
in the same public header, `include/rocksdb/write_batch.h`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9278

Test Plan:
```
make check
# Also manually test following the repro-steps in issue 9272
```

Reviewed By: ltamasi

Differential Revision: D33012686

Pulled By: riversand963

fbshipit-source-id: 89f24a86a1170125bd0b94ef3b32e69aa08bd949
2021-12-10 20:33:46 -08:00
Yanqin Jin
bd513fd075 Add commit marker with timestamp (#9266)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266

This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.

Reviewed By: ltamasi

Differential Revision: D31721350

fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
2021-12-10 11:05:35 -08:00
Peter Dillinger
653c392e47 More refactoring ahead of footer & meta changes (#9240)
Summary:
I'm working on a new format_version=6 to support context
checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
updates to support that change.

Test coverage data and manual inspection agree on dead code in
block_based_table_reader.cc (removed).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240

Test Plan:
tests enhanced to cover more cases etc.

Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.

TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}

(Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
Before w/ wal: 454512
After w/ wal: 444820 (-2.1%)
Before w/o wal: 1004560
After w/o wal: 998897 (-0.6%)

Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.

This regression will be corrected in a follow-up PR.

Reviewed By: ajkr

Differential Revision: D32813769

Pulled By: pdillinger

fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
2021-12-10 08:13:26 -08:00
mrambacher
5486717ee2 Fix an issue with MemTableRepFactory::CreateFromString (#9273)
Summary:
If ignore_unsupported_options=true, then it is possible for MemTableRepFactory::CreateFromString to succeed without setting a result (result=nullptr).  This would cause the original value to be overwritten with null and an error would be raised later when PrepareOptions is invoked.

Added unit test for this condition.  Will add (in another PR unless required by reviewers) comparable tests for all of the other Customizable classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9273

Reviewed By: ltamasi

Differential Revision: D32990365

Pulled By: mrambacher

fbshipit-source-id: b150724c3f5ae7346357b3866244fd93466875c7
2021-12-09 12:36:18 -08:00
Si Ke
79f4a04ee3 Get DBTest passing Assert Status Checked (#7737)
Summary:
Closes https://github.com/facebook/rocksdb/pull/7737

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9231

Reviewed By: hx235

Differential Revision: D32978332

Pulled By: pdillinger

fbshipit-source-id: b28900b685d60c668529a90dbaa8e1b357b28f76
2021-12-09 11:00:17 -08:00
anand76
ecf2bec613 Add a listener callback for end of auto error recovery (#9244)
Summary:
Previously, the OnErrorRecoveryCompleted callback was called when
RocksDB was able to successfully recover from a retryable error.
However, if the recovery failed and was eventually stopped, there was no
indication of the status. To fix that, a new OnErrorRecoveryEnd callback
is introduced that deprecates the OnErrorRecoveryCompleted callback. The
new callback is called with the original error and the new error status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9244

Test Plan: Add a new unit test in error_handler_fs_test

Reviewed By: zhichao-cao

Differential Revision: D32922303

Pulled By: anand1976

fbshipit-source-id: f04e77a9cb92c5ea6385590682d3fcf559971b99
2021-12-08 14:30:57 -08:00
Akanksha Mahajan
9e4d56f2c9 Fix segmentation fault in table_options.prepopulate_block_cache when used with partition_filters (#9263)
Summary:
When table_options.prepopulate_block_cache is set to
BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and
table_options.partition_filters is also set true, then there is
segmentation failure when top level filter is fetched because its
entered with wrong type in cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263

Test Plan:
Updated unit tests;
Ran db_stress: make crash_test -j32

Reviewed By: pdillinger

Differential Revision: D32936566

Pulled By: akankshamahajan15

fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426
2021-12-08 12:44:38 -08:00
Levi Tamasi
94d99400dc Fix a typo in DBSSTTest.DBWithMaxSpaceAllowedWithBlobFiles (#9270)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9270

Test Plan:
```
gtest-parallel --repeat=10000 ./db_sst_test --gtest_filter=DBSSTTest.DBWithMaxSpaceAllowedWithBlobFiles
```

Reviewed By: akankshamahajan15

Differential Revision: D32958154

Pulled By: ltamasi

fbshipit-source-id: b6ec2fbbece80d73c567cec57638dffd3c84a2ba
2021-12-08 12:05:37 -08:00
Levi Tamasi
d1f053b0ae Attempt to deflake DBSSTTest.DestroyDBWithRateLimitedDelete (#9269)
Summary:
This test case seems to be occasionally failing due to the code hitting
the immediate deletion branch in `DeleteScheduler::DeleteFile`. The
patch increases the allowed trash ratio to a huge value to prevent this
from happening.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9269

Test Plan:
```
gtest-parallel --repeat=10000 ./db_sst_test --gtest_filter=DBSSTTest.DestroyDBWithRateLimitedDelete
```

Reviewed By: akankshamahajan15

Differential Revision: D32956596

Pulled By: ltamasi

fbshipit-source-id: 3945e7c1c19ede76698e03c3f133bc1d9fd61b84
2021-12-08 11:16:46 -08:00
sdong
88875df821 File temperature information should be preserved when restart the DB (#9242)
Summary:
Fix a bug that causes file temperature not preserved after DB is restarted, or options.max_manifest_file_size is hit.
Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242

Test Plan: Add a unit test that would fail without the fix.

Reviewed By: jay-zhuang

Differential Revision: D32818150

fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec
2021-12-03 14:43:14 -08:00
Levi Tamasi
930f2e92e6 Attempt to deflake DBSSTTest.DBWithSFMForBlobFilesAtomicFlush (#9241)
Summary:
When using the SST file manager, the actual deletion of DB files
potentially occurs in the background. The patch adds another call
to `SstFileManagerImpl::WaitForEmptyTrash` to the test case
`DBSSTTest.DBWithSFMForBlobFilesAtomicFlush` to ensure the deletions
are performed before the test checks the number of deleted files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9241

Test Plan:
```
gtest-parallel --repeat=1000 ./db_sst_test --gtest_filter=DBSSTTest.DBWithSFMForBlobFilesAtomicFlush
```

Reviewed By: akankshamahajan15

Differential Revision: D32811427

Pulled By: ltamasi

fbshipit-source-id: 7f2ad649a22bd2d7900e5f132372034093cfcf47
2021-12-02 16:54:21 -08:00
lgqss
77c7085594 MemTableList::TrimHistory now use allocated bytes (#9020)
Summary:
Fix a bug when both max_write_buffer_size_to_maintain and max_write_buffer_number_to_maintain are 0.
The bug was introduced in 6.5.0 and  https://github.com/facebook/rocksdb/issues/5022.
Fix https://github.com/facebook/rocksdb/issues/8371

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9020

Reviewed By: pdillinger

Differential Revision: D32767084

Pulled By: ajkr

fbshipit-source-id: c401ee6e2557230e892d0fe8abb4966cbd18e85f
2021-12-02 11:45:39 -08:00
Hui Xiao
9daf07305c Replace TableProperties::properties_offsets map with external_sst_file_global_seqno_offset (#9212)
Summary:
**Context:**
Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offeset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties.

Note:
- See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets`
- See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence.

**Summary:**
- Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212

Test Plan: - Relied on existing tests should be sufficient since `TableProperties::properties_offsets` existed before and should already be tested.

Reviewed By: ajkr

Differential Revision: D32665941

Pulled By: hx235

fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795
2021-12-02 08:30:36 -08:00