Commit Graph

2258 Commits

Author SHA1 Message Date
Hui Xiao
920386f2b7 Detect (new) Bloom/Ribbon Filter construction corruption (#9342)
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum

This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.

Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.

**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
   - See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
   -  Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342

Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
   - For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true  -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
   -  FastLocalBloom
       - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
       - After change:
          -  `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
    -  Standard128Ribbon
       - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
       - After change:
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.

Reviewed By: pdillinger

Differential Revision: D33746928

Pulled By: hx235

fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
2022-02-01 17:42:35 -08:00
Peter Dillinger
f6d7ec1d02 Ignore total_order_seek in DB::Get (#9427)
Summary:
Apparently setting total_order_seek=true for DB::Get was
intended to allow accurate read semantics if the current prefix
extractor doesn't match what was used to generate SST files on
disk. But since prefix_extractor was made a mutable option in 5.14.0, we
have been able to detect this case and provide the correct semantics
regardless of the total_order_seek option. Since that time, the option
has only made Get() slower in a reasonably common case: prefix_extractor
unchanged and whole_key_filtering=false.

So this change primarily removes unnecessary effect of
total_order_seek on Get. Also cleans up some related comments.

Also adds a -total_order_seek option to db_bench and canonicalizes
handling of ReadOptions in db_bench so that command line options have
the expected association with library features. (There is potential
for change in regression test behavior, but the old behavior is likely
indefensible, or some other inconsistency would need to be fixed.)

TODO in follow-up work: there should be no reason for Get() to depend on
current prefix extractor at all.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427

Test Plan:
Unit tests updated.

Performance (using db_bench update)

Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0`

Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`

Before this change, total_order_seek=false: 25188 ops/sec
Before this change, total_order_seek=true:   1222 ops/sec (~20x slower)

After this change, total_order_seek=false: 24570 ops/sec
After this change, total_order_seek=true:  25012 ops/sec (indistinguishable)

Reviewed By: siying

Differential Revision: D33753458

Pulled By: pdillinger

fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
2022-01-31 19:46:42 -08:00
Hui Xiao
a3de7ae49f Mark virtual ~Env() override (#9467)
Summary:
**Context:**

Compiling RocksDB with -Winconsistent-missing-destructor-override reveals the following :

```
./include/rocksdb/env.h:174:11: error: '~Env' overrides a destructor but is not marked 'override' [-Werror,-Winconsistent-missing-destructor-override]
  virtual ~Env();
          ^
./include/rocksdb/customizable.h:58:3: note: overridden virtual function is here
  ~Customizable() override {}
```

The need of overriding the Env's destructor seems to be introduced by https://github.com/facebook/rocksdb/pull/9293 and surfaced by -Winconsistent-missing-destructor-override, which is not turned on by default.

**Summary:**
Mark  ~Env() override

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9467

Test Plan: - Turn on -Winconsistent-missing-destructor-override and USE_CLANG=1 make -jN env/env.o to see whether the error shows up

Reviewed By: jay-zhuang, riversand963, george-reynya

Differential Revision: D33864985

Pulled By: hx235

fbshipit-source-id: 4a78bd161ff153902b2676829723e9a1c33dd749
2022-01-31 10:14:35 -08:00
Hui Xiao
42cca28ebb Remove deprecated API AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds (#9455)
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455

Test Plan: Rely on my eyeball and CI

Reviewed By: ajkr

Differential Revision: D33811664

Pulled By: hx235

fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
2022-01-28 16:47:08 -08:00
Yanqin Jin
d10c5c08d3 Remove iter_start_seqnum and preserve_deletes (#9430)
Summary:
According to https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl.cc#L2896:L2911 and https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl_open.cc#L203:L208,
we are going to remove `iter_start_seqnum` and `preserve_deletes` starting from RocksDB 7.0

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9430

Test Plan: make check and CI

Reviewed By: ajkr

Differential Revision: D33753639

Pulled By: riversand963

fbshipit-source-id: c80aab8e8d8fc33e52472fed524ed703d0ffc8b6
2022-01-28 13:28:38 -08:00
anand76
e58cc56fb5 Use == operator for shared_ptr nullptr comparison (#9465)
Summary:
From C++ 20 onwards, the != operator is not supported for a shared_ptr.
So switch to using ==.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9465

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D33850596

Pulled By: anand1976

fbshipit-source-id: eec16d1aa6c39a315ec2d44d233d7518f9c1ddcb
2022-01-28 12:48:52 -08:00
Akanksha Mahajan
74ccd1931e Remove deprecated option DBOptions::skip_log_error_on_recovery (#9434)
Summary:
In  RocksDB DBOptions::skip_log_error_on_recovery is marked as
"NOT SUPPORTED" for a long time, and setting this option does not have
any effect on the behavior of RocksDB library. Therefore, we are removing it
in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9434

Test Plan: CircleCI

Reviewed By: ajkr

Differential Revision: D33763015

Pulled By: akankshamahajan15

fbshipit-source-id: 11f09643298da6c02d3dcdb090b996f4c3cfdd76
2022-01-28 01:46:04 -08:00
Akanksha Mahajan
ed86cd5e78 Remove deprecated overloads of DB::CompactRange (#9444)
Summary:
In RocksDB few overloads of DB::CompactRange() are marked as DEPRECATED_FUNC, and
we are removing it in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9444

Test Plan: CircleCI

Reviewed By: ajkr

Differential Revision: D33788520

Pulled By: akankshamahajan15

fbshipit-source-id: 716e0d5f227f791605d4d91626c0cbf5b4571630
2022-01-27 23:12:30 -08:00
Jay Zhuang
22321e1027 Remove unused API base_background_compactions (#9462)
Summary:
The API is deprecated long time ago. Clean up the codebase by
removing it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462

Test Plan: CI, fake release: D33835220

Reviewed By: riversand963

Differential Revision: D33835103

Pulled By: jay-zhuang

fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184
2022-01-27 21:05:18 -08:00
Peter Dillinger
78aee6fedc Remove obsolete backupable_db.h, utility_db.h (#9438)
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.

In reference to https://github.com/facebook/rocksdb/issues/9389

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438

Test Plan: CI

Reviewed By: mrambacher

Differential Revision: D33780269

Pulled By: pdillinger

fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
2022-01-27 15:45:30 -08:00
Peter Dillinger
ea89c77f27 Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453)
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.

This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)

Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453

Test Plan:
new unit test added. Added memtable Bloom to crash test.

Performance testing
--------------------

Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```

Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)

Before & after run simultaneously, average over 10 iterations per data point, ops/sec.

MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066

MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990

MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before:  12081715
After: 12117603

MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085

MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029

MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591

Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.

Reviewed By: ajkr

Differential Revision: D33805025

Pulled By: pdillinger

fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
2022-01-27 14:55:04 -08:00
Hui Xiao
1e0e883ca5 Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452)
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452

Test Plan: Rely on my eyeball and CI

Reviewed By: ajkr

Differential Revision: D33804938

Pulled By: hx235

fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
2022-01-27 13:01:09 -08:00
mrambacher
37ec9d0c12 Improve performance of SliceTransform::AsString (#9401)
Summary:
1. Removed the options from the Capped/Fixed SliceTransforms.  Instead these classes are created with id.number.  This allows the GetID() id to be calculated and stored at class construction time.  This change puts the construction back to similar to how it was prior to the Customizable changes for SliceTransform.

2.  Improve the performance of AsString by using the ID only if there are no option properties (which is the case for all of the builtin transforms).

Ran tests of calling AsString in a loop 5M times and found approximately a 10x performance increase vs the original code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9401

Reviewed By: pdillinger

Differential Revision: D33668672

Pulled By: mrambacher

fbshipit-source-id: d0075912c6ece8ed754ee543bc6b0b49a169b309
2022-01-27 10:05:33 -08:00
Baptiste Lemaire
92822655fd Remove deprecated table_cache_remove_scan_count_limit option. (#9450)
Summary:
In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9450

Reviewed By: ajkr

Differential Revision: D33802466

Pulled By: bjlemaire

fbshipit-source-id: 97570985f1400525304053476450f7ef504c0cd5
2022-01-27 09:33:31 -08:00
Peter Dillinger
449029f865 Remove deprecated ObjectLibrary::Register() (and Regex public API) (#9439)
Summary:
Regexes are considered potentially problematic for use in
registering RocksDB extensions, so we are removing
ObjectLibrary::Register() and the Regex public API it depended on (now
unused).

In reference to https://github.com/facebook/rocksdb/issues/9389

Why?
* The power of Regexes can make it hard to reason about which extension
will match what. (The replacement API isn't perfect, but we are at least
"holding the line" on patterns we have seen in practice.)
* It is easy to make regexes that don't quite mean what you think they
mean, such as forgetting that the `.` in `foo.bar` can match any character
or that matching is nondeterministic, as in `a🅱️42` matching `.*:[0-9]+`.
* Some regexes and implementations can have disastrously bad
performance. This might not be much practical concern for ObjectLibray
here, but we don't want to encourage potentially dangerous further use
in production code. (Testing code is fine. See TestRegex.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9439

Test Plan: CI

Reviewed By: mrambacher

Differential Revision: D33792342

Pulled By: pdillinger

fbshipit-source-id: 4f64dcb04764e639162c8977a5fa196f67754cec
2022-01-26 16:22:44 -08:00
Siddhartha Roychowdhury
c27ca23644 Add option for WAL compression algorithm (#9432)
Summary:
Add an option to set the WAL compression algorithm - wal_compression.

TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432

Reviewed By: pdillinger

Differential Revision: D33797275

Pulled By: sidroyc

fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7
2022-01-26 14:23:00 -08:00
Jay Zhuang
961d8dacf2 Remove unused option purge_redundant_kvs_while_flush (#9429)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9429

Test Plan: fake release for test: D33754513

Reviewed By: riversand963

Differential Revision: D33753637

Pulled By: jay-zhuang

fbshipit-source-id: 18db4701e8f28dda8f1ab660c2be9890a8312c12
2022-01-26 10:24:16 -08:00
Akanksha Mahajan
8822562d75 Remove deprecated function DB::AddFile (#9433)
Summary:
RocksDB has marked DB::AddFile() as "DEPRECATED_FUNC" for a long time, and
it will be removed in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9433

Test Plan: make check -j64; CircleCI

Reviewed By: riversand963

Differential Revision: D33763987

Pulled By: akankshamahajan15

fbshipit-source-id: a3407324479bb43689e1213e4e29d53095e7579a
2022-01-25 23:22:58 -08:00
Jay Zhuang
022b400cba Make bottommost_temperature dynamically changeable (#9402)
Summary:
Make `AdvancedColumnFamilyOptions.bottommost_temperature`
dynamically changeable with `SetOptions` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9402

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D33674487

Pulled By: jay-zhuang

fbshipit-source-id: 8943768156aa6197c63850a64238a8092527d517
2022-01-25 15:23:04 -08:00
Yanqin Jin
fa52376117 Move RADOS support to separate repo (#9206)
Summary:
This PR moves RADOS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host RADOS support. At this point,
people can start from the example repo and fork.

The goal is to include this commit in RocksDB 7.0 release.

Reference:
https://github.com/ajkr/dedupfs by ajkr

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9206

Test Plan:
Follow instructions in https://github.com/riversand963/rocksdb-rados-env/blob/main/README.md and build
test binary `env_librados_test` and run it.

Also, make check

Reviewed By: ajkr

Differential Revision: D33751690

Pulled By: riversand963

fbshipit-source-id: 30466c62afa9e4619847a48567ed158e62835e35
2022-01-24 22:50:07 -08:00
Yanqin Jin
50135c1bf3 Move HDFS support to separate repo (#9170)
Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.

Java/JNI is not included yet, and needs to be done later if necessary.

The goal is to include this commit in RocksDB 7.0 release.

Reference:
https://github.com/ajkr/dedupfs by ajkr

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170

Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.

make check

Reviewed By: ajkr

Differential Revision: D33751662

Pulled By: riversand963

fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
2022-01-24 20:23:54 -08:00
anand76
e8f116deab Update version to 6.29.0 (#9418)
Summary:
Update version for 6.29 release

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9418

Reviewed By: riversand963

Differential Revision: D33721048

Pulled By: anand1976

fbshipit-source-id: e73602ee1c829c2e47ce6e181bca4db7cb663979
2022-01-21 18:23:07 -08:00
Peter Dillinger
e7ac7363b4 Add to HISTORY and minor loose ends from #9294, #9254 (#9386)
Summary:
Loose ends relate to mmap on 32-bit systems. (Testing is more
complicated when the feature was completely disabled on 32-bit.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9386

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D33590715

Pulled By: pdillinger

fbshipit-source-id: f2637036a538a552200adee65b6765fce8cae27b
2022-01-21 13:04:19 -08:00
Brian Chen
93a0e9f3fa Mark destructors as override (#9404)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9404

It is better practice to mark destructors as override. Without this
change there can be issues building with
-Wsuggest-destructor-override.

Reviewed By: riversand963

Differential Revision: D33671992

fbshipit-source-id: 75b0c15010cbab5fbc071c150fef1dc85d5d9d96
2022-01-20 08:44:27 -08:00
Peter Dillinger
ffe1e4b820 Make some FilterPolicy deprecations more clear (#9403)
Summary:
The old block-based filter has been deprecated for years, but
this makes that more clear by marking the functions specific to it and
logging a warning when the feature is used.

It is deprecated because of performance. In that old design, you have to
binary search through the full SST index before a bloom filter query, which
is much more expensive than a bloom query itself.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9403

Test Plan:
Used db_bench with and without -use_block_based_filter,
running at the same time

    TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom,readrandom -num=10000000 -duration=20 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0

No significant difference in construction time but 3x slower readrandom
with -use_block_based_filter:
readrandom   :     100.517 micros/op 9948 ops/sec;    1.1 MB/s
vs.
readrandom   :      33.368 micros/op 29968 ops/sec;    3.3 MB/s

Also saw deprecation message (just once) in LOG only with
-use_block_based_filter

Reviewed By: ajkr

Differential Revision: D33673202

Pulled By: pdillinger

fbshipit-source-id: 99f6f0eff619408d9e5f7ef546954ed0be6c7a5b
2022-01-19 18:12:10 -08:00
Andrew Kryczka
875bfd75a0 Add API warning for Iterator::Refresh() with range tombstones (#9398)
Summary:
Need this until we properly return an error or fix the combination. Reported in https://github.com/facebook/rocksdb/issues/9255.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9398

Reviewed By: riversand963

Differential Revision: D33641396

Pulled By: ajkr

fbshipit-source-id: 9fe804108f7b93912f5b9c7252ac49acedc4f805
2022-01-19 10:13:27 -08:00
Hui Xiao
f61df25cc2 Add missing comment to RateLimiter::Request() (#9392)
Summary:
**Context/Summary:**
There are two `RateLimiter::Request()` in public header. One of them is missing some comment that the other one has.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9392

Test Plan: rely on CI test

Reviewed By: pdillinger

Differential Revision: D33623609

Pulled By: hx235

fbshipit-source-id: 42dc06308ff0bcf5ee7ef67e0b1c0172fc239b20
2022-01-19 10:09:49 -08:00
Peter Dillinger
5576ded762 Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363)
Summary:
In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out"
of extra checks that can impact peak write performance, which
currently only includes force_consistency_checks. I considered including
some other options but did not see a db_bench performance difference.

Also clarify in comment for force_consistency_checks that it can "slow
down saturated writing."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363

Test Plan:
basic coverage in unit tests

Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see

force_consistency_checks=true -> 725360 ops/s
force_consistency_checks=false -> 783072 ops/s

Reviewed By: mrambacher

Differential Revision: D33636559

Pulled By: pdillinger

fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84
2022-01-18 17:31:03 -08:00
Sergei Petrunia
c9042db619 Range Locking: add support for escalation barriers (#9290)
Summary:
Range Locking supports Lock Escalation. Lock Escalation is invoked when
lock memory is nearly exhausted and it reduced the amount of memory used
by joining adjacent locks.

Bridging the gap between certain locks has adverse effects. For example,
in MyRocks it is not a good idea to bridge the gap between locks in
different indexes, as that get the lock to cover large portions of
indexes, or even entire indexes.

Resolve this by introducing Escalation Barrier. The escalation process
will call the user-provided barrier callback function:
   bool(const Endpoint& a, const Endpoint& b)

If the function returns true, there's a barrier between a and b and Lock
Escalation will not try to bridge the gap between a and b.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9290

Reviewed By: akankshamahajan15

Differential Revision: D33486753

Pulled By: riversand963

fbshipit-source-id: f97910b67aba0579ea1d35f523ca6863d3dd018e
2022-01-14 12:46:09 -08:00
Si Ke
93b1de4f45 Enable db_test running in Centos 32 bit OS and Alpine 32 bit OS (#9294)
Summary:
Closes https://github.com/facebook/rocksdb/issues/9271

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9294

Reviewed By: riversand963, hx235

Differential Revision: D33586002

Pulled By: pdillinger

fbshipit-source-id: 3d1a2fa71023e108613ff03dbd37a5f954fc4920
2022-01-14 11:58:18 -08:00
Niklas Fiekas
f8bdd5797f Take compression level_values as const pointer (#9376)
Summary:
Compatible change, more natural (especially in generated Rust bindings), no risk that the API will ever need mutable access because it has to make a copy anyway.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9376

Reviewed By: ajkr

Differential Revision: D33541435

Pulled By: pdillinger

fbshipit-source-id: 15c512a0d70b6e8694fa99d598b7d022751c1e59
2022-01-12 08:34:53 -08:00
mrambacher
1973fcba11 Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362)
Summary:
In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions.  The PatternEntry methods were left in place but renamed to AddFactory.  The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.

Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362

Reviewed By: pdillinger

Differential Revision: D33432562

Pulled By: mrambacher

fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
2022-01-11 06:33:48 -08:00
Yanqin Jin
55a2105258 Make RocksDB codebase compatible with newer compilers like clang-12 (#9370)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9370

GCC and newer clang, e.g. clang-12 treat `std::unique_ptr` slightly differently.
For the following code
```
#include <iostream>
#include <memory>
#include <type_traits>

struct A {
    std::unique_ptr<int> m1;
};

int main()
{
    std::cout << std::boolalpha;
    std::cout << std::is_standard_layout<A>::value << '\n';
    return 0;
}
```
GCC11(C++20) (tested on https://en.cppreference.com/w/cpp/types/is_standard_layout) will print "true", while newer clang, e.g. clang-12 will print "false". This breaks the usage of `offsetof()` on structs with non-static members of type `std::unique_ptr`.
Fixing this by replacing the builtin `offsetof` with a trick documented at https://gist.github.com/graphitemaster/494f21190bb2c63c5516.

Reviewed By: jay-zhuang

Differential Revision: D33420840

fbshipit-source-id: 02bde281dfa28809bec787ad0f7019e85dd9c607
2022-01-10 11:09:05 -08:00
Hui Xiao
f62efb9d35 Clarify Options::rate_limiter api (#9361)
Summary:
Context/Summary:
I believe we also rate-limit read rate using the rate limiter passed into db options, e.g, https://github.com/facebook/rocksdb/blob/6.27.fb/file/random_access_file_reader.cc#L159

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9361

Test Plan: Existing tests

Reviewed By: jay-zhuang

Differential Revision: D33420803

Pulled By: hx235

fbshipit-source-id: 0ef3c4d0aaacb9bee9a5d2caceddfc76588c8949
2022-01-06 10:13:53 -08:00
mrambacher
fe31dc53ca Make the Env class Customizable (#9293)
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293

Reviewed By: pdillinger, zhichao-cao

Differential Revision: D33181591

Pulled By: mrambacher

fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
2022-01-04 16:45:49 -08:00
mrambacher
1c39b7952b Remove/Reduce use of Regex in ObjectRegistry/Library (#9264)
Summary:
Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex.  For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes.  For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes.  There is also a class for Customizable::IndividualId format matches.

Added tests for the new derivative classes and got all unit tests to pass.

Resolves https://github.com/facebook/rocksdb/issues/9225.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9264

Reviewed By: pdillinger

Differential Revision: D33062001

Pulled By: mrambacher

fbshipit-source-id: c2d2143bd2d38bdf522705c8280c35381b135c03
2021-12-29 07:56:23 -08:00
Andrew Kryczka
aa2b3bf675 Added TraceOptions::preserve_write_order (#9334)
Summary:
This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB.

By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334

Test Plan:
- See if below unsynced data loss crash test can run  for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering.

```
DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400
```

Reviewed By: zhichao-cao

Differential Revision: D33301990

Pulled By: ajkr

fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3
2021-12-28 15:04:26 -08:00
Andrew Kryczka
2ee20a669d Extend trace filtering to more operation types (#9335)
Summary:
- Extended trace filtering to cover `MultiGet()`, `Seek()`, and `SeekForPrev()`. Now all user ops that can be traced support filtering.
- Enabled the new filter masks in `db_stress` since it only cares to trace writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9335

Test Plan:
- trace-heavy `db_stress` command reduced 30% elapsed time  (79.21 -> 55.47 seconds)

Benchmark command:
```
$ /usr/bin/time ./db_stress -ops_per_thread=100000 -sync_fault_injection=1 --db=/dev/shm/rocksdb_stress_db/ --expected_values_dir=/dev/shm/rocksdb_stress_expected/ --clear_column_family_one_in=0
```

- replay-heavy `db_stress` command reduced 12.4% elapsed time (23.69 -> 20.75 seconds)

Setup command:
```
$  ./db_stress -ops_per_thread=100000000 -sync_fault_injection=1 -db=/dev/shm/rocksdb_stress_db/ -expected_values_dir=/dev/shm/rocksdb_stress_expected --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
```

Benchmark command:
```
$ /usr/bin/time ./db_stress -ops_per_thread=1 -reopen=0 -expected_values_dir=/dev/shm/rocksdb_stress_expected/ -db=/dev/shm/rocksdb_stress_db/ --clear_column_family_one_in=0 --destroy_db_initially=0
```

Reviewed By: zhichao-cao

Differential Revision: D33304580

Pulled By: ajkr

fbshipit-source-id: 0df10f87c1fc506e9484b6b42cea2ef96c7ecd65
2021-12-28 11:46:30 -08:00
slk
2e5f764294 Make IncreaseFullHistoryTsLow to a public API (#9221)
Summary:
As (https://github.com/facebook/rocksdb/issues/9210) discussed, the **full_history_ts_low** is a member of CompactRangeOptions currently, which means a CF's fullHistoryTsLow is advanced only when users submit a CompactRange request.
However, users may want to advance the fllHistoryTsLow without an immediate compact.
This merge make IncreaseFullHistoryTsLow to a public API so users can advance each CF's fullHistoryTsLow seperately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9221

Reviewed By: akankshamahajan15

Differential Revision: D33201106

Pulled By: riversand963

fbshipit-source-id: 9cb1d013ba93260f72e16353e693ffee167b47ee
2021-12-23 11:03:51 -08:00
Sergei Petrunia
1b076e82db Expose locktree's wait count in RangeLockManagerHandle::Counters (#9289)
Summary:
locktree is a module providing Range Locking. It has a counter for
the number of times a lock acquisition request was blocked by an
existing conflicting lock and had to wait for it to be released.

Expose this counter in RangeLockManagerHandle::Counters::lock_wait_count.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9289

Reviewed By: jay-zhuang

Differential Revision: D33079182

Pulled By: riversand963

fbshipit-source-id: 25b1a362d9da247536ab5007bd15900b319f139e
2021-12-22 21:14:48 -08:00
Andrew Kryczka
782fcc44e1 Fix race condition in error_handler_fs_test (#9325)
Summary:
We saw the below assertion failure in `error_handler_fs_test`:

```
db/error_handler_fs_test.cc:2471: Failure
Expected equality of these values:
  listener->new_bg_error()
    Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
  Status::Aborted()
    Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00>
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/error_handler_fs_test.cc:2471: Failure
Expected equality of these values:
  listener->new_bg_error()
    Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
  Status::Aborted()
    Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00>
Received signal 6 (Aborted)
```

The problem was completing `OnErrorRecoveryCompleted()` would
wake up the main thread and allow it to proceed to that assertion. But
that assertion assumes `OnErrorRecoveryEnd()` has completed since
only `OnErrorRecoveryEnd()` affects `new_bg_error()`.

The fix is just to make `OnErrorRecoveryCompleted()` not wake up the
main thread, by means of not implementing it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9325

Test Plan:
- ran `while TEST_TMPDIR=/dev/shm ./error_handler_fs_test ; do : ; done` for a while
- injected sleep between `OnErrorRecovery{Completed,End}()` callbacks, which guaranteed repro before this PR

Reviewed By: anand1976

Differential Revision: D33249200

Pulled By: ajkr

fbshipit-source-id: 1659ee183cd09f90d4dbd898f65103473fcf84a8
2021-12-20 23:16:52 -08:00
Akanksha Mahajan
7bfad07194 Update to version 6.28 (#9312)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9312

Reviewed By: ajkr

Differential Revision: D33196324

Pulled By: akankshamahajan15

fbshipit-source-id: 471da75eaedc54d3151672adc28643bc1d6fdf23
2021-12-17 16:20:39 -08:00
mrambacher
423538a816 Make MemoryAllocator into a Customizable class (#8980)
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980

Reviewed By: zhichao-cao

Differential Revision: D32990403

Pulled By: mrambacher

fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
2021-12-17 04:20:47 -08:00
Yanqin Jin
5455cacd18 Fix link error reported in issue 9272 (#9278)
Summary:
As title, Closes https://github.com/facebook/rocksdb/issues/9272
Since TimestampAssigner-related classes needs to access
`WriteBatch::ProtectionInfo` objects which is for internal use only,
it's difficult to make `AssignTimestamp` methods a template and put them
in the same public header, `include/rocksdb/write_batch.h`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9278

Test Plan:
```
make check
# Also manually test following the repro-steps in issue 9272
```

Reviewed By: ltamasi

Differential Revision: D33012686

Pulled By: riversand963

fbshipit-source-id: 89f24a86a1170125bd0b94ef3b32e69aa08bd949
2021-12-10 20:33:46 -08:00
Yanqin Jin
bd513fd075 Add commit marker with timestamp (#9266)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266

This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.

Reviewed By: ltamasi

Differential Revision: D31721350

fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
2021-12-10 11:05:35 -08:00
Peter Dillinger
653c392e47 More refactoring ahead of footer & meta changes (#9240)
Summary:
I'm working on a new format_version=6 to support context
checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
updates to support that change.

Test coverage data and manual inspection agree on dead code in
block_based_table_reader.cc (removed).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240

Test Plan:
tests enhanced to cover more cases etc.

Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.

TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}

(Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
Before w/ wal: 454512
After w/ wal: 444820 (-2.1%)
Before w/o wal: 1004560
After w/o wal: 998897 (-0.6%)

Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.

This regression will be corrected in a follow-up PR.

Reviewed By: ajkr

Differential Revision: D32813769

Pulled By: pdillinger

fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
2021-12-10 08:13:26 -08:00
anand76
ecf2bec613 Add a listener callback for end of auto error recovery (#9244)
Summary:
Previously, the OnErrorRecoveryCompleted callback was called when
RocksDB was able to successfully recover from a retryable error.
However, if the recovery failed and was eventually stopped, there was no
indication of the status. To fix that, a new OnErrorRecoveryEnd callback
is introduced that deprecates the OnErrorRecoveryCompleted callback. The
new callback is called with the original error and the new error status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9244

Test Plan: Add a new unit test in error_handler_fs_test

Reviewed By: zhichao-cao

Differential Revision: D32922303

Pulled By: anand1976

fbshipit-source-id: f04e77a9cb92c5ea6385590682d3fcf559971b99
2021-12-08 14:30:57 -08:00
Hui Xiao
9daf07305c Replace TableProperties::properties_offsets map with external_sst_file_global_seqno_offset (#9212)
Summary:
**Context:**
Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offeset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties.

Note:
- See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets`
- See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence.

**Summary:**
- Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212

Test Plan: - Relied on existing tests should be sufficient since `TableProperties::properties_offsets` existed before and should already be tested.

Reviewed By: ajkr

Differential Revision: D32665941

Pulled By: hx235

fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795
2021-12-02 08:30:36 -08:00
mrambacher
7cd5835a28 Make RateLimiter Customizable (#9141)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9141

Reviewed By: zhichao-cao

Differential Revision: D32432190

Pulled By: mrambacher

fbshipit-source-id: 7930ed88a02412128cd407b5063522484e45c6ce
2021-12-01 06:57:02 -08:00
Yanqin Jin
924616526a Update WriteBatch::AssignTimestamp() and Add (#9205)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9205

Update WriteBatch::AssignTimestamp() APIs so that they take an
additional argument, i.e. a function object called `checker` indicating the user-specified logic of performing
checks on timestamp sizes.

WriteBatch is a building block used by multiple other RocksDB components, each of which may track
timestamp information in different data structures. For example, transaction can either write to
`WriteBatchWithIndex` which is a `WriteBatch` with index, or write directly to raw `WriteBatch` if
`Transaction::DisableIndexing()` is called.
`WriteBatchWithIndex` keeps mapping from column family id to comparator, and transaction needs
to keep similar information for the `WriteBatch` if user calls `Transaction::DisableIndexing()` (dynamically)
so that we will know the size of each timestamp later. The bookkeeping info maintained by `WriteBatchWithIndex`
and `Transaction` should not overlap.
When we later call `WriteBatch::AssignTimestamp()`, we need to use these data structures to guarantee
that we do not accidentally assign timestamps for keys from column families that disable timestamp.

Reviewed By: ltamasi

Differential Revision: D31735186

fbshipit-source-id: 8b1709ed880ac72f995aa9e012e5873b290840a7
2021-11-30 22:33:00 -08:00