Summary:
This change removes the ability to configure the deprecated,
inefficient block-based filter in the public API. Options that would
have enabled it now use "full" (and optionally partitioned) filters.
Existing block-based filters can still be read and used, and a "back
door" way to build them still exists, for testing and in case of trouble.
About the only way this removal would cause an issue for users is if
temporary memory for filter construction greatly increases. In
HISTORY.md we suggest a few possible mitigations: partitioned filters,
smaller SST files, or setting reserve_table_builder_memory=true.
Or users who have customized a FilterPolicy using the
CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade
their code. (It's long past time for people to move to the new
builder/reader customization interface.)
This change also introduces some internal-use-only configuration strings
for testing specific filter implementations while bypassing some
compatibility / intelligence logic. This is intended to hint at a path
toward making FilterPolicy Customizable, but it also gives us a "back
door" way to configure block-based filter.
Aside: updated db_bench so that -readonly implies -use_existing_db
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535
Test Plan:
Unit tests updated. Specifically,
* BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back
door configuration interface and ignoring of `use_block_based_builder`.
* BlockBasedTableTest.TracingGetTest is migrated from testing
block-based filter access pattern to full filter access patter, by
re-ordering some things.
* Options test (pretty self-explanatory)
Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30`
Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB
With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB
No consistent difference with fillrandom
Reviewed By: jay-zhuang
Differential Revision: D34153871
Pulled By: pdillinger
fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9537
Add `Transaction::SetReadTimestampForValidation()` and
`Transaction::SetCommitTimestamp()` APIs with default implementation
returning `Status::NotSupported()`. Currently, calling these two APIs do not
have any effect.
Also add checks to `PessimisticTransactionDB`
to enforce that column families in the same db either
- disable user-defined timestamp
- enable 64-bit timestamp
Just to clarify, a `PessimisticTransactionDB` can have some column families without
timestamps as well as column families that enable timestamp.
Each `PessimisticTransaction` can have two optional timestamps, `read_timestamp_`
used for additional validation and `commit_timestamp_` which denotes when the transaction commits.
For now, we are going to support `WriteCommittedTxn` (in a series of subsequent PRs)
Once set, we do not allow decreasing `read_timestamp_`. The `commit_timestamp_` must be
greater than `read_timestamp_` for each transaction and must be set before commit, unless
the transaction does not involve any column family that enables user-defined timestamp.
TransactionDB builds on top of RocksDB core `DB` layer. Though `DB` layer assumes
that user-defined timestamps are byte arrays, `TransactionDB` uses uint64_t to store
timestamps. When they are passed down, they are still interpreted as
byte-arrays by `DB`.
Reviewed By: ltamasi
Differential Revision: D31567959
fbshipit-source-id: b0b6b69acab5d8e340cf174f33e8b09f1c3d3502
Summary:
This change should guarantee that the default ObjectLibrary/Registry are long-lived and not destroyed while the process is running. This will prevent some issues of them being referenced after they were destroyed via the static destruction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9464
Reviewed By: pdillinger
Differential Revision: D33849876
Pulled By: mrambacher
fbshipit-source-id: 7a69177d7c58c81be293fc7ef8e600d47ddbc14b
Summary:
When tests are run with TMPD, c_test may fail because TMPD
is not created by the test. It results in IO error: No such file
or directory: While mkdir if missing:
/tmp/rocksdb_test_tmp/rocksdb_c_test-0: No such file or directory
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9547
Test Plan:
make -j32 c_test;
TEST_TMPDIR=/tmp/rocksdb_test ./c_test
Reviewed By: riversand963
Differential Revision: D34173298
Pulled By: akankshamahajan15
fbshipit-source-id: 5b5a01f5b842c2487b05b0708c8e9532241db7f8
Summary:
This fix addresses https://github.com/facebook/rocksdb/issues/9299.
If attempting to create a new object via the ObjectRegistry and a factory is not found, the ObjectRegistry will return a "NotSupported" status. This is the same behavior as previously.
If the factory is found but could not successfully create the object, an "InvalidArgument" status is returned. If the factory returned a reason why (in the errmsg), this message will be in the returned status.
In practice, there are two options in the ConfigOptions that control how these errors are propagated:
- If "ignore_unknown_options=true", then both InvalidArgument and NotSupported status codes will be swallowed internally. Both cases will return success
- If "ignore_unsupported_options=true", then having no factory will return success but a failing factory will return an error
- If both options are false, both cases (no and failing factory) will return errors.
In practice this likely only changes Customizable that may be partially available. For example, the JEMallocMemoryAllocator is a built-in allocator that is registered with the system but may not be compiled in. In this case, the status code for this allocator changed from NotSupported("JEMalloc not available") to InvalidArgumen("JEMalloc not available"). Other Customizable builtins/plugins would have the same semantics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9333
Reviewed By: pdillinger
Differential Revision: D33517681
Pulled By: mrambacher
fbshipit-source-id: 8033052d4a4a7b88c2d9f90147b1b4467e51f6fd
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33788508
Pulled By: akankshamahajan15
fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
Summary:
* Inefficient block-based filter is no longer customizable in the public
API, though (for now) can still be enabled.
* Removed deprecated FilterPolicy::CreateFilter() and
FilterPolicy::KeyMayMatch()
* Removed `rocksdb_filterpolicy_create()` from C API
* Change meaning of nullptr return from GetBuilderWithContext() from "use
block-based filter" to "generate no filter in this case." This is a
cleaner solution to the proposal in https://github.com/facebook/rocksdb/issues/8250.
* Also, when user specifies bits_per_key < 0.5, we now round this down
to "no filter" because we expect a filter with >= 80% FP rate is
unlikely to be worth the CPU cost of accessing it (esp with
cache_index_and_filter_blocks=1 or partition_filters=1).
* bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP
rate)
* This also gives us some support for configuring filters from OPTIONS
file as currently saved: `filter_policy=rocksdb.BuiltinBloomFilter`.
Opening from such an options file will enable reading filters (an
improvement) but not writing new ones. (See Customizable follow-up
below.)
* Also removed deprecated functions
* FilterBitsBuilder::CalculateNumEntry()
* FilterPolicy::GetFilterBitsBuilder()
* NewExperimentalRibbonFilterPolicy()
* Remove default implementations of
* FilterBitsBuilder::EstimateEntriesAdded()
* FilterBitsBuilder::ApproximateNumEntries()
* FilterPolicy::GetBuilderWithContext()
* Remove support for "filter_policy=experimental_ribbon" configuration
string.
* Allow "filter_policy=bloomfilter:n" without bool to discourage use of
block-based filter.
Some pieces for https://github.com/facebook/rocksdb/issues/9389
Likely follow-up (later PRs):
* Refactoring toward FilterPolicy Customizable, so that we can generate
filters with same configuration as before when configuring from options
file.
* Remove support for user enabling block-based filter (ignore `bool
use_block_based_builder`)
* Some months after this change, we could even remove read support for
block-based filter, because it is not critical to DB data
preservation.
* Make FilterBitsBuilder::FinishV2 to avoid `using
FilterBitsBuilder::Finish` mess and add support for specifying a
MemoryAllocator (for cache warming)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9501
Test Plan:
A number of obsolete tests deleted and new tests or test
cases added or updated.
Reviewed By: hx235
Differential Revision: D34008011
Pulled By: pdillinger
fbshipit-source-id: a39a720457c354e00d5b59166b686f7f59e392aa
Summary:
... seen only in internal clang-analyze runs after https://github.com/facebook/rocksdb/issues/9481
* Mostly, this works around falsely reported leaks by using
std::unique_ptr in some places where clang-analyze was getting
confused. (I didn't see any changes in C++17 that could make our Status
implementation leak memory.)
* Also fixed SetBGError returning address of a stack variable.
* Also fixed another false null deref report by adding an assert.
Also, use SKIP_LINK=1 to speed up `make analyze`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9515
Test Plan:
Was able to reproduce the reported errors locally and verify
they're fixed (except SetBGError). Otherwise, existing tests
Reviewed By: hx235
Differential Revision: D34054630
Pulled By: pdillinger
fbshipit-source-id: 38600ef3da75ddca307dff96b7a1a523c2885c2e
Summary:
In RocksDB few overloads of DB::GetApproximateSizes are marked as
DEPRECATED_FUNC, and we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9458
Test Plan: CircleCI
Reviewed By: riversand963
Differential Revision: D34043791
Pulled By: akankshamahajan15
fbshipit-source-id: 815c0ad283a6627c4b241479c7d40ce03a758493
Summary:
For tiered storage
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9498
Test Plan: Just API placeholders for now
Reviewed By: jay-zhuang
Differential Revision: D33993094
Pulled By: pdillinger
fbshipit-source-id: 3cf19a450c7232e05306e94018559b26e9fd35db
Summary:
Drop support for some old compilers by requiring C++17 standard
(or higher). See https://github.com/facebook/rocksdb/issues/9388
First modification based on this is to remove some conditional compilation in slice.h (also
better for ODR)
Also in this PR:
* Fix some Makefile formatting that seems to affect ASSERT_STATUS_CHECKED config in
some cases
* Add c_test to NON_PARALLEL_TEST in Makefile
* Fix a clang-analyze reported "potential leak" in lru_cache_test
* Better "compatibility" definition of DEFINE_uint32 for old versions of gflags
* Fix a linking problem with shared libraries in Makefile (`./random_test: error while loading shared libraries: librocksdb.so.6.29: cannot open shared object file: No such file or directory`)
* Always set ROCKSDB_SUPPORT_THREAD_LOCAL and use thread_local (from C++11)
* TODO in later PR: clean up that obsolete flag
* Fix a cosmetic typo in c.h (https://github.com/facebook/rocksdb/issues/9488)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9481
Test Plan:
CircleCI config substantially updated.
* Upgrade to latest Ubuntu images for each release
* Generally prefer Ubuntu 20, but keep a couple Ubuntu 16 builds with oldest supported
compilers, to ensure compatibility
* Remove .circleci/cat_ignore_eagain except for Ubuntu 16 builds, because this is to work
around a kernel bug that should not affect anything but Ubuntu 16.
* Remove designated gcc-9 build, because the default linux build now uses GCC 9 from
Ubuntu 20.
* Add some `apt-key add` to fix some apt "couldn't be verified" errors
* Generally drop SKIP_LINK=1; work-around no longer needed
* Generally `add-apt-repository` before `apt-get update` as manual testing indicated the
reverse might not work.
Travis:
* Use gcc-7 by default (remove specific gcc-7 and gcc-4.8 builds)
* TODO in later PR: fix s390x "Assembler messages: Error: invalid switch -march=z14" failure
AppVeyor:
* Completely dropped because we are dropping VS2015 support and CircleCI covers
VS >= 2017
Also local testing with old gflags (out of necessity when using ROCKSDB_NO_FBCODE=1).
Reviewed By: mrambacher
Differential Revision: D33946377
Pulled By: pdillinger
fbshipit-source-id: ae077c823905b45370a26c0103ada119459da6c1
Summary:
In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9446
Reviewed By: ajkr
Differential Revision: D33793048
Pulled By: bjlemaire
fbshipit-source-id: 73316efdb194e90225005246673dae99e65577ae
Summary:
I feel it would be nice if we can fix this spelling error.
In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490
Test Plan: make check
Reviewed By: hx235
Differential Revision: D33949862
Pulled By: riversand963
fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum
This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.
Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.
**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
- See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
- Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342
Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
- For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
- FastLocalBloom
- Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
- After change:
- `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
- Standard128Ribbon
- Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
- After change:
- `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.
Reviewed By: pdillinger
Differential Revision: D33746928
Pulled By: hx235
fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
Summary:
Apparently setting total_order_seek=true for DB::Get was
intended to allow accurate read semantics if the current prefix
extractor doesn't match what was used to generate SST files on
disk. But since prefix_extractor was made a mutable option in 5.14.0, we
have been able to detect this case and provide the correct semantics
regardless of the total_order_seek option. Since that time, the option
has only made Get() slower in a reasonably common case: prefix_extractor
unchanged and whole_key_filtering=false.
So this change primarily removes unnecessary effect of
total_order_seek on Get. Also cleans up some related comments.
Also adds a -total_order_seek option to db_bench and canonicalizes
handling of ReadOptions in db_bench so that command line options have
the expected association with library features. (There is potential
for change in regression test behavior, but the old behavior is likely
indefensible, or some other inconsistency would need to be fixed.)
TODO in follow-up work: there should be no reason for Get() to depend on
current prefix extractor at all.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427
Test Plan:
Unit tests updated.
Performance (using db_bench update)
Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0`
Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
Before this change, total_order_seek=false: 25188 ops/sec
Before this change, total_order_seek=true: 1222 ops/sec (~20x slower)
After this change, total_order_seek=false: 24570 ops/sec
After this change, total_order_seek=true: 25012 ops/sec (indistinguishable)
Reviewed By: siying
Differential Revision: D33753458
Pulled By: pdillinger
fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
Summary:
**Context:**
Compiling RocksDB with -Winconsistent-missing-destructor-override reveals the following :
```
./include/rocksdb/env.h:174:11: error: '~Env' overrides a destructor but is not marked 'override' [-Werror,-Winconsistent-missing-destructor-override]
virtual ~Env();
^
./include/rocksdb/customizable.h:58:3: note: overridden virtual function is here
~Customizable() override {}
```
The need of overriding the Env's destructor seems to be introduced by https://github.com/facebook/rocksdb/pull/9293 and surfaced by -Winconsistent-missing-destructor-override, which is not turned on by default.
**Summary:**
Mark ~Env() override
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9467
Test Plan: - Turn on -Winconsistent-missing-destructor-override and USE_CLANG=1 make -jN env/env.o to see whether the error shows up
Reviewed By: jay-zhuang, riversand963, george-reynya
Differential Revision: D33864985
Pulled By: hx235
fbshipit-source-id: 4a78bd161ff153902b2676829723e9a1c33dd749
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455
Test Plan: Rely on my eyeball and CI
Reviewed By: ajkr
Differential Revision: D33811664
Pulled By: hx235
fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
Summary:
From C++ 20 onwards, the != operator is not supported for a shared_ptr.
So switch to using ==.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9465
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D33850596
Pulled By: anand1976
fbshipit-source-id: eec16d1aa6c39a315ec2d44d233d7518f9c1ddcb
Summary:
In RocksDB DBOptions::skip_log_error_on_recovery is marked as
"NOT SUPPORTED" for a long time, and setting this option does not have
any effect on the behavior of RocksDB library. Therefore, we are removing it
in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9434
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33763015
Pulled By: akankshamahajan15
fbshipit-source-id: 11f09643298da6c02d3dcdb090b996f4c3cfdd76
Summary:
In RocksDB few overloads of DB::CompactRange() are marked as DEPRECATED_FUNC, and
we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9444
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33788520
Pulled By: akankshamahajan15
fbshipit-source-id: 716e0d5f227f791605d4d91626c0cbf5b4571630
Summary:
The API is deprecated long time ago. Clean up the codebase by
removing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462
Test Plan: CI, fake release: D33835220
Reviewed By: riversand963
Differential Revision: D33835103
Pulled By: jay-zhuang
fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.
In reference to https://github.com/facebook/rocksdb/issues/9389
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33780269
Pulled By: pdillinger
fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.
This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)
Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453
Test Plan:
new unit test added. Added memtable Bloom to crash test.
Performance testing
--------------------
Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```
Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)
Before & after run simultaneously, average over 10 iterations per data point, ops/sec.
MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066
MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990
MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before: 12081715
After: 12117603
MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591
Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.
Reviewed By: ajkr
Differential Revision: D33805025
Pulled By: pdillinger
fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452
Test Plan: Rely on my eyeball and CI
Reviewed By: ajkr
Differential Revision: D33804938
Pulled By: hx235
fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
Summary:
1. Removed the options from the Capped/Fixed SliceTransforms. Instead these classes are created with id.number. This allows the GetID() id to be calculated and stored at class construction time. This change puts the construction back to similar to how it was prior to the Customizable changes for SliceTransform.
2. Improve the performance of AsString by using the ID only if there are no option properties (which is the case for all of the builtin transforms).
Ran tests of calling AsString in a loop 5M times and found approximately a 10x performance increase vs the original code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9401
Reviewed By: pdillinger
Differential Revision: D33668672
Pulled By: mrambacher
fbshipit-source-id: d0075912c6ece8ed754ee543bc6b0b49a169b309
Summary:
In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9450
Reviewed By: ajkr
Differential Revision: D33802466
Pulled By: bjlemaire
fbshipit-source-id: 97570985f1400525304053476450f7ef504c0cd5
Summary:
Regexes are considered potentially problematic for use in
registering RocksDB extensions, so we are removing
ObjectLibrary::Register() and the Regex public API it depended on (now
unused).
In reference to https://github.com/facebook/rocksdb/issues/9389
Why?
* The power of Regexes can make it hard to reason about which extension
will match what. (The replacement API isn't perfect, but we are at least
"holding the line" on patterns we have seen in practice.)
* It is easy to make regexes that don't quite mean what you think they
mean, such as forgetting that the `.` in `foo.bar` can match any character
or that matching is nondeterministic, as in `a🅱️42` matching `.*:[0-9]+`.
* Some regexes and implementations can have disastrously bad
performance. This might not be much practical concern for ObjectLibray
here, but we don't want to encourage potentially dangerous further use
in production code. (Testing code is fine. See TestRegex.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9439
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33792342
Pulled By: pdillinger
fbshipit-source-id: 4f64dcb04764e639162c8977a5fa196f67754cec
Summary:
Add an option to set the WAL compression algorithm - wal_compression.
TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432
Reviewed By: pdillinger
Differential Revision: D33797275
Pulled By: sidroyc
fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7
Summary:
RocksDB has marked DB::AddFile() as "DEPRECATED_FUNC" for a long time, and
it will be removed in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9433
Test Plan: make check -j64; CircleCI
Reviewed By: riversand963
Differential Revision: D33763987
Pulled By: akankshamahajan15
fbshipit-source-id: a3407324479bb43689e1213e4e29d53095e7579a
Summary:
This PR moves RADOS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host RADOS support. At this point,
people can start from the example repo and fork.
The goal is to include this commit in RocksDB 7.0 release.
Reference:
https://github.com/ajkr/dedupfs by ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9206
Test Plan:
Follow instructions in https://github.com/riversand963/rocksdb-rados-env/blob/main/README.md and build
test binary `env_librados_test` and run it.
Also, make check
Reviewed By: ajkr
Differential Revision: D33751690
Pulled By: riversand963
fbshipit-source-id: 30466c62afa9e4619847a48567ed158e62835e35
Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.
Java/JNI is not included yet, and needs to be done later if necessary.
The goal is to include this commit in RocksDB 7.0 release.
Reference:
https://github.com/ajkr/dedupfs by ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170
Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.
make check
Reviewed By: ajkr
Differential Revision: D33751662
Pulled By: riversand963
fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
Summary:
Loose ends relate to mmap on 32-bit systems. (Testing is more
complicated when the feature was completely disabled on 32-bit.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9386
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D33590715
Pulled By: pdillinger
fbshipit-source-id: f2637036a538a552200adee65b6765fce8cae27b
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9404
It is better practice to mark destructors as override. Without this
change there can be issues building with
-Wsuggest-destructor-override.
Reviewed By: riversand963
Differential Revision: D33671992
fbshipit-source-id: 75b0c15010cbab5fbc071c150fef1dc85d5d9d96
Summary:
The old block-based filter has been deprecated for years, but
this makes that more clear by marking the functions specific to it and
logging a warning when the feature is used.
It is deprecated because of performance. In that old design, you have to
binary search through the full SST index before a bloom filter query, which
is much more expensive than a bloom query itself.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9403
Test Plan:
Used db_bench with and without -use_block_based_filter,
running at the same time
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom,readrandom -num=10000000 -duration=20 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0
No significant difference in construction time but 3x slower readrandom
with -use_block_based_filter:
readrandom : 100.517 micros/op 9948 ops/sec; 1.1 MB/s
vs.
readrandom : 33.368 micros/op 29968 ops/sec; 3.3 MB/s
Also saw deprecation message (just once) in LOG only with
-use_block_based_filter
Reviewed By: ajkr
Differential Revision: D33673202
Pulled By: pdillinger
fbshipit-source-id: 99f6f0eff619408d9e5f7ef546954ed0be6c7a5b
Summary:
**Context/Summary:**
There are two `RateLimiter::Request()` in public header. One of them is missing some comment that the other one has.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9392
Test Plan: rely on CI test
Reviewed By: pdillinger
Differential Revision: D33623609
Pulled By: hx235
fbshipit-source-id: 42dc06308ff0bcf5ee7ef67e0b1c0172fc239b20
Summary:
In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out"
of extra checks that can impact peak write performance, which
currently only includes force_consistency_checks. I considered including
some other options but did not see a db_bench performance difference.
Also clarify in comment for force_consistency_checks that it can "slow
down saturated writing."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363
Test Plan:
basic coverage in unit tests
Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see
force_consistency_checks=true -> 725360 ops/s
force_consistency_checks=false -> 783072 ops/s
Reviewed By: mrambacher
Differential Revision: D33636559
Pulled By: pdillinger
fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84
Summary:
Range Locking supports Lock Escalation. Lock Escalation is invoked when
lock memory is nearly exhausted and it reduced the amount of memory used
by joining adjacent locks.
Bridging the gap between certain locks has adverse effects. For example,
in MyRocks it is not a good idea to bridge the gap between locks in
different indexes, as that get the lock to cover large portions of
indexes, or even entire indexes.
Resolve this by introducing Escalation Barrier. The escalation process
will call the user-provided barrier callback function:
bool(const Endpoint& a, const Endpoint& b)
If the function returns true, there's a barrier between a and b and Lock
Escalation will not try to bridge the gap between a and b.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9290
Reviewed By: akankshamahajan15
Differential Revision: D33486753
Pulled By: riversand963
fbshipit-source-id: f97910b67aba0579ea1d35f523ca6863d3dd018e
Summary:
Compatible change, more natural (especially in generated Rust bindings), no risk that the API will ever need mutable access because it has to make a copy anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9376
Reviewed By: ajkr
Differential Revision: D33541435
Pulled By: pdillinger
fbshipit-source-id: 15c512a0d70b6e8694fa99d598b7d022751c1e59
Summary:
In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions. The PatternEntry methods were left in place but renamed to AddFactory. The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.
Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362
Reviewed By: pdillinger
Differential Revision: D33432562
Pulled By: mrambacher
fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9370
GCC and newer clang, e.g. clang-12 treat `std::unique_ptr` slightly differently.
For the following code
```
#include <iostream>
#include <memory>
#include <type_traits>
struct A {
std::unique_ptr<int> m1;
};
int main()
{
std::cout << std::boolalpha;
std::cout << std::is_standard_layout<A>::value << '\n';
return 0;
}
```
GCC11(C++20) (tested on https://en.cppreference.com/w/cpp/types/is_standard_layout) will print "true", while newer clang, e.g. clang-12 will print "false". This breaks the usage of `offsetof()` on structs with non-static members of type `std::unique_ptr`.
Fixing this by replacing the builtin `offsetof` with a trick documented at https://gist.github.com/graphitemaster/494f21190bb2c63c5516.
Reviewed By: jay-zhuang
Differential Revision: D33420840
fbshipit-source-id: 02bde281dfa28809bec787ad0f7019e85dd9c607
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293
Reviewed By: pdillinger, zhichao-cao
Differential Revision: D33181591
Pulled By: mrambacher
fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
Summary:
Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex. For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes. For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes. There is also a class for Customizable::IndividualId format matches.
Added tests for the new derivative classes and got all unit tests to pass.
Resolves https://github.com/facebook/rocksdb/issues/9225.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9264
Reviewed By: pdillinger
Differential Revision: D33062001
Pulled By: mrambacher
fbshipit-source-id: c2d2143bd2d38bdf522705c8280c35381b135c03
Summary:
This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB.
By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334
Test Plan:
- See if below unsynced data loss crash test can run for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering.
```
DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400
```
Reviewed By: zhichao-cao
Differential Revision: D33301990
Pulled By: ajkr
fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3
Summary:
As (https://github.com/facebook/rocksdb/issues/9210) discussed, the **full_history_ts_low** is a member of CompactRangeOptions currently, which means a CF's fullHistoryTsLow is advanced only when users submit a CompactRange request.
However, users may want to advance the fllHistoryTsLow without an immediate compact.
This merge make IncreaseFullHistoryTsLow to a public API so users can advance each CF's fullHistoryTsLow seperately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9221
Reviewed By: akankshamahajan15
Differential Revision: D33201106
Pulled By: riversand963
fbshipit-source-id: 9cb1d013ba93260f72e16353e693ffee167b47ee
Summary:
locktree is a module providing Range Locking. It has a counter for
the number of times a lock acquisition request was blocked by an
existing conflicting lock and had to wait for it to be released.
Expose this counter in RangeLockManagerHandle::Counters::lock_wait_count.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9289
Reviewed By: jay-zhuang
Differential Revision: D33079182
Pulled By: riversand963
fbshipit-source-id: 25b1a362d9da247536ab5007bd15900b319f139e
Summary:
We saw the below assertion failure in `error_handler_fs_test`:
```
db/error_handler_fs_test.cc:2471: Failure
Expected equality of these values:
listener->new_bg_error()
Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
Status::Aborted()
Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00>
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): db/error_handler_fs_test.cc:2471: Failure
Expected equality of these values:
listener->new_bg_error()
Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
Status::Aborted()
Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00>
Received signal 6 (Aborted)
```
The problem was completing `OnErrorRecoveryCompleted()` would
wake up the main thread and allow it to proceed to that assertion. But
that assertion assumes `OnErrorRecoveryEnd()` has completed since
only `OnErrorRecoveryEnd()` affects `new_bg_error()`.
The fix is just to make `OnErrorRecoveryCompleted()` not wake up the
main thread, by means of not implementing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9325
Test Plan:
- ran `while TEST_TMPDIR=/dev/shm ./error_handler_fs_test ; do : ; done` for a while
- injected sleep between `OnErrorRecovery{Completed,End}()` callbacks, which guaranteed repro before this PR
Reviewed By: anand1976
Differential Revision: D33249200
Pulled By: ajkr
fbshipit-source-id: 1659ee183cd09f90d4dbd898f65103473fcf84a8
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980
Reviewed By: zhichao-cao
Differential Revision: D32990403
Pulled By: mrambacher
fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
Summary:
As title, Closes https://github.com/facebook/rocksdb/issues/9272
Since TimestampAssigner-related classes needs to access
`WriteBatch::ProtectionInfo` objects which is for internal use only,
it's difficult to make `AssignTimestamp` methods a template and put them
in the same public header, `include/rocksdb/write_batch.h`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9278
Test Plan:
```
make check
# Also manually test following the repro-steps in issue 9272
```
Reviewed By: ltamasi
Differential Revision: D33012686
Pulled By: riversand963
fbshipit-source-id: 89f24a86a1170125bd0b94ef3b32e69aa08bd949
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
I'm working on a new format_version=6 to support context
checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
updates to support that change.
Test coverage data and manual inspection agree on dead code in
block_based_table_reader.cc (removed).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240
Test Plan:
tests enhanced to cover more cases etc.
Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}
(Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
Before w/ wal: 454512
After w/ wal: 444820 (-2.1%)
Before w/o wal: 1004560
After w/o wal: 998897 (-0.6%)
Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.
This regression will be corrected in a follow-up PR.
Reviewed By: ajkr
Differential Revision: D32813769
Pulled By: pdillinger
fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
Summary:
Previously, the OnErrorRecoveryCompleted callback was called when
RocksDB was able to successfully recover from a retryable error.
However, if the recovery failed and was eventually stopped, there was no
indication of the status. To fix that, a new OnErrorRecoveryEnd callback
is introduced that deprecates the OnErrorRecoveryCompleted callback. The
new callback is called with the original error and the new error status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9244
Test Plan: Add a new unit test in error_handler_fs_test
Reviewed By: zhichao-cao
Differential Revision: D32922303
Pulled By: anand1976
fbshipit-source-id: f04e77a9cb92c5ea6385590682d3fcf559971b99
Summary:
**Context:**
Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offeset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties.
Note:
- See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets`
- See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence.
**Summary:**
- Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212
Test Plan: - Relied on existing tests should be sufficient since `TableProperties::properties_offsets` existed before and should already be tested.
Reviewed By: ajkr
Differential Revision: D32665941
Pulled By: hx235
fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9205
Update WriteBatch::AssignTimestamp() APIs so that they take an
additional argument, i.e. a function object called `checker` indicating the user-specified logic of performing
checks on timestamp sizes.
WriteBatch is a building block used by multiple other RocksDB components, each of which may track
timestamp information in different data structures. For example, transaction can either write to
`WriteBatchWithIndex` which is a `WriteBatch` with index, or write directly to raw `WriteBatch` if
`Transaction::DisableIndexing()` is called.
`WriteBatchWithIndex` keeps mapping from column family id to comparator, and transaction needs
to keep similar information for the `WriteBatch` if user calls `Transaction::DisableIndexing()` (dynamically)
so that we will know the size of each timestamp later. The bookkeeping info maintained by `WriteBatchWithIndex`
and `Transaction` should not overlap.
When we later call `WriteBatch::AssignTimestamp()`, we need to use these data structures to guarantee
that we do not accidentally assign timestamps for keys from column families that disable timestamp.
Reviewed By: ltamasi
Differential Revision: D31735186
fbshipit-source-id: 8b1709ed880ac72f995aa9e012e5873b290840a7
Summary:
Extend C API to add new function `rocksdb_livefiles_column_family_name`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9232
Reviewed By: akankshamahajan15
Differential Revision: D32736516
Pulled By: ajkr
fbshipit-source-id: a854256a0f4652c903ab5ad8355ded051ac19987
Summary:
1. Fix GetOptionsPtr for Wrapped (Inner() != nullptr) Customizable objects. This allows the inner options to be returned via this method.
2. Allow the option type map to be nullptr. This allows objects to be registered as options (for GetOptionsPtr) but not be used by the configuration methods.
Added tests as appropriate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9213
Reviewed By: zhichao-cao
Differential Revision: D32718882
Pulled By: mrambacher
fbshipit-source-id: 563203d1f006a2629060feb31c5dff9a233e1e83
Summary:
* added missing override specifiers for overriden methods
this fixes compiler warnings emitted by g++ and clang++ when compile option `-Wsuggest-override` is turned on.
* fix compile warning with -Wmaybe-uninitialized
g++-11 warns about a _potentially_ uninitialized variable when using `-Wmaybe_uninitialized`:
```
env/env.cc: In member function ‘virtual rocksdb::Status rocksdb::Env::GetHostNameString(std::string*)’:
env/env.cc:738:66: error: ‘hostname_buf’ may be used uninitialized [-Werror=maybe-uninitialized]
738 | Status s = GetHostName(hostname_buf.data(), hostname_buf.size());
| ^
In file included from /usr/include/c++/11/tuple:39,
from /usr/include/c++/11/functional:54,
from ./include/rocksdb/env.h:22,
from env/env.cc:10:
/usr/include/c++/11/array:176:7: note: by argument 1 of type ‘const std::array<char, 256>*’ to ‘constexpr std::array<_Tp, _Nm>::size_type std::array<_Tp, _Nm>::size() const [with _Tp = char; long unsigned int _Nm = 256]’ declared here
176 | size() const noexcept { return _Nm; }
| ^~~~
env/env.cc:737:37: note: ‘hostname_buf’ declared here
737 | std::array<char, kMaxHostNameLen> hostname_buf;
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9199
Reviewed By: jay-zhuang
Differential Revision: D32630703
Pulled By: pdillinger
fbshipit-source-id: 9ea3010b1105a582548e3c3c0db4475b201e4a10
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
Test Plan: Ran `make check` and the stress/crash test.
Reviewed By: riversand963
Differential Revision: D32565512
Pulled By: ltamasi
fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
Summary:
`ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes` are
deprecated, please try using user defined timestamp feature instead.
The feature is used to support differential snapshots, but not well
maintained (https://github.com/facebook/rocksdb/issues/6837, https://github.com/facebook/rocksdb/issues/8472) and the interface is not user friendly which
returns an internal key from the iterator. The user defined timestamp
feature is a more flexible feature to support similar usecase, please
switch to that if you have such usecase.
The deprecated feature will be removed in a future release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9091
Test Plan:
check LOG
Fix https://github.com/facebook/rocksdb/issues/9090
Reviewed By: ajkr
Differential Revision: D32071750
Pulled By: jay-zhuang
fbshipit-source-id: b882c4668dd1bf26ce03c4c192f1bba584bf6104
Summary:
- Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect.
- Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179
Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency.
Reviewed By: riversand963
Differential Revision: D32488048
Pulled By: ajkr
fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310
Summary:
Add a new API in listener.h that notifies about IOErrors on
Read/Write/Append/Flush etc. The API reports about IOStatus, filename, Operation
name, offset and length.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9177
Test Plan: Added new unit tests
Reviewed By: anand1976
Differential Revision: D32470627
Pulled By: akankshamahajan15
fbshipit-source-id: 189a717033590ae227b3beae8b1e7e185e4cdc12
Summary:
* Checksums are now checked on meta blocks unless specifically
suppressed or not applicable (e.g. plain table). (Was other way around.)
This means a number of cases that were not checking checksums now are,
including direct read TableProperties in Version::GetTableProperties
(fixed in meta_blocks ReadTableProperties), reading any block from
PersistentCache (fixed in BlockFetcher), read TableProperties in
SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open,
maybe more.
* For that to work, I moved the global_seqno+TableProperties checksum
logic to the shared table/ code, because that is used by many utilies
such as SstFileDumper.
* Also for that to work, we have to know when we're dealing with a block
that has a checksum (trailer), so added that capability to Footer based
on magic number, and from there BlockFetcher.
* Knowledge of trailer presence has also fixed a problem where other
table formats were reading blocks including bytes for a non-existant
trailer--and awkwardly kind-of not using them, e.g. no shared code
checking checksums. (BlockFetcher compression type was populated
incorrectly.) Now we only read what is needed.
* Minimized code duplication and differing/incompatible/awkward
abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block
without parsing block handle)
* Moved some meta block handling code from table_properties*.*
* Moved some code specific to block-based table from shared table/ code
to BlockBasedTable class. The checksum stuff means we can't completely
separate it, but things that don't need to be in shared table/ code
should not be.
* Use unique_ptr rather than raw ptr in more places. (Note: you can
std::move from unique_ptr to shared_ptr.)
Without enhancements to GetPropertiesOfAllTablesTest (see below),
net reduction of roughly 100 lines of code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163
Test Plan:
existing tests and
* Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that
checksums are now checked on direct read of table properties by TableCache
(new test would fail before this change)
* Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test
putting table properties under old meta name
* Also generally enhanced that same test to actually test what it was
supposed to be testing already, by kicking things out of table cache when
we don't want them there.
Reviewed By: ajkr, mrambacher
Differential Revision: D32514757
Pulled By: pdillinger
fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9
Summary:
Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge.
**Context:**
Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`.
The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option.
**Summary:**
- Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction:
- hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance),
- banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`),
- final filter (i.e, `mutable_buf`'s size).
- Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table.
- Added option fo run `filter_bench` with this memory reservation feature
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073
Test Plan:
- Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory`
- To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around **3000 - 6000 ms** and others take around **1500 - 2000 ms**, in total adding **20000 - 25000 ms** to the test suit running locally
- Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter
- Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over **20** run as below:
- FastLocalBloom
- baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 29.56295** (DEBUG_LEVEL=1), **29.98153** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`:
- **Build avg ns/key: 30.99046** (DEBUG_LEVEL=1), **30.48867** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 31.146975** (DEBUG_LEVEL=1), **30.08165** (DEBUG_LEVEL=0)
- Ribbon128
- baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 129.17585** (DEBUG_LEVEL=1), **130.5225** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `:
- **Build avg ns/key: 131.61645** (DEBUG_LEVEL=1), **132.98075** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 52.032965** (DEBUG_LEVEL=1), **52.597825** (DEBUG_LEVEL=0)
- And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console.
Reviewed By: pdillinger
Differential Revision: D31991348
Pulled By: hx235
fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e
Summary:
Add the 3 read bytes counter to the Statistic, which will be used by storage tiering and get the information for files with different temperature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9123
Test Plan: added new testing cases.
Reviewed By: siying
Differential Revision: D32154745
Pulled By: zhichao-cao
fbshipit-source-id: b7905d6dae469a72428742364ec07b634b6f15da
Summary:
We have three layers of block cache that often use the same key
but map to different physical data:
* BlockBasedTableOptions::block_cache
* BlockBasedTableOptions::block_cache_compressed
* BlockBasedTableOptions::persistent_cache
If any two of these happen to share an underlying implementation and key
space (insertion into one shows up in another), then memory safety is
broken. The simplest case is block_cache == block_cache_compressed.
(Credit mrambacher for asking about this case in a review.)
With this change, we explicitly check for overlap and preemptively and
safely fail with a Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9172
Test Plan: test added. Crashes without new check
Reviewed By: anand1976
Differential Revision: D32465659
Pulled By: pdillinger
fbshipit-source-id: 3876b45b6dce6167e5a7a642725ddc86b96f8e40
Summary:
I was unable to figure out the behavior by reading the old doc so attempted to
write it differently.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9154
Reviewed By: riversand963
Differential Revision: D32338843
Pulled By: ajkr
fbshipit-source-id: e1e67720cd92572b195583e5ea2c592180d4fefd
Summary:
Implement the Name() method in FileSystemWrapper, since https://github.com/facebook/rocksdb/issues/8649 removed it and it can cause compilation failures. We can deprecate it in RocksDB 7.0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9156
Reviewed By: riversand963
Differential Revision: D32363977
Pulled By: anand1976
fbshipit-source-id: 1e5a2fec2ab0649255720d89abf5bac26bb64ded
Summary:
RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB.
This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block.
1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer), readahead_size will decrease by 8KB.
2. It maintains readahead_size for L1 - Ln levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056
Test Plan:
Added new unit tests
Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled.
Reviewed By: anand1976
Differential Revision: D31773640
Pulled By: akankshamahajan15
fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
Summary:
Before this fix compilation with GCC 4.8.5 20150623 (Red Hat 4.8.5-36) would fail with the following error:
```
CC jls/db/db_impl/db_impl.o
In file included from ./env/file_system_tracer.h:8:0,
from ./file/random_access_file_reader.h:15,
from ./file/file_prefetch_buffer.h:15,
from ./table/format.h:13,
from ./table/internal_iterator.h:14,
from ./db/pinned_iterators_manager.h:12,
from ./db/range_tombstone_fragmenter.h:15,
from ./db/memtable.h:22,
from ./db/memtable_list.h:16,
from ./db/column_family.h:17,
from ./db/db_impl/db_impl.h:22,
from db/db_impl/db_impl.cc:9:
./include/rocksdb/file_system.h:108:8: error: unused parameter 'opts'
[-Werror=unused-parameter]
struct FileOptions : EnvOptions {
^
db/db_impl/db_impl.cc: In member function 'virtual rocksdb::Status
rocksdb::DBImpl::SetDBOptions(const
std::unordered_map<std::basic_string<char>, std::basic_string<char>
>&)':
db/db_impl/db_impl.cc:1230:36: note: synthesized method
'rocksdb::FileOptions& rocksdb::FileOptions::operator=(const
rocksdb::FileOptions&)' first required here
file_options_for_compaction_ = FileOptions(new_db_options);
^
CC jls/db/db_impl/db_impl_compaction_flush.o
cc1plus: all warnings being treated as errors
make[1]: *** [jls/db/db_impl/db_impl.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory `/rocksdb-local-build'
make: *** [rocksdbjavastatic] Error 2
Makefile:2222: recipe for target 'rocksdbjavastaticdockerarm64v8' failed
make: *** [rocksdbjavastaticdockerarm64v8] Error 2
```
This was detected on both ppc64le and arm64v8, however it does not seem to appear in the same GCC 4.8 version we use for x64 in CircleCI - https://app.circleci.com/pipelines/github/facebook/rocksdb/9691/workflows/c2a94367-14f3-4039-be95-325c34643d41/jobs/227906
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9144
Reviewed By: riversand963
Differential Revision: D32290770
fbshipit-source-id: c90a54ba2a618e1ff3660fff3f3368ab36c3c527
Summary:
The individual commits in this PR should be self-explanatory.
All small and _very_ low-priority changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5896
Reviewed By: riversand963
Differential Revision: D18065108
Pulled By: mrambacher
fbshipit-source-id: 236b1a1d9d21f982cc08aa67027108dde5eaf280
Summary:
Add clarification/extension to comments on max_total_wal_size and the Java wrapper MaxTotalWalSize to better explain the effect of the option on log file sizes.
Closes https://github.com/facebook/rocksdb/issues/5789
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9108
Reviewed By: pdillinger
Differential Revision: D32066640
Pulled By: mrambacher
fbshipit-source-id: 7d5affc87e4119019054af9c884a2ea01d68f5b7
Summary:
Context:
Surprisingly, there isn't any sanitization against negative `int64_t bytes` in `GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri, Statistics* stats)`. A negative `bytes` can be passed in and incorrectly increases `available_bytes_` by subtracting the negative `bytes` from `available_bytes_`, such as [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L138) and [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L283), which are incorrect behaviors.
- Sanitized negative request bytes by rounding it up to 0
- Added notes to public and internal API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9112
Test Plan: - Rely on existing tests
Reviewed By: ajkr
Differential Revision: D32085364
Pulled By: hx235
fbshipit-source-id: b1b6066b2dd5ffc7bcbfb07069ca65a33578251b
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
`IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
forced, the same as above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
Test Plan: run tests on btrfs and non btrfs
Reviewed By: ajkr
Differential Revision: D30885059
Pulled By: jay-zhuang
fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
Summary:
* Clarify that RocksDB is not exception safe on many of our callback
and extension interfaces
* Clarify FSRandomAccessFile::MultiRead implementations must accept
non-sorted inputs (see https://github.com/facebook/rocksdb/issues/8953)
* Clarify ConcurrentTaskLimiter and SstFileManager are not (currently)
extensible interfaces
* Mark WriteBufferManager as `final`, so it is then clearly not a
callback interface, even though it smells like one
* Clarify TablePropertiesCollector Status returns are mostly ignored
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9080
Test Plan: comments only (except WriteBufferManager final)
Reviewed By: ajkr
Differential Revision: D31968782
Pulled By: pdillinger
fbshipit-source-id: 11b648ce3ce3c5e5bdc02d2eafc7ea4b864bd1d2
Summary:
This feature was not part of any common or CI build, so no
surprise it broke. Now we can at least ensure compilation. I don't know
how to run the test successfully (missing config file) so it is bypassed
for now.
Fixes https://github.com/facebook/rocksdb/issues/9078
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9088
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D32009467
Pulled By: pdillinger
fbshipit-source-id: 3e0d1e5fde7f0ece703d48a81479e1cc7392c25c
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.
Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument
This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.
DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.
Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.
### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
As you can see, especially once warmed up, xxh3 is fastest.
### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
Test
for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
Results (ops/sec)
for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3
Reviewed By: mrambacher
Differential Revision: D31905249
Pulled By: pdillinger
fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
Summary:
Somewhat confusingly, index and filter partition blocks are
never owned by table readers, even with
cache_index_and_filter_blocks=false. They still go into block cache
(possibly pinned by table reader) if there is a block cache. If no block
cache, they are only loaded transiently on demand.
This PR primarily clarifies the options APIs and some internal code
comments.
Also, this closes a hypothetical data corruption vulnerability where
some but not all index partitions are pinned. I haven't been able to
reproduce a case where it can happen (the failure seems to propagate
to abort table open) but it's worth patching nonetheless.
Fixes https://github.com/facebook/rocksdb/issues/8979
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9068
Test Plan:
existing tests :-/ I could cover the new code using sync
points, but then I'd have to very carefully relax my `assert(false)`
Reviewed By: ajkr
Differential Revision: D31898284
Pulled By: pdillinger
fbshipit-source-id: f2511a7d3a36bc04b627935d8e6cfea6422f98be
Summary:
This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057
Test Plan: new unit test
Reviewed By: zhichao-cao
Differential Revision: D31781495
Pulled By: ajkr
fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b
Summary:
This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty.
In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control.
Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655
Test Plan: Add unit tests to the functionality.
Reviewed By: ajkr
Differential Revision: D31787034
fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
Summary:
Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By set it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050
Test Plan: added new tests
Reviewed By: anand1976
Differential Revision: D31744769
Pulled By: zhichao-cao
fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).
Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990
Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D31582865
Pulled By: pdillinger
fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243