Summary:
When sub compaction is decided for L0->L1 compaction, most of the cases, all L0 files will be involved in all sub compactions. However, it is not always the case. When files are generally (but not strictly) inserted in sequential order, there can be a subset of L0 files invovled. Yet RocksDB always open all those L0 files, and build an iterator, read many of the files' first of last block with expensive readahead. We trim some input files to reduce overhead a little bit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9802
Test Plan: Add a unit test to cover this case and manually validate the behavior while running the test.
Reviewed By: ajkr
Differential Revision: D35371031
fbshipit-source-id: 701ed7375b5cbe41672e93b38fe8a1503dad08b6
Summary:
This change adds two unit tests that would each catch the
regression fixed in https://github.com/facebook/rocksdb/issues/9736
* TableMetaIndexKeys - detects any churn in metaindex block keys
generated by SST files using standard db_test_util configurations.
* BloomFilterCompatibility - this detects if any common built-in
FilterPolicy configurations fail to read filters generated by another.
(The regression bug caused NewRibbonFilterPolicy not to read filters
from NewBloomFilterPolicy and vice-versa.) This replaces some previous
tests that didn't really appear to be testing much of anything except
basic data correctness, which doesn't tell you a filter is being used.
Light refactoring in meta_blocks.cc/h to support inspecting metaindex
keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9773
Test Plan:
this is the test. Verified that 7.0.2 fails both tests and 7.0.3 passes.
With backporting for intentional API changes in 7.0, 6.29 also passes.
Reviewed By: ajkr
Differential Revision: D35236248
Pulled By: pdillinger
fbshipit-source-id: 493dfe9ad7e27524bf7c6c1af8a4b8c31bc6ef5a
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9803
Only use Meta-internal version now. precommit_checker.py also now obsolete
Bring back `make commit_prereq` in follow-up work
Reviewed By: jay-zhuang
Differential Revision: D35372283
fbshipit-source-id: 7428438ca51f878802c301d0d5591675e551a113
Summary:
Update stats in random_access_file_reader for Read and
ReadAsync API to take into account the read latency for async
prefetching.
It also fixes ERROR_HANDLER_AUTORESUME_RETRY_COUNT stat whose value was
incorrect in portal.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9810
Test Plan: Update unit test
Reviewed By: anand1976
Differential Revision: D35433081
Pulled By: akankshamahajan15
fbshipit-source-id: aeec3901270e58a003ce6b5214bd25ddcb3a12a9
Summary:
Fixes https://github.com/facebook/rocksdb/issues/9779.
The padding at the end of a struct is added implicitly according to the
sizeof spec: "When applied to a class, the result is the
number of bytes in an object of that class including any padding
required for placing objects of that type in an array"
(https://eel.is/c++draft/expr.sizeof#2.sentence-2). We should drop the
explicit padding since it assumed support for zero-length arrays, which
is non-standard.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9809
Test Plan: rely on CI
Reviewed By: riversand963
Differential Revision: D35413496
Pulled By: ajkr
fbshipit-source-id: 25d52ca45e648ad0d5657149f26f6adecbed1cb4
Summary:
The name of this property "kIsFileDeletionsEnabled" is very, very easy to misunderstand.
I think 0 represents false (i.e. disabled) and non-0 means true (enabled), and this property is just the opposite.
I modified the name of this property, and as few other positions as possible, so that the final meaning remains the same, but the name of this property is more common sense.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9791
Reviewed By: ajkr
Differential Revision: D35362166
Pulled By: jay-zhuang
fbshipit-source-id: 85310d88bdd131893effb64e1adb7d0d7b202f88
Summary:
For write-prepared/write-unprepared transactions,
GetCommitTimeWriteBatch() can be used only if the transaction is started
with `TransactionOptions::use_only_the_last_commit_time_batch_for_recovery` set
to true. Otherwise, it is possible that multiple uncommitted versions of the
same key exist in the database. During bottommost compaction, RocksDB may
set the sequence numbers of both to zero once they become committed, causing
output SST file to have two identical internal keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9794
Test Plan:
make check
pay special attention to the following
```
transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/*
```
Reviewed By: lth
Differential Revision: D35327214
Pulled By: riversand963
fbshipit-source-id: 3bae00a28359c10e96e4c6f676d20de5610d8a0f
Summary:
Various renaming and fixes to get rid of remaining uses of
"backupable" which is terminology leftover from the original, flawed
design of BackupableDB. Now any DB can be backed up, using BackupEngine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D35334386
Pulled By: pdillinger
fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329
Summary:
**Context/Todo:**
As requested, allow IOOptions to take in an Env::IOPriority for convenience to pass down rate limiter related hint to file system level and for future interaction between RocksDB internal's rate limiting and custom file system level's rate-limiting.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9806
Test Plan: No actual code changes in RocksDB internals
Reviewed By: ajkr
Differential Revision: D35388966
Pulled By: hx235
fbshipit-source-id: 5891c97c3f9184cd221a9ab8536ce8dfa8526c08
Summary:
If FilePrefetchBuffer object is destroyed and then later Poll() calls callback on object which has been destroyed, it gives segfault on accessing destroyed object. It was caught after adding unit tests that tests Posix implementation of ReadAsync and Poll APIs.
This PR also updates and fixes existing IOURing tests which were not running locally because RocksDbIOUringEnable function wasn't defined and IOUring was disabled for those tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9777
Test Plan: Added new unit test
Reviewed By: anand1976
Differential Revision: D35254002
Pulled By: akankshamahajan15
fbshipit-source-id: 68e80054ffb14ae25c255920ebc6548ca5f130a1
Summary:
Make `commit_prereq` work and a few other improvements:
* Remove gcc 481 and gcc5xx which are no longer supported
* Remove platform007 which is gone
* `make clean` work for both mac and linux
* `precommit_checker.py` to python3
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9797
Test Plan: `make commit_prereq`
Reviewed By: ajkr
Differential Revision: D35338536
Pulled By: jay-zhuang
fbshipit-source-id: 1e159962ab9d31c43c4b85de7d0f582d3e881ffe
Summary:
Right now, parallelism information passed to "build_tools/rocksdb-lego-determinator no_compression" isn't effective when the test actually runs, as the information is dropped in the middle. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9796
Test Plan: Run "build_tools/rocksdb-lego-determinator no_compression" and execute the command line generated and observe the parallelism.
Reviewed By: jay-zhuang
Differential Revision: D35330085
fbshipit-source-id: e9b32d0520d61fbc2697ebd841099485f64482e3
Summary:
build_tools/rocksdb-lego-determinator is to generate commands for continuous tests. Recently it changed to by default run tests in parallel with parallelism to be number of CPU processors. This sometimes causes out of space when running so many tests in parallel. Reduce the parallelism by half to temporarily work it around.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9788
Test Plan: Run build_tools/rocksdb-lego-determinator and watch generated commands.
Reviewed By: pdillinger
Differential Revision: D35327704
fbshipit-source-id: 95a8c51a111bb6ab62c456c74ab9c905b457ea8f
Summary:
So the build on dev server will work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9787
Test Plan: `$ make db_basic_bench` on dev server.
Reviewed By: ajkr
Differential Revision: D35295466
Pulled By: jay-zhuang
fbshipit-source-id: 58dccc65bc29e1185b97cbeb7630ed66deb604aa
Summary:
There's an existing benchmark, "getmergeoperands", but it is unconventional in that it has multiple phases and hardcoded setup parameters.
This PR adds a different one, "readrandomoperands", that follows the pattern of other benchmarks of having a single phase and taking its configuration from existing flags.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9785
Test Plan:
```
$ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -write_buffer_size=1048576 -max_bytes_for_level_base=4194304 -target_file_size_base=1048576 -compression_type=none -disable_auto_compactions=true
$ ./db_bench -use_existing_db=true -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -disable_auto_compactions=true -duration=10
...
readrandomoperands : 542.082 micros/op 1844 ops/sec; 0.2 MB/s (11980 of 18999 found)
```
Reviewed By: jay-zhuang
Differential Revision: D35290412
Pulled By: ajkr
fbshipit-source-id: fb367ca614b128cef844a75f0e5d9dd7c3328d85
Summary:
min_log_number_to_keep denotes that the WALs whose numbers are below
this value **will** be deleted by RocksDB.
delete_wals_before will be used by RocksDB if
track_and_verify_wals_in_manifest is set to true. During recovery,
RocksDB uses the info encoded in delete_wals_before to reconstruct its
knowledge about what WALs to expect existing.
If these two tags are not encoded in the same VersionEdit, then it's
possible for min_log_number_to_keep=100 to exist, but
delete_wals_before=100 to be lost due to power failure. Subsequent
recovery will delete 99.log. If the db crashes again, the following
recovery will expect to see 99.log since there is no
delete_wals_before=100 in the MANIFEST, but the WAL is already deleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9766
Test Plan:
First of all, make check.
Second, format compatibility.
SHORT_TEST=1 ./tools/check_format_compatible.sh
Reviewed By: ltamasi
Differential Revision: D35203623
Pulled By: riversand963
fbshipit-source-id: 45623fc4b4b50d299d5e0f9559a3a4c5e9522c8f
Summary:
Right now we log a wrong error when DB::Open() fails. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9784
Test Plan: CI runs should pass
Reviewed By: ajkr, riversand963
Differential Revision: D35290203
fbshipit-source-id: ffc640afa27f6b0a2382ee153dc43f28d9e242be
Summary:
There is no need to release-and-acquire immediately when no listener is registered. This is
what we have been doing for `NotifyOnFlushBegin()`, `NotifyOnFlushCompleted()`, `NotifyOnCompactionBegin()`,
`NotifyOnCompactionCompleted()`, and some other `NotifyOnXX` methods in event_helpers.cc.
Do the same for `NotifyOnMemTableSealed ()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9758
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D35159552
Pulled By: riversand963
fbshipit-source-id: 6e0aac50bd5c8f506d809b6638c33a7a28d1e87f
Summary:
much needed
Some other minor tweaks also
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9778
Test Plan: existing tests
Reviewed By: ajkr
Differential Revision: D35258195
Pulled By: pdillinger
fbshipit-source-id: 974ddafc23a540aacceb91da72e81593d818f99c
Summary:
This commit was generated using `mgt import`.
pristine code for third-party libraries:
third-party/benchmark
upgrade google benchmark to v1.6.1
contains a local patch that reverts [this](https://github.com/google/benchmark/pull/1227?fbclid=IwAR2CCmIJmjU62SPPQQf_t8kdAsMjYv_Pa_GxabYUOdQpGPZUHKwbnYS_1oE) and changs `enum Flags` to be `enum Flags : uint32_t`.
Reviewed By: chadaustin
Differential Revision: D35136540
fbshipit-source-id: f3662f953cd87956e5e9b767e55e3697f99d3b49
Summary:
In making `SstFileMetaData` inherit from `FileStorageInfo`, I
overlooked setting some `FileStorageInfo` fields when then default
`SstFileMetaData()` ctor is used. This affected `GetLiveFilesMetaData()`.
Also removed some buggy `static_cast<size_t>`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9769
Test Plan: Updated tests
Reviewed By: jay-zhuang
Differential Revision: D35220383
Pulled By: pdillinger
fbshipit-source-id: 05b4ee468258dbd3699517e1124838bf405fe7f8
Summary:
Fixes https://github.com/facebook/rocksdb/issues/9718
The verify_checksums flag of read_options should be passed to the read options used by the BlockFetcher in a couple of cases where it is not at present. It will now happen (but did not, previously) on iteration and on [multi]get, where a fetcher is created as part of the iterate/get call.
This may result in much better performance in a few workloads where the client chooses to remove verification.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9767
Reviewed By: mrambacher
Differential Revision: D35218986
Pulled By: jay-zhuang
fbshipit-source-id: 329d29764bb70fbc7f2673440bc46c107a813bc8
Summary:
In ReadOption `async_io` which prefetches the data asynchronously, db_bench and db_stress runs were failing because wrong data was prefetched which resulted in Error: Checksum mismatched. Wrong data was copied because capacity was less than actual size needed. It has been fixed in this PR.
Since there are two separate methods for async and sync prefetching, these changes are in async prefetching methods and any changes would not effect normal prefetching. I ran the regressions to make sure normal prefetching is fine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9734
Test Plan:
1. CircleCI jobs
2. Ran db_bench
```
. /db_bench -use_existing_db=true
-db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32
-value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680
-duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1
```
3. Ran db_stress test
```
export CRASH_TEST_EXT_ARGS=" --async_io=1 --adaptive_readahead=1"
make crash_test -j
```
4. Run regressions for async_io disabled.
Old flow without any async changes:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB: version 7.0
Date: Thu Mar 17 13:11:34 2022
CPU: 24 * Intel Core Processor (Broadwell)
CPUCache: 16384 KB
Keys: 32 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2594.0 MB (estimated)
FileSize: 1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom : 483618.390 micros/op 2 ops/sec; 338.9 MB/s (249 of 249 found)
```
With async prefetching changes and async_io disabled to make sure in normal prefetching there is no regression.
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 --async_io=0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB: version 7.1
Date: Wed Mar 23 15:56:37 2022
CPU: 24 * Intel Core Processor (Broadwell)
CPUCache: 16384 KB
Keys: 32 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2594.0 MB (estimated)
FileSize: 1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom : 481819.816 micros/op 2 ops/sec; 340.2 MB/s (250 of 250 found)
```
Reviewed By: riversand963
Differential Revision: D35058471
Pulled By: akankshamahajan15
fbshipit-source-id: 9233a1e6d97cea0c7a8111bfb9e8ac3251c341ce
Summary:
Fixes a bug introduced by me in https://github.com/facebook/rocksdb/pull/9733
That PR added a counter so that the per-thread seeds in ThreadState would
be unique even when --benchmarks had more than one test. But it incorrectly
used this counter as the value for ThreadState::tid as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9757
Test Plan:
Confirm that unexpectedly good QPS results on the regression tests return
to normal with this fix. I have confirmed that the QPS increase starts with
the PR 9733 diff.
Reviewed By: jay-zhuang
Differential Revision: D35149303
Pulled By: mdcallag
fbshipit-source-id: dee5cc36b7faaba6c3be6d6a253d3c2eaad72864
Summary:
Uniformly use GetByteArrayRegion() instead of GetByteArrayElements()
to copy bytes.
In addition, it can avoid an inefficient ReleaseByteArrayElements()
operation.
Some benefits of GetByteArrayRegion() can be referred to:
https://stackoverflow.com/a/2480493
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9380
Reviewed By: ajkr
Differential Revision: D35135474
Pulled By: jay-zhuang
fbshipit-source-id: a32c1774d37f2d22b9bcd105d83e0bb984b71b54
Summary:
This is for https://github.com/facebook/rocksdb/issues/9737
I have wasted more than a few hours running db_bench benchmarks where --seed was not set
and getting better than expected results because cache hit rates are great because
multiple invocations of db_bench used the same value for --seed or did not set it,
and then all used 0. The result is that all see the same sequence of keys.
Others have done the same. The problem is worse in that it is easy to miss and the result is a benchmark with results that are misleading.
A good way to avoid this is to set it to the equivalent of gettimeofday() when either
--seed is not set or it is set to 0 (the default).
With this change the actual seed is printed when it was 0 at process start:
Set seed to 1647992570365606 because --seed was 0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9740
Test Plan:
Perf results:
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000
readrandom : 6.469 micros/op 154583 ops/sec; 17.1 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=0
readrandom : 6.565 micros/op 152321 ops/sec; 16.9 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=1
readrandom : 6.461 micros/op 154777 ops/sec; 17.1 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=2
readrandom : 6.525 micros/op 153244 ops/sec; 17.0 MB/s (4000000 of 4000000 found)
Reviewed By: jay-zhuang
Differential Revision: D35145361
Pulled By: mdcallag
fbshipit-source-id: 2b35b153ccec46b27d7c9405997523555fc51267
Summary:
After commit [d642c60](d642c60bdc), the stats `READ_BLOCK_COMPACTION_MICROS` cannot record any compaction read duration, and it always report zero.
This PR targets to distinguish `READ_BLOCK_COMPACTION_MICROS` with `READ_BLOCK_GET_MICROS` so that `READ_BLOCK_COMPACTION_MICROS` could record the correct stats.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9722
Reviewed By: ajkr
Differential Revision: D35021870
Pulled By: jay-zhuang
fbshipit-source-id: f1a804994265e51465de64c2a08f2e0eeb6fc5a3
Summary:
Seems clean-rocksjava and clean-rocks conflict.
Also remove unnecessary step in java CI build, otherwise it will rebuild
the code again as java make sample do clean up first.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9710
Test Plan: `make rocksdbjava && make clean` should return success
Reviewed By: riversand963
Differential Revision: D35122872
Pulled By: jay-zhuang
fbshipit-source-id: 2a15b83e7a763c0fc0e42e1f35aac9551f951ece
Summary:
This adds the --slow_usecs option with a default value of 1M. Operations that
take this much time have a message printed when --histogram=1, --stats_interval=0
and --stats_interval_seconds=0. The current code hardwired this to 20,000 usecs
and for some stress tests that reduced throughput by 20% or more.
This is for https://github.com/facebook/rocksdb/issues/9620
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9732
Test Plan:
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100 --histogram=1
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100000 --histogram=1
Reviewed By: jay-zhuang
Differential Revision: D35121522
Pulled By: mdcallag
fbshipit-source-id: daf27f937efd748980545d6395db332712fc078b
Summary:
Although ColumnFamilySet comments say that DB mutex can be
freed during iteration, as long as you hold a ref while releasing DB
mutex, this is not quite true because UnrefAndTryDelete might delete cfd
right before it is needed to get ->next_ for the next iteration of the
loop.
This change solves the problem by making a wrapper class that makes such
iteration easier while handling the tricky details of UnrefAndTryDelete
on the previous cfd only after getting next_ in operator++.
FreeDeadColumnFamilies should already have been obsolete; this removes
it for good. Similarly, ColumnFamilySet::iterator doesn't need to check
for cfd with 0 refs, because those are immediately deleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9730
Test Plan:
was reported with ASAN on unit tests like
DBLogicalBlockSizeCacheTest.CreateColumnFamily (very rare); keep watching
Reviewed By: ltamasi
Differential Revision: D35038143
Pulled By: pdillinger
fbshipit-source-id: 0a5478d5be96c135343a00603711b7df43ae19c9
Summary:
Extend Java RocksDB iterators to support indirect byte buffers, to add to the existing support for direct byte buffers.
Code to distinguish direct/indirect buffers is switched in Java, and a 2nd separate JNI call implemented to support indirect
buffers. Indirect support passes contained buffers using byte[]
There are some Java subclasses of iterator (WBWIIterator, SstFileReaderIterator) which also now have parallel JNI support functions implemented, along with direct/indirect switches in Java methods.
Closes https://github.com/facebook/rocksdb/issues/6282
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9222
Reviewed By: ajkr
Differential Revision: D35115283
Pulled By: jay-zhuang
fbshipit-source-id: f8d5d20b975aef700560fbcc99f707bb028dc42e