Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
Summary:
Annotate all of the logging functions to inform the compiler that these
use printf-style formatting arguments. This allows the compiler to emit
warnings if the format arguments are incorrect.
This also fixes many problems reported now that format string checking
is enabled. Many of these are simply mix-ups in the argument type (e.g,
int vs uint64_t), but in several cases the wrong number of arguments
were being passed in which can cause the code to crash.
The primary motivation for this was to fix the log message in
`DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
format parameter with no argument supplied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089
Differential Revision: D14574795
Pulled By: simpkins
fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
Summary:
Currently `perf_context.user_key_comparison_count` is bump only in `InternalKeyComparator`. For places user comparator is used directly the counter is not bump. Fixing the majority of it.
Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098
Differential Revision: D14603753
Pulled By: siying
fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.
This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.
2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.
3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.
4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899
Differential Revision: D14510945
Pulled By: riversand963
fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
Summary:
We introduced ttl option in CompactionOptionsFIFO when ttl-based file
deletion (compaction) was supported only as part of FIFO Compaction. But
with the extension of ttl semantics even to Level compaction,
CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
using ColumnFamilyOptions.ttl for FIFO compaction as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965
Differential Revision: D14072960
Pulled By: sagar0
fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
Summary:
Make file ingestion atomic.
as title.
Ingesting external SST files into multiple column families should be atomic. If
a crash occurs and db reopens, either all column families have successfully
ingested the files before the crash, or non of the ingestions have any effect
on the state of the db.
Also add unit tests for atomic ingestion.
Note that the unit test here does not cover the case of incomplete atomic group
in the MANIFEST, which is covered in VersionSetTest already.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895
Differential Revision: D13718245
Pulled By: riversand963
fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889
Differential Revision: D13701276
Pulled By: zinoale
fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
Summary:
Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't
prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio
to boost files with lots of tombstones too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907
Differential Revision: D13788774
Pulled By: siying
fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3
Summary:
https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1.
It doesn't preload index and filter in non-initial file loading case. This is a little bit too
complicated to understand. We observed in one MyRocks use case where the filter is expected to be
preloaded but is not. To simplify the use case, we simply always prefetch the index and filter.
They anyway is expected to be loaded in the file verification phase anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852
Differential Revision: D13595402
Pulled By: siying
fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31
Summary:
Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:
- L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
- L1's file2 gets compacted to L2.
- User issues `Get()` for b#3.
- L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
- `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.
The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829
Differential Revision: D13561355
Pulled By: ajkr
fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
Summary:
Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340
Differential Revision: D6686945
Pulled By: siying
fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778
Differential Revision: D13495930
Pulled By: abhimadan
fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758
Differential Revision: D13439254
Pulled By: abhimadan
fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
Summary:
If one column family is dropped, we should simply skip it and continue to flush
other active ones.
Currently we use Status::ShutdownInProgress to notify caller of column families
being dropped. In the future, we should consider using a different Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708
Differential Revision: D13378954
Pulled By: riversand963
fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca
Summary:
…ons (#4676)"
This reverts commit b32d087dbb.
`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697
Differential Revision: D13133490
Pulled By: yiwu-arbug
fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
Differential Revision: D13146964
Pulled By: abhimadan
fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
Summary:
Since a range tombstone seen at one level will cover all keys
in the range at lower levels, there was a short-circuiting check in Get
that reported a key was not found at most one file after the range
tombstone was discovered. However, this was incorrect for merge
operands, since a deletion might only cover some merge operands,
which implies that the key should be found. This PR fixes this logic in
the Version portion of Get, and removes the logic from the MemTable
portion of Get, since the perforamnce benefit provided there is minimal.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698
Differential Revision: D13142484
Pulled By: abhimadan
fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10
Summary:
This is a quick fix for the uninitialized bugs in `LiveFileMetaData` and `SstFileMetaData` that were uncovered in #4686.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4693
Differential Revision: D13113189
Pulled By: ajkr
fbshipit-source-id: 18e798d031d2a59d0b55fc010c135e0126f4042d
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.
It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676
Differential Revision: D13047662
Pulled By: yiwu-arbug
fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
Summary:
he ratio of num_deletions to num_entries of a level can be useful to determine if a manual compaction needs to be triggered on a level.
Also refer #3980
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4623
Differential Revision: D13045744
Pulled By: sagar0
fbshipit-source-id: 71f3c8e363a8ffd194ec3bb0ed0b69612231f0b3
Summary:
this PR adds two more per-level perf context counters to track
* number of keys returned in Get call, break down by levels
* total processing time at each level during Get call
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617
Differential Revision: D12898024
Pulled By: miasantreble
fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
Differential Revision: D12934992
Pulled By: sagar0
fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
Summary:
Add unit tests to demonstrate that `VersionSet::Recover` is able to detect and handle cases in which the MANIFEST has valid atomic group, incomplete trailing atomic group, atomic group mixed with normal version edits and atomic group with incorrect size.
With this capability, RocksDB identifies non-valid groups of version edits and do not apply them, thus guaranteeing that the db is restored to a state consistent with the most recent successful atomic flush before applying WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4433
Differential Revision: D10079202
Pulled By: riversand963
fbshipit-source-id: a0e0b8bf4da1cf68e044d397588c121b66c68876
Summary:
Since the number of range deletions are reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.
This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594
Differential Revision: D12826893
Pulled By: abhimadan
fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found)
```
Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones? | avg micros/op | stddev micros/op | avg ops/s | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None | 0.6186 | 0.04637 | 1,625,252.90 | 124,679.41
500 Expanded | 0.6019 | 0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded | 0.6435 | 0.03994 | 1,559,979.40 | 104,090.52
1k Expanded | 0.6034 | 0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded | 0.6261 | 0.03093 | 1,600,457.50 | 79,024.94
5k Expanded | 0.6163 | 0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded | 0.6402 | 0.04002 | 1,567,804.70 | 100,965.55
10k Expanded | 0.6036 | 0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded | 0.6128 | 0.02598 | 1,634,633.40 | 72,161.82
25k Expanded | 0.6198 | 0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded | 0.5478 | 0.0362 | 1,833,059.10 | 121,233.81
50k Expanded | 0.5104 | 0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded | 0.4528 | 0.03387 | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493
Differential Revision: D10842844
Pulled By: abhimadan
fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.
In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```
...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```
The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.
Readrandom results before PR:
```
readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found)
```
Readrandom results after PR:
```
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
```
So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449
Differential Revision: D10370575
Pulled By: abhimadan
fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
Summary:
There was a bug that the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting user key but receiving internal key when called in GenerateBottommostFiles. The patch augment an existing unit test to reproduce the bug and fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575
Differential Revision: D10500434
Pulled By: maysamyabandeh
fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366
Summary:
Level compaction usually performs poorly when the writes so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in the write heavy cases, the level multiplier can be slightly adjusted based on the size of L0.
We keep the behavior the same if number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338
Differential Revision: D9636782
Pulled By: siying
fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595
Summary:
Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`.
Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394
Differential Revision: D9926508
Pulled By: riversand963
fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1
Summary:
When a CompactRange() call for a level is truncated before the end key
is reached, because it exceeds max_compaction_bytes, we need to properly
set the compaction_end parameter to indicate the stop key. The next
CompactRange will use that as the begin key. We set it to the smallest
key of the next file in the level after expanding inputs to get a clean
cut.
Previously, we were setting it before expanding inputs. So we could end
up recompacting some files. In a pathological case, where a single key
has many entries spanning all the files in the level (possibly due to
merge operands without a partial merge operator, thus resulting in
compaction output identical to the input), this would result in
an endless loop over the same set of files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496
Differential Revision: D10395026
Pulled By: anand1976
fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9
Summary:
Leverage existing `FlushJob` to implement atomic flush of multiple column families.
This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262
Differential Revision: D9283109
Pulled By: riversand963
fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933
Differential Revision: D10219571
Pulled By: riversand963
fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
Summary:
To more accurately truncate range tombstones at SST boundaries,
we now represent them in RangeDelAggregator using InternalKeys, which
are end-key-exclusive as they were before this change.
During compaction, "atomic compaction unit boundaries" (the range of
keys contained in neighbouring and overlaping SSTs) are propagated down
to RangeDelAggregator to truncate range tombstones at those boundariies
instead. See https://github.com/facebook/rocksdb/pull/4432#discussion_r221072219 and https://github.com/facebook/rocksdb/pull/4432#discussion_r221138683
for motivating examples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4432
Differential Revision: D10263952
Pulled By: abhimadan
fbshipit-source-id: 2fe85ff8a02b3a6a2de2edfe708012797a7bd579
Summary:
This fix is for `level == 0` in `GetOverlappingInputs()`:
- In `GetOverlappingInputs()`, if `level == 0`, it has potential
risk of overflow if `i == 0`.
- Optmize process when `expand = true`, the expected complexity
can be reduced to O(n).
Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4385
Differential Revision: D10181001
Pulled By: riversand963
fbshipit-source-id: 46eef8a1d1605c9329c164e6471cd5c5b6de16b5
Summary:
`FindFile()` and `FindFileInRange()` actually works as the same
of `std::lower_bound()`. Use `std::lower_bound()` to reduce the
repeated code.
- change `FindFile()` and `FindFileInRange()` to use `std::lower_bound()`
Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4372
Differential Revision: D9919677
Pulled By: ajkr
fbshipit-source-id: f74aaa30e2f80e410e299c5a5bca4eaf2a7a26de
Summary:
The code is dead in RocksDB as `log::Reader::initial_offset_` is always zero. We should delete it so we don't have to maintain it like in #4359.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4362
Differential Revision: D9817829
Pulled By: ajkr
fbshipit-source-id: 474a2c679e5bd273b40608f3a5332931d9eefe6d
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.
Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339
Differential Revision: D9654990
Pulled By: ajkr
fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039
Differential Revision: D8670178
Pulled By: riversand963
fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
Summary:
During recovery, RocksDB is able to handle version edits that belong to group commits.
This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3945
Differential Revision: D8529122
Pulled By: riversand963
fbshipit-source-id: 57cb0f9cc55ecca684a837742d6626dc9c07f37e
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172
Differential Revision: D8961465
Pulled By: riversand963
fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
Summary:
Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490
Differential Revision: D6969154
Pulled By: miasantreble
fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
Summary:
PR #3944 introduces group commit of `VersionEdit` in MANIFEST. The
implementation has a bug. When updating the log file number of each column
family, we must consider only `VersionEdit`s that operate on the same column
family. Otherwise, a column family may accidentally set its log file number
higher than actual value, indicating that log files with smaller file number
will be ignored, thus causing some updates to be lost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4157
Differential Revision: D8916650
Pulled By: riversand963
fbshipit-source-id: 8f456cf688f17bf35ad87b38e30e899aa162f201
Summary:
Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is so to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingFiles uses user-keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.
Note that this conflicts with
https://github.com/facebook/rocksdb/pull/2769 and cases
`DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to
fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050
Differential Revision: D8844423
Pulled By: ajkr
fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b
Summary:
clang analyze is giving the following warnings:
> db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null
} else if (meta->smallest.size() > 0) {
^~~~~~~~~~~~~~~~~~~~~
db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta')
meta->marked_for_compaction = sub_compact->builder->NeedCompact();
~~~~
db/version_set.cc:2770:26: warning: Called C++ object pointer is null
uint32_t cf_id = last_writer->cfd->GetID();
^~~~~~~~~~~~~~~~~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/4072
Differential Revision: D8685852
Pulled By: miasantreble
fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085
Summary:
This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Closes https://github.com/facebook/rocksdb/pull/3944
Differential Revision: D8432536
Pulled By: riversand963
fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb