Summary:
this PR adds two more per-level perf context counters to track
* number of keys returned in Get call, break down by levels
* total processing time at each level during Get call
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617
Differential Revision: D12898024
Pulled By: miasantreble
fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4
Summary:
Move the line `Add xxhash64 checksum support` to `Unreleased` section because it has not been released to 5.17.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4627
Differential Revision: D12944123
Pulled By: riversand963
fbshipit-source-id: 2762857065b9e741c64ff8b6116ca62fe31891d8
Summary:
When evicting an entry form the commit_cache, it is verified against the list of old snapshots to see if it overlaps with any. The list of old snapshots is split into two lists: an efficient concurrent cache and an slow vector protected by a lock. The patch fixes a bug that would stop the search in the cache if it finds any and yet would not include the larger snapshots in the slower list.
An extra info log entry is also removed. The condition to trigger that although very rare is still feasible and should not spam the LOG when that happens.
Fixes#4621
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4639
Differential Revision: D12934989
Pulled By: maysamyabandeh
fbshipit-source-id: 4e0fe8147ba292b554ae78e94c21c2ef31e03e2d
Summary:
This property can help debug why SST files aren't being deleted. Previously we only had the property "rocksdb.is-file-deletions-enabled". However, even when that returned true, obsolete SSTs may still not be deleted due to the coarse-grained mechanism we use to prevent newly created SSTs from being accidentally deleted. That coarse-grained mechanism uses a lower bound file number for SSTs that should not be deleted, and this property exposes that lower bound.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4618
Differential Revision: D12898179
Pulled By: ajkr
fbshipit-source-id: fe68acc041ddbcc9276bbd48976524d95aafc776
Summary:
Since the number of range deletions are reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.
This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594
Differential Revision: D12826893
Pulled By: abhimadan
fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
Summary:
Rename the interface, as it is mean to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590
Differential Revision: D10866340
Pulled By: yiwu-arbug
fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
Summary:
There was a bug that the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting user key but receiving internal key when called in GenerateBottommostFiles. The patch augment an existing unit test to reproduce the bug and fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575
Differential Revision: D10500434
Pulled By: maysamyabandeh
fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366
Summary:
Level compaction usually performs poorly when the writes so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in the write heavy cases, the level multiplier can be slightly adjusted based on the size of L0.
We keep the behavior the same if number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338
Differential Revision: D9636782
Pulled By: siying
fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595
Summary:
WriteBatchWithIndex's SeekForPrev() has a bug that we internally place the position just before the seek key rather than after. This makes the iterator to miss the result that is the same as the seek key. Fix it by position the iterator equal or smaller.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4559
Differential Revision: D10468534
Pulled By: siying
fbshipit-source-id: 2fb371ae809c561b60a1c11cef71e1c66fea1f19
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226
Differential Revision: D10369509
Pulled By: miasantreble
fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
Summary:
When a CompactRange() call for a level is truncated before the end key
is reached, because it exceeds max_compaction_bytes, we need to properly
set the compaction_end parameter to indicate the stop key. The next
CompactRange will use that as the begin key. We set it to the smallest
key of the next file in the level after expanding inputs to get a clean
cut.
Previously, we were setting it before expanding inputs. So we could end
up recompacting some files. In a pathological case, where a single key
has many entries spanning all the files in the level (possibly due to
merge operands without a partial merge operator, thus resulting in
compaction output identical to the input), this would result in
an endless loop over the same set of files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496
Differential Revision: D10395026
Pulled By: anand1976
fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9
Summary:
`CompactionIterator::snapshots_` is ordered by ascending seqnum, just like `DBImpl`'s linked list of snapshots from which it was copied. This PR exploits this ordering to make `findEarliestVisibleSnapshot` do binary search rather than linear scan. This can make flush/compaction significantly faster when many snapshots exist since that function is called on every single key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4495
Differential Revision: D10386470
Pulled By: ajkr
fbshipit-source-id: 29734991631227b6b7b677e156ac567690118a8b
Summary:
There is a bug when the write queue leader is blocked on a write
delay/stop, and the queue has writers with WriteOptions::no_slowdown set
to true. They are not woken up until the write stall is cleared.
The fix introduces a dummy writer inserted at the tail to indicate a
write stall and prevent further inserts into the queue, and a condition
variable that writers who can tolerate slowdown wait on before adding
themselves to the queue. The leader calls WriteThread::BeginWriteStall()
to add the dummy writer and then walk the queue to fail any writers with
no_slowdown set. Once the stall clears, the leader calls
WriteThread::EndWriteStall() to remove the dummy writer and signal the
condition variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475
Differential Revision: D10285827
Pulled By: anand1976
fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9
Summary:
The controller you requested could not be found. PTAL
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466
Differential Revision: D10241358
Pulled By: yiwu-arbug
fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e
Summary:
This is a conceptually simple change, but it touches many files to
pass the allocator through function calls.
We introduce CacheAllocator, which can be used by clients to configure
custom allocator for cache blocks. Our motivation is to hook this up
with folly's `JemallocNodumpAllocator`
(f43ce6d686/folly/experimental/JemallocNodumpAllocator.h),
but there are many other possible use cases.
Additionally, this commit cleans up memory allocation in
`util/compression.h`, making sure that all allocations are wrapped in a
unique_ptr as soon as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437
Differential Revision: D10132814
Pulled By: yiwu-arbug
fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564
Summary:
Previously `CompactFiles` with `CompressionType::kDisableCompressionOption` caused program to crash on assertion failure. This PR fixes the crash by adding support for that setting. Now, that setting will cause RocksDB to choose compression according to the column family's options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4438
Differential Revision: D10115761
Pulled By: ajkr
fbshipit-source-id: a553c6fa76fa5b6f73b0d165d95640da6f454122
Summary:
The default for index_block_restart_interval is 1 but some use 16 in production. The patch extends crash test to test both values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4383
Differential Revision: D9887304
Pulled By: maysamyabandeh
fbshipit-source-id: a8d00fea974a79ad563f9f4d9d7b069e9f746a8f
Summary:
TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346
Differential Revision: D9759149
Pulled By: maysamyabandeh
fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b
Summary:
Basically at the moment it seems it's possible to cause write stall by calling flush (either manually vis DB::Flush(), or from Backup Engine directly calling FlushMemTable() while background flush may be already happening.
One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to separate method and call it from FlushMemTable.
This is draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297
Differential Revision: D9420705
Pulled By: mikhail-antonov
fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc
Summary: For the CURRENT file forged during checkpoint, we were forgetting to `fsync` or `fdatasync` it after its creation. This PR fixes it.
Differential Revision: D9525939
Pulled By: ajkr
fbshipit-source-id: a505483644026ee3f501cfc0dcbe74832165b2e3
Summary:
I have a PR to start calling `OnTableFileCreated` for empty SSTs: #4307. However, it is a behavior change so should not go into a patch release.
This PR adds back a check to make sure range deletions at least exist before starting file creation. This PR should be safe to backport to earlier versions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4311
Differential Revision: D9493734
Pulled By: ajkr
fbshipit-source-id: f0d43cda4cfd904f133cfe3a6eb622f52a9ccbe8
Summary:
The API comment on `OnTableFileCreationStarted` (b6280d01f9/include/rocksdb/listener.h (L331-L333)) led users to believe a call to `OnTableFileCreationStarted` will always be matched with a call to `OnTableFileCreated`. However, we were skipping the `OnTableFileCreated` call in one case: no error happens but also no file is generated since there's no data.
This PR adds the call to `OnTableFileCreated` for that case. The filename will be "(nil)" and the size will be zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4307
Differential Revision: D9485201
Pulled By: ajkr
fbshipit-source-id: 2f077ec7913f128487aae2624c69a50762394df6
Summary:
ZSTD's dynamic library exports `ZDICT_trainFromBuffer` symbol since v1.1.3, and its static library exports it since v0.6.1. We don't know whether linkage is static or dynamic, so just require v1.1.3 to use dictionary trainer.
Fixes the issue reported here: https://jira.mariadb.org/browse/MDEV-16525.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4295
Differential Revision: D9417183
Pulled By: ajkr
fbshipit-source-id: 0e89d2f48d9e7f6eee73e7f4572660a9f7122db8
Summary:
Right now, `ldb idump` may have memory out of control if there is a big range of tombstones. Add an option to cut maxinum number of keys in GetAllKeyVersions(), and push down --max_num_ikeys from ldb.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4271
Differential Revision: D9369149
Pulled By: siying
fbshipit-source-id: 7cbb797b7d2fa16573495a7e84937456d3ff25bf
Summary:
This PR addresses issue #3865 and implements the following approach to fix it:
- adds `MergeContext::GetOperandsDirectionForward` and `MergeContext::GetOperandsDirectionBackward` to query merge operands in a specific order
- `MergeContext::GetOperands` becomes a shortcut for `MergeContext::GetOperandsDirectionForward`
- pass `MergeContext::GetOperandsDirectionBackward` to `MergeOperator::ShouldMerge` and document the order
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4266
Differential Revision: D9360750
Pulled By: sagar0
fbshipit-source-id: 20cb73ff017760b062ecdcf4382560767086e092
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`
The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174
Differential Revision: D8965914
Pulled By: fgwu
fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
Summary:
A framework of trace analyzing for RocksDB
After collecting the trace by using the tool of [PR #3837](https://github.com/facebook/rocksdb/pull/3837). User can use the Trace Analyzer to interpret, analyze, and characterize the collected workload.
**Input:**
1. trace file
2. Whole keys space file
**Statistics:**
1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family.
2. Key hotness (access count) of each one
3. Key space separation based on given prefix
4. Key size distribution
5. Value size distribution if appliable
6. Top K accessed keys
7. QPS statistics including the average QPS and peak QPS
8. Top K accessed prefix
9. The query correlation analyzing, output the number of X after Y and the corresponding average time
intervals
**Output:**
1. key access heat map (either in the accessed key space or whole key space)
2. trace sequence file (interpret the raw trace file to line base text file for future use)
3. Time serial (The key space ID and its access time)
4. Key access count distritbution
5. Key size distribution
6. Value size distribution (in each intervals)
7. whole key space separation by the prefix
8. Accessed key space separation by the prefix
9. QPS of each operation and each column family
10. Top K QPS and their accessed prefix range
**Test:**
1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge
2. Generated the trace and analyze the trace
**Implemented but not tested (due to the limitation of trace_replay):**
1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing
2. Analyzing the number of Key found by Get
**Future Work:**
1. Support execution time analyzing of each requests
2. Support cache hit situation and block read situation of Get
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091
Differential Revision: D9256157
Pulled By: zhichao-cao
fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6
Summary:
After refactoring in https://github.com/facebook/rocksdb/pull/4158 the properties block is written after the index block. This breaks the existing logic in estimating the index size in partitioned indexes. The patch fixes that by using the accurate index block size, which is available since by the time we write the properties block, the index block is already written.
The patch also fixes an issue in estimating the partition size with format_version=3 which was resulting into partitions smaller than the configured metadata_block_size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4259
Differential Revision: D9274454
Pulled By: maysamyabandeh
fbshipit-source-id: c82d045505cca3e7ed1a44ee1eaa26e4f25a4272
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4.
The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585 rocksdb read-only range=100
Format 3:
19569 rocksdb read-only range=100
Format 4:
19352 rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
Differential Revision: D8361343
Pulled By: maysamyabandeh
fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
Summary:
Given that we have cut 5.15, we should bump the version number to the next
version, i.e. 5.16.
Also update HISTORY.md
cc sagar0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4176
Differential Revision: D8977965
Pulled By: riversand963
fbshipit-source-id: 481d75d2f446946f0eb2afb7e94ef894c8c87e1e
Summary:
Fix the issue when pipelined write is enabled, writers can get stuck indefinitely and not able to finish the write. It can show with the following example: Assume there are 4 writers W1, W2, W3, W4 (W1 is the first, W4 is the last).
T1: all writers pending in WAL writer queue:
WAL writer queue: W1, W2, W3, W4
memtable writer queue: empty
T2. W1 finish WAL writer and move to memtable writer queue:
WAL writer queue: W2, W3, W4,
memtable writer queue: W1
T3. W2 and W3 finish WAL write as a batch group. W2 enter ExitAsBatchGroupLeader and move the group to memtable writer queue, but before wake up next leader.
WAL writer queue: W4
memtable writer queue: W1, W2, W3
T4. W1, W2, W3 finish memtable write as a batch group. Note that W2 still in the previous ExitAsBatchGroupLeader, although W1 have done memtable write for W2.
WAL writer queue: W4
memtable writer queue: empty
T5. The thread corresponding to W3 create another writer W3' with the same address as W3.
WAL writer queue: W4, W3'
memtable writer queue: empty
T6. W2 continue with ExitAsBatchGroupLeader. Because the address of W3' is the same as W3, the last writer in its group, it thinks there are no pending writers, so it reset newest_writer_ to null, emptying the queue. W4 and W3' are deleted from the queue and will never be wake up.
The issue exists since pipelined write was introduced in 5.5.0.
Closes#3704
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4143
Differential Revision: D8871599
Pulled By: yiwu-arbug
fbshipit-source-id: 3502674e51066a954a0660257e24ac588f815e2a
Summary:
Now by default, with NewSstFileManager, checkpoints may be corrupted. Disable this feature to avoid this issue.
Closes https://github.com/facebook/rocksdb/pull/4092
Differential Revision: D8729856
Pulled By: siying
fbshipit-source-id: 914c321d6eaf52d8c5981171322d85dd29088307
Summary:
…ression
For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts
Closes https://github.com/facebook/rocksdb/pull/3985
Reviewed By: riversand963
Differential Revision: D8385911
Pulled By: zhichao-cao
fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053
Differential Revision: D8662546
Pulled By: maysamyabandeh
fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.
/cc ajkr nvanbenschoten
Closes https://github.com/facebook/rocksdb/pull/4016
Differential Revision: D8527575
Pulled By: ajkr
fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
Summary:
Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037
Differential Revision: D8596218
Pulled By: maysamyabandeh
fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.
**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.
**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure not to re-read partial chunks of data that were already available in the buffer, from device again.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.
**Constraints:**
- Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.
**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```
```
Before:
seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement. Taken on an average of 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884
Differential Revision: D8082143
Pulled By: sagar0
fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
Summary:
https://github.com/facebook/rocksdb/pull/3764 introduced an optimization feature to skip duplicate prefix entires in full bloom filters. Unfortunately it also introduces a bug in partitioned full filters, where the duplicate prefix should still be inserted if it is in a new partition. The patch fixes the bug by resetting the duplicate detection logic each time a partition is cut.
This bug could result into false negatives, which means that DB could skip an existing key.
Closes https://github.com/facebook/rocksdb/pull/4024
Differential Revision: D8518866
Pulled By: maysamyabandeh
fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).
This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881
Differential Revision: D8076288
Pulled By: ajkr
fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
Summary:
Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
Closes https://github.com/facebook/rocksdb/pull/3894
Differential Revision: D8118775
Pulled By: maysamyabandeh
fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
Summary:
Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache.
Closes https://github.com/facebook/rocksdb/pull/3877
Differential Revision: D8100895
Pulled By: yiwu-arbug
fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da
Summary:
Previously we were using -1 as the default for every library, which was legacy from our zlib options. That worked for a while, but after zstd introduced a146ee04ae, it started giving poor compression ratios by default in zstd.
This PR adds a constant to RocksDB public API, `CompressionOptions::kDefaultCompressionLevel`, which will get translated to the default value specific to the compression library being used in "util/compression.h". The constant uses a number that appears to be larger than any library's maximum compression level.
Closes https://github.com/facebook/rocksdb/pull/3895
Differential Revision: D8125780
Pulled By: ajkr
fbshipit-source-id: 2db157a89118cd4f94577c2f4a0a5ff31c8391c6
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601
Differential Revision: D7253114
Pulled By: miasantreble
fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
Summary:
Previously it only printed percentiles, even though our histogram keeps track of count and sum (and more). There have been many times we want to know more than the percentiles. For example, we currently want sum of "rocksdb.compression.times.nanos" and sum of "rocksdb.decompression.times.nanos", which would allow us to know the relative cost of compression vs decompression.
This PR adds count and sum to the string printed by `StatisticsImpl::ToString`. This is a bit risky as there are definitely parsers assuming the old format. I will mention it in HISTORY.md and hope for the best...
Closes https://github.com/facebook/rocksdb/pull/3863
Differential Revision: D8038831
Pulled By: ajkr
fbshipit-source-id: 0465b72e4b0cbf18ef965f4efe402601d16d5b5c
Summary:
Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the the first key is larger than the second.
Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty.
Closes https://github.com/facebook/rocksdb/pull/3836
Differential Revision: D7959762
Pulled By: siying
fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
This feature was introduced for universal compaction in cc01985d. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level.
Closes https://github.com/facebook/rocksdb/pull/3835
Differential Revision: D7957179
Pulled By: ajkr
fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.
This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829
Differential Revision: D7915443
Pulled By: ajkr
fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
Summary:
The only use of RandomRW is to change seqno when bulkloading, and in this use case, the file should exist. We should fail the file opening in this case.
Closes https://github.com/facebook/rocksdb/pull/3827
Differential Revision: D7913719
Pulled By: siying
fbshipit-source-id: 62cf6734f1a6acb9e14f715b927da388131c3492
Summary:
sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want.
Closes https://github.com/facebook/rocksdb/pull/3767
Differential Revision: D7760136
Pulled By: siying
fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f
Summary:
Background activities like compaction can negatively affect
latency of higher-priority tasks like request processing. To avoid this,
rocksdb already lowers the IO priority of background threads on Linux
systems. While this takes care of typical IO-bound systems, it does not
help much when CPU (temporarily) becomes the bottleneck. This is
especially likely when using more expensive compression settings.
This patch adds an API to allow for lowering the CPU priority of
background threads, modeled on the IO priority API. Benchmarks (see
below) show significant latency and throughput improvements when CPU
bound. As a result, workloads with some CPU usage bursts should benefit
from lower latencies at a given utilization, or should be able to push
utilization higher at a given request latency target.
A useful side effect is that compaction CPU usage is now easily visible
in common tools, allowing for an easier estimation of the contribution
of compaction vs. request processing threads.
As with IO priority, the implementation is limited to Linux, degrading
to a no-op on other systems.
Closes https://github.com/facebook/rocksdb/pull/3763
Differential Revision: D7740096
Pulled By: gwicke
fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c
Summary:
There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.
These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
Closes https://github.com/facebook/rocksdb/pull/3602
Differential Revision: D7251562
Pulled By: al13n321
fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d
Summary:
1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
number of keys successfully read
2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
DBImpl::GetImpl(), but not MultiGet
Closes https://github.com/facebook/rocksdb/pull/3730
Differential Revision: D7677364
Pulled By: anand1976
fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5
Summary:
Previously threads were named "rocksdb:bg\<index in thread pool\>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to bottom-level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority.
Closes https://github.com/facebook/rocksdb/pull/3702
Differential Revision: D7581415
Pulled By: ajkr
fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1
Summary:
This is a hack as temporary fix of MyRocks with rollbacking the merge operands. The way MyRocks uses merge operands is without protection of locks, which violates the assumption behind the rollback algorithm. They are ok with not being rolled back as it would just create a gap in the autoincrement column. The hack add an option to disable the rollback of merge operands by default and only enables it to let the unit test pass.
Closes https://github.com/facebook/rocksdb/pull/3711
Differential Revision: D7597177
Pulled By: maysamyabandeh
fbshipit-source-id: 544be0f666c7e7abb7f651ec8b23124e05056728
Summary:
The existing unit test did not set the level so the check for pinned partitioned filter/index being properly released from the block cache was not properly exercised as they only take effect in level 0. As a result a memory leak in pinned partitioned filters was hidden. The patch fix the test as well as the bug.
Closes https://github.com/facebook/rocksdb/pull/3692
Differential Revision: D7559763
Pulled By: maysamyabandeh
fbshipit-source-id: 55eff274945838af983c764a7d71e8daff092e4a
Summary:
When `max_valid_backups_to_open` is set, the `BackupEngine` doesn't know about the files referenced by existing backups. This PR prevents us from deleting valid files when that option is set, in cases where we are unable to accurately determine refcount. There are warnings logged when we may miss deleting unreferenced files, and a recommendation in the header for users to periodically unset this option and run a full `GarbageCollect`.
Closes https://github.com/facebook/rocksdb/pull/3518
Differential Revision: D7008331
Pulled By: ajkr
fbshipit-source-id: 87907f964dc9716e229d08636a895d2fc7b72305
Summary:
Level Compaction with TTL.
As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
This functionality can be controlled by the newly introduced column family option -- ttl.
TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591
Differential Revision: D7275442
Pulled By: sagar0
fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502
Differential Revision: D7400897
Pulled By: anand1976
fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
Summary:
Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if two_write_queues_ option is set. The patch fixes the problem by i) skip log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, ii) protects log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not.
Fixes#3599
Closes https://github.com/facebook/rocksdb/pull/3656
Differential Revision: D7405608
Pulled By: maysamyabandeh
fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934
Summary:
We have not been updating our HISTORY.md change log with the RocksJava changes. Going forward, lets add Java changes also to HISTORY.md.
There is an old java/HISTORY-JAVA.md, but it hasn't been updated in years. It is much easier to remember to update the change log in a single file, HISTORY.md.
I added information about shared block cache here, which was introduced in #3623.
Closes https://github.com/facebook/rocksdb/pull/3647
Differential Revision: D7384448
Pulled By: sagar0
fbshipit-source-id: 9b6e569f44e6df5cb7ba06413d9975df0b517d20
Summary:
Fsync after writing global sequence number to the ingestion file in ExternalSstFileIngestionJob. Otherwise the file metadata could be incorrect.
Closes https://github.com/facebook/rocksdb/pull/3644
Differential Revision: D7373813
Pulled By: sagar0
fbshipit-source-id: 4da2c9e71a8beb5c08b4ac955f288ee1576358b8
Summary:
Add `bytes_max_delete_chunk` in SstFileManager so that we can drop a large file in multiple batches.
Closes https://github.com/facebook/rocksdb/pull/3640
Differential Revision: D7358679
Pulled By: siying
fbshipit-source-id: ef17f0da2f5723dbece2669485a9b91b3edc0bb7
Summary:
`Writer::WriteBuffer` was always called at the beginning of checkpoint/backup. But that log writer has no internal synchronization, which meant the same buffer could be flushed twice in a race condition case, causing a WAL entry to be duplicated. Then subsequent WAL entries would be at unexpected offsets, causing the 32KB block boundaries to be overlapped and manifesting as a corruption.
This PR fixes the behavior to only use `WriteBuffer` (via `FlushWAL`) in checkpoint/backup when manual WAL flush is enabled. In that case, users are responsible for providing synchronization between WAL flushes. We can also consider removing the call entirely.
Closes https://github.com/facebook/rocksdb/pull/3603
Differential Revision: D7277447
Pulled By: ajkr
fbshipit-source-id: 1b15bd7fd930511222b075418c10de0aaa70a35a
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.
Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449
Differential Revision: D7016941
Pulled By: amytai
fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
Summary:
This is the simplest way I could think of to speed up `CompactRange`. It works but isn't that optimal because it relies on the same `max_compaction_bytes` and `max_subcompactions` options that are used in other places. If it turns out to be useful we can allow overriding these in `CompactRangeOptions` in the future.
Closes https://github.com/facebook/rocksdb/pull/3549
Differential Revision: D7117634
Pulled By: ajkr
fbshipit-source-id: d0cd03d6bd0d2fd7ea3fb13cd3b8bf7c47d11e42
Summary:
Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.
CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551
Differential Revision: D7130190
Pulled By: yiwu-arbug
fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
Summary:
The zeroed entries were not removed from prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes that by refactoring the code. The new code is more efficient since i) it uses two separate mutex to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining uniq log entires with prepare which avoids a very large heap with many duplicate entries.
Closes https://github.com/facebook/rocksdb/pull/3545
Differential Revision: D7106071
Pulled By: maysamyabandeh
fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548
Differential Revision: D7116939
Pulled By: yiwu-arbug
fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.
- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520
Differential Revision: D7018897
Pulled By: ajkr
fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
Summary:
RocksDB should always be able to parse an option file generated using the same or lower version. Unknown option should only happen if it is from a higher version. Change the behavior of RocksDBOptionsParser::Parse()'s behavior with ignore_unknown_options=true so that unknown option from a lower or the same version will never be skipped.
Closes https://github.com/facebook/rocksdb/pull/3527
Differential Revision: D7048851
Pulled By: siying
fbshipit-source-id: e261caea12f6515611a4a29f39acf2b619df2361
Summary:
- made `CreateCheckpoint` properly return `InvalidArgument` when called with an empty directory. Previously it triggered an assertion failure due to a bug in the logic.
- made `ldb` set empty `checkpoint_dir` if that's what the user specifies, so that we can use it to properly test `CreateCheckpoint` in the future.
Differential Revision: D6874562
fbshipit-source-id: dcc1bd41768261d9338987fa7711444289707ed7
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381
Differential Revision: D6754983
Pulled By: ajkr
fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
Summary:
There are a couple of places where we swallow any error from
WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate
the error up in those cases rather than ignoring it.
Closes https://github.com/facebook/rocksdb/pull/3404
Differential Revision: D6879954
Pulled By: anand1976
fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.
This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.
Close https://github.com/facebook/rocksdb/issues/2951
Closes https://github.com/facebook/rocksdb/pull/3431
Differential Revision: D6849228
Pulled By: ajkr
fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
Summary:
This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead and prefetching additional data on each disk IO. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.
Constraints:
- The prefetched data is stored by the OS in page cache. So this currently works only for non direct-reads use-cases i.e applications which use page cache. (Direct-I/O support will be enabled in a later PR).
- This gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
Thanks to siying for the original idea and implementation.
**Benchmarks:**
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```
Page cache was cleared before each experiment with the command:
```
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
```
```
Before:
seekrandom : 34020.945 micros/op 29 ops/sec; 32.5 MB/s (1636 of 1999 found)
With this change:
seekrandom : 8726.912 micros/op 114 ops/sec; 126.8 MB/s (5702 of 6999 found)
```
~3.9X performance improvement.
Also verified with strace and gdb that the readahead size is increasing as expected.
```
strace -e readahead -f -T -t -p <db_bench process pid>
```
Closes https://github.com/facebook/rocksdb/pull/3282
Differential Revision: D6586477
Pulled By: sagar0
fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c
Summary:
To prevent user who implement the Iterator interface fail to implement SeekForPrev by mistake.
Closes https://github.com/facebook/rocksdb/pull/3402
Differential Revision: D6790681
Pulled By: yiwu-arbug
fbshipit-source-id: bd75b8ced30208982e0a1414d34384d93496827a
Summary:
Flush() call could be waiting indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence:
1. User call Flush() with flush_options.wait = true
2. The manual flush started in the background
3. New memtable become immutable because of writes. The new memtable will not trigger flush if min_write_buffer_number_to_merge is not reached.
4. The manual flush finish.
Because of the new memtable created at step 3 not being flush, previous logic of WaitForFlushMemTable() keep waiting, despite the memtables it intent to flush has been flushed.
Here instead of checking if there are any more memtables to flush, WaitForFlushMemTable() also check the id of the earliest memtable. If the id is larger than that of latest memtable at the time flush was initiated, it means all the memtable at the time of flush start has all been flush.
Closes https://github.com/facebook/rocksdb/pull/3378
Differential Revision: D6746789
Pulled By: yiwu-arbug
fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa
Summary:
When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken.
We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the content of DB and archive directories, which nobody can delete (except for deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled.
Closes https://github.com/facebook/rocksdb/pull/3341
Differential Revision: D6681131
Pulled By: ajkr
fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634
Summary:
**# Summary**
RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does it in single way fashion (not pipelined on single CPU core). Intel's whitepaper () published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then use pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead on its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB call crc32c on is over 4KB. Initial db_bench show tremendous CPU gain.
This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm would be used. If the server does not have SSE4.2 at the run time the slow way (Non SSE) will be used.
**# Performance Test Results**
We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm.
Before this change the CRC32 value computation takes about 11.5% of total CPU cost, and with the new 3-way algorithm it reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s.
1) ReadRandom in db_bench overall metrics
PER RUN
Algorithm | run | micros/op | ops/sec |Throughput (MB/s)
3-way | 1 | 4.143 | 241387 | 26.7
3-way | 2 | 3.775 | 264872 | 29.3
3-way | 3 | 4.116 | 242929 | 26.9
FastCrc32c|1 | 4.037 | 247727 | 27.4
FastCrc32c|2 | 4.648 | 215166 | 23.8
FastCrc32c|3 | 4.352 | 229799 | 25.4
AVG
Algorithm | Average of micros/op | Average of ops/sec | Average of Throughput (MB/s)
3-way | 4.01 | 249,729 | 27.63
FastCrc32c | 4.35 | 230,897 | 25.53
2) Crc32c computation CPU cost (inclusive samples percentage)
PER RUN
Implementation | run | TotalSamples | Crc32c percentage
3-way | 1 | 4,572,250,000 | 4.37%
3-way | 2 | 3,779,250,000 | 4.62%
3-way | 3 | 4,129,500,000 | 4.48%
FastCrc32c | 1 | 4,663,500,000 | 11.24%
FastCrc32c | 2 | 4,047,500,000 | 12.34%
FastCrc32c | 3 | 4,366,750,000 | 11.68%
**# Test Plan**
make -j64 corruption_test && ./corruption_test
By default it uses 3-way SSE algorithm
NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test
make clean && DEBUG_LEVEL=0 make -j64 db_bench
make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench
Closes https://github.com/facebook/rocksdb/pull/3173
Differential Revision: D6330882
Pulled By: yingsu00
fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9
Summary:
I browsed through the history since 5.9 was released and found some changes worth mentioning in HISTORY.md.
Closes https://github.com/facebook/rocksdb/pull/3237
Differential Revision: D6506472
Pulled By: ajkr
fbshipit-source-id: 627ce9f94ca33df9f0f231a9c5ced3624b05506c
Summary:
Since #1665, on merge error, iterator will be set to corrupted status, but it doesn't invalidate the iterator. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3226
Differential Revision: D6499094
Pulled By: yiwu-arbug
fbshipit-source-id: 80222930f949e31f90a6feaa37ddc3529b510d2c