Summary:
It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
All tests must pass.
We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
```
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
```
Repeat for 6 times for both versions.
Results are as follows:
```
| | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
Differential Revision: D15132946
Pulled By: riversand963
fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
Summary:
With this commit, RocksDB secondary instance respects atomic groups in version edits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5411
Differential Revision: D15617512
Pulled By: HaoyuHuang
fbshipit-source-id: 913f4ede391d772dcaf5649e3cd2099fa292d120
Summary:
Flush/compaction use `MergeUntil` which has a special code path to
handle a merge ending with a non-`Merge` point key. In particular if
that key is a `Put` we forgot to check whether it is covered by a range
tombstone. If it is covered then we must not include it in the following call
to `TimedFullMerge`.
Fixes#5392.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5406
Differential Revision: D15611144
Pulled By: sagar0
fbshipit-source-id: ba6a7863ca2d043f591de78fd0c4f4561f0c500e
Summary:
The PR #5275 separated the column dropped and shutdown status codes. However, there were a couple of places in compaction where this change ended up treating a ShutdownInProgress() error as a real error and set bg_error. This caused MyRocks unit test to fail due to WAL writes during shutdown returning this error. Fix it by ignoring the shutdown status during compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5400
Differential Revision: D15611680
Pulled By: anand1976
fbshipit-source-id: c602e97840e3ae24eb420d61e0ce95d3e6258632
Summary:
Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368
Differential Revision: D15540101
Pulled By: maysamyabandeh
fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4
Summary:
util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376
Differential Revision: D15550938
Pulled By: siying
fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387
Differential Revision: D15579036
Pulled By: siying
fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
Summary:
In order to improve code readability, this PR moves LevelCompactionBuilder and LevelCompactionPicker to compaction_picker_level.h and .cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5369
Differential Revision: D15540172
Pulled By: miasantreble
fbshipit-source-id: c1a578b93f127cd63661b53f32b356e6edd349af
Summary:
The methods and fields in the private section of DBImpl were all intermingled, making it hard to figure out where the fields/methods start and where they end. I cleaned up the code a little so that all the type declaration are at the beginning, followed by methods, and all the data fields are at the end. This follows
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5385
Differential Revision: D15566978
Pulled By: sagar0
fbshipit-source-id: 4618a7d819ad4e2d7cc9ae1af2c59f400140bb1b
Summary:
1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables.
2. Add support for benchmarking secondary instance to db_bench_tool.
With changes made in this PR, we can start benchmarking secondary instance
using two processes. It is also possible to vary the frequency at which the
secondary instance tries to catch up with the primary. The info log of the
secondary can be found in a directory whose path can be specified with
'-secondary_path'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170
Differential Revision: D15564608
Pulled By: riversand963
fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8
Summary:
Fix flaky DBTest2.PresetCompressionDict test.
This PR fixes two issues with the test:
1. Replaces `GetSstFiles` with `TotalSize`, which is based on `DB::GetColumnFamilyMetaData` so that only the size of the live SST files is taken into consideration when computing the total size of all sst files. Earlier, with `GetSstFiles`, even obsolete files were getting picked up.
1. In ZSTD compression, it is sometimes possible that using a trained dictionary is not better than using an untrained one. Using a trained dictionary performs well in 99% of the cases, but still in the remaining ~1% of the cases (out of 10000 runs) using an untrained dictionary gets better compression results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5378
Differential Revision: D15559100
Pulled By: sagar0
fbshipit-source-id: c35adbf13871f520a2cec48f8bad9ff27ff7a0b4
Summary:
Currently, when the block cache is used for index blocks as well, it is
not really the index block that is stored in the cache but an
IndexReader object. Since this object is not pure data (it has, for
instance, pointers that might dangle), it's not really sharable. To
avoid the issues around this, the current code uses a dummy unique cache
key for each TableReader to store the IndexReader, and erases the
IndexReader entry when the TableReader is closed. Instead of doing this,
the new code moves the IndexReader out of the cache altogether. In
particular, instead of the TableReader owning, or caching/pinning the
IndexReader based on the customer's settings, the TableReader
unconditionally owns the IndexReader, which in turn owns/caches/pins
the index block (which is itself sharable and thus can be safely put in
the cache without any hacks).
Note: the change has two side effects:
1) Partitions of partitioned indexes no longer affect the read
amplification statistics.
2) Eviction statistics for index blocks are temporarily broken. We plan to fix
this in a separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298
Differential Revision: D15303203
Pulled By: ltamasi
fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
By increasing the ratio, we ensure that all files go through background deletion and eliminate flakiness due to timing of deletions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5366
Differential Revision: D15549992
Pulled By: anand1976
fbshipit-source-id: d137375cd791fc1a802841412755d6e2b8fd7688
Summary:
When dynamically setting options, we check the option type info and skip options that are marked deprecated. However this check is only done at top level, which results in bugs where SetOptions will corrupt option values and cause unexpected system behavior iff a deprecated second level option is set dynamically.
For exmaple, the following call:
```
dbfull()->SetOptions(
{{"compaction_options_fifo",
"{allow_compaction=true;max_table_files_size=1024;ttl=731;}"}});
```
was from pre 6.0 release when `ttl` was part of `compaction_options_fifo`. Now that it got moved out of `compaction_options_fifo`, this call will incorrectly set `compaction_options_fifo.max_table_files_size` to 731 (as `max_table_files_size` is the first one in `OptionsHelper::fifo_compaction_options_type_info` struct) and cause files to gett evicted much faster than expected.
This PR adds verification to second level options like `compaction_options_fifo.ttl` or `compaction_options_fifo.max_table_files_size` when set dynamically, and filter out those marked as deprecated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5367
Differential Revision: D15530998
Pulled By: miasantreble
fbshipit-source-id: 818258be5c3abe09cd82d62f3c083572d70fecdd
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375
Differential Revision: D15550935
Pulled By: siying
fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
Summary:
Add some comments in db_impl.h. Also reordered function order a little bit so that I can add a comment to flag the area of functions implementing DB interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5338
Differential Revision: D15498284
Pulled By: siying
fbshipit-source-id: 3d7c59c8303577fe44d13c74ae84c7ce05164f77
Summary:
RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard link. Currently RocksDB assumes that if the link fails, the user is willing to perform file copy, which is not true according to the post. This commit provides an option named 'failed_move_fall_back_to_copy' for users to choose which behavior they want.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333
Differential Revision: D15457597
Pulled By: HaoyuHuang
fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214
Summary:
Add file comment in db/db_iter.h and minor changes in other parts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5340
Differential Revision: D15484605
Pulled By: siying
fbshipit-source-id: 173771f9d5bd51303de5410ee5afd0a4af9d6572
Summary:
Modified FindIntraL0Compaction to stop picking more files if total
amount of compensated bytes would be larger than max_compaction_bytes
option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5329
Differential Revision: D15435728
Pulled By: ThomasFersch
fbshipit-source-id: d118a6da88d5df8ee20944422ade37cf6b15d60c
Summary:
In version_set.cc, there is a function GetCurrentManifestPath. The goal of this task is to refactor ListColumnFamilies function so that ListColumnFamilies calls GetCurrentManifestPath to search for MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5331
Differential Revision: D15444524
Pulled By: HaoyuHuang
fbshipit-source-id: 1dcbd030bc0f2e835695741f450bba150f2f2903
Summary:
Right now, in log writer, we call flush after writing each physical record. I don't see the necessarity of it. Right now, the underlying writer has a buffer, so there isn't a concern that the write request is too large either. On the other hand, in an Env where every flush is expensive, the current approach is significantly slower than only flushing after a whole record finishes, when the record is very large.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5328
Differential Revision: D15425032
Pulled By: siying
fbshipit-source-id: 440ebef002dfbb60c59d8388c9ddfc83d79700aa
Summary:
WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes.
The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313
Differential Revision: D15379977
Pulled By: maysamyabandeh
fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
Summary:
RocksDB secondary can replay both MANIFEST and WAL now.
On the one hand, the memory usage by memtables will grow after replaying WAL for sometime. On the other hand, replaying the MANIFEST can bring the database persistent data to a more recent point in time, giving us the opportunity to discard some memtables containing out-dated data.
This PR coordinates the MANIFEST and WAL replay, using the updates from MANIFEST replay to update the active memtable and immutable memtable list of each column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5305
Differential Revision: D15386512
Pulled By: riversand963
fbshipit-source-id: a3ea6fc415f8382d8cf624f52a71ebdcffa3e355
Summary:
Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111
Differential Revision: D15330061
Pulled By: siying
fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024
Summary:
RangeDelAggregator::StripeRep::Invalidate() clears up several vectors. If we know there isn't anything to there, we can safe these small CPUs. Profiling shows that it sometimes take non-negligible amount of CPU. Worth a small optimization.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5312
Differential Revision: D15380511
Pulled By: siying
fbshipit-source-id: 53c5f34c33b4cb1e743643c6086ac56d0b84ec2e
Summary:
If DB is opened with `avoid_unnecessary_blocking_io` being true, then `~ColumnFamilyHandleImpl` enqueues a purge request and schedules a background thread to perform the deletion. Without test sync point, whether the SST file is purged or not at a later point in time is not deterministic. If the SST does not exist, it will cause an assertion failure.
How to reproduce:
```
$git checkout 6492430eaf
$make -j20 deletefile_test
$gtest-parallel --repeat 1000 --worker 16 ./deletefile_test --gtest_filter=DeleteFileTest.BackgroundPurgeCFDropTest
```
The test may fail a few times.
With changes made in this PR, repeat the above commands, and the test should not fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5310
Differential Revision: D15361136
Pulled By: riversand963
fbshipit-source-id: c4308d5f8da83472c893bf7f8ceed347fbfa850f
Summary:
The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306
Differential Revision: D15341458
Pulled By: maysamyabandeh
fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323
Summary:
The test was not using separate MemTablePostProcessInfo per memetable insert thread and thus tsan was complaining about data race.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5308
Differential Revision: D15356420
Pulled By: maysamyabandeh
fbshipit-source-id: 46c2f2d19fb02c3c775b587aa09ca9c0dae6ed04
Summary:
This PR has two fixes for crash test failures -
1. Fix a bug in TestMultiGet() in db_stress that was passing list of key to MultiGet() in the wrong order, thus ensuring that actual values don't match expected values
2. Remove an incorrect assertion in FilePickerMultiGet::GetNextFileInLevelWithKeys() that checks that files in a level are in sorted order. This is not true with MultiGet(), especially if there are duplicate keys and we may have to go back one file for the next key. Furthermore, this assertion makes more sense when a new version is created, rather than at lookup time
Test -
asan_crash and ubsan_crash tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5301
Differential Revision: D15337383
Pulled By: anand1976
fbshipit-source-id: 35092cb15bbc1700e5e823cbe07bfa62f1e9e6c6
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.
Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
Differential Revision: D15219029
Pulled By: maysamyabandeh
fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
Summary:
Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause correctness issue. The reason is as follows.
Assume there are more bg flush threads. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make the matter worse, bg_flush_thr2 can clear `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong.
The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used **or** the memtable list does not have unpicked memtables.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294
Differential Revision: D15295297
Pulled By: riversand963
fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd
Summary:
There were no C bindings for lowering thread pool priority. This adds those.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5285
Differential Revision: D15290050
Pulled By: siying
fbshipit-source-id: b2ed94d0c39d27434ace2204829a242b53d0d67a
Summary:
When reseek happens in merging iterator, reseeking a child iterator can be avoided if:
(1) the iterator represents imutable data
(2) reseek() to a larger key than the current key
(3) the current key of the child iterator is larger than the seek key
because it is guaranteed that the result will fall into the same position.
This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286
Differential Revision: D15283635
Pulled By: siying
fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83
Summary:
This PR fixes a couple of bugs in FilePickerMultiGet that were causing db_stress test failures. The failures were caused by -
1. Improper handling of a key that matches the user key portion of an L0 file's largest key. In this case, the curr_index_in_curr_level file index in L0 for that key was getting incremented, but batch_iter_ was not advanced. By design, all keys in a batch are supposed to be checked against an L0 file before advancing to the next L0 file. Not advancing to the next key in the batch was causing a double increment of curr_index_in_curr_level due to the same key being processed again
2. Improper handling of a key that matches the user key portion of the largest key in the last file of L1 and higher. This was resulting in a premature end to the processing of the batch for that level when the next key in the batch is a duplicate. Typically, the keys in MultiGet will not be duplicates, but its good to handle that case correctly
Test -
asan_crash
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5292
Differential Revision: D15282530
Pulled By: anand1976
fbshipit-source-id: d1a6a86e0af273169c3632db22a44d79c66a581f
Summary:
Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden to the user. However, if previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number.
We could extend it from seqnum 0 case to simply prev_seqno >= current_seqno. However, it is less robust with bug or unexpected situations, while the gain is relatively low. We can always extend it later when needed.
In a readseq benchmark with full formed LSM-tree, number of key comparisons called is reduced from 2.981 to 2.165. readseq against a fully compacted DB, no key comparison is called. Performance in this benchmark didn't show obvious improvement, which is expected because key comparisons only takes small percentage of CPU. But it may show up to be more effective if users have an expensive customized comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244
Differential Revision: D15067257
Pulled By: siying
fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9
Summary:
Previously in PR https://github.com/facebook/rocksdb/pull/5161 we have added the capability to do WAL tailing in `OpenAsSecondary`, in this PR we extend such feature to `TryCatchUpWithPrimary` which is useful for an secondary RocksDB instance to retrieve and apply the latest updates and refresh log readers if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5282
Differential Revision: D15261011
Pulled By: miasantreble
fbshipit-source-id: a15c94471e8c3b3b1f7f47c3135db1126e936949
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list.
This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
Differential Revision: D15203291
Pulled By: maysamyabandeh
fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
Summary:
This PR fixes three memory issues found by ASAN
* in db_stress, the key vector for MultiGet is created using `emplace_back` which could potentially invalidates references to the underlying storage (vector<string>) due to auto resizing. Fix by calling reserve in advance.
* Similar issue in construction of GetContext autovector in version_set.cc
* In multiget_context.h use T[] specialization for unique_ptr that holds a char array
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279
Differential Revision: D15202893
Pulled By: miasantreble
fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d
Summary:
Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if they end up with the same file/block. Improve it as following:
1. in LevelIterator, reseek first try to check the boundary of the current file. If it falls into the same file, skip the binary search to find the file
2. in block based table iterator, reseek skip to reseek the iterator block if the seek key is larger than the current key and lower than the index key (boundary of the current block and the next block).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256
Differential Revision: D15105072
Pulled By: siying
fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80
Summary:
Sometimes, users might make mistake of not releasing snapshots before closing the DB. This is undocumented use of RocksDB and the behavior is unknown. We return DB::Close() to provide a way to check it for the users. Aborted() will be returned to users when they call DB::Close().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272
Differential Revision: D15159713
Pulled By: siying
fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd
Summary:
Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269
Differential Revision: D15151285
Pulled By: maysamyabandeh
fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e
Summary:
With atomic flush, RocksDB background flush will flush memtables of a column family up to the largest memtable id in the immutable memtable list. This can introduce a bug in the following scenario. A user thread inserts into a column family until the memtable is full and triggers a flush. This will add the column family to flush_scheduler_. Then the user thread writes another record to the column family. In the PreprocessWrite function, the user thread picks the column family from flush_scheduler_ and schedules a flush request. The flush request gaurantees to flush all the memtables up to the current largest memtable ID of the immutable memtable list. Then the user thread writes new data to the newly-created active memtable. After the write returns, the user thread closes the db. This can cause assertion failure when the background flush thread tries to install superversion for the column family. The solution is to not install flush results if the db has already set `shutting_down_` to true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5254
Differential Revision: D15124149
Pulled By: riversand963
fbshipit-source-id: 0a667a41339dedb5a18bcb01b0bf11c275c04df0
Summary:
Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.
1. Stop creating new table readers when the user explicitly sets readahead size.
2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
3. Add `readahead_size` to db_bench.
**Benchmarks:**
https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
**Test Plan:**
Updated `DBIteratorTest.ReadAhead` test to make sure that:
- no new table readers are created for iterators on setting ReadOptions.readahead_size
- At least "readahead" number of bytes are actually getting read on each iterator read.
TODO later:
- Use similar logic for compactions as well.
- This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
Differential Revision: D15107946
Pulled By: sagar0
fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
Summary:
The newly added test CompactionJobTest.SnapshotRefresh sets the snapshot refresh period to 0 to stress the feature. This results into large number of refresh events, which in turn results into an UBSAN failure when a bitwise shift operand goes beyond the uint64_t size.
The patch fixes that by simplifying the shift logic to be done only by 2 bits after each refresh. Furthermore it verifies that the shift operation does not result in decreasing the refresh period.
Testing:
COMPILE_WITH_UBSAN=1 make -j32 compaction_job_test
./compaction_job_test --gtest_filter=CompactionJobTest.SnapshotRefresh
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5257
Differential Revision: D15106463
Pulled By: maysamyabandeh
fbshipit-source-id: f2718898ea7ba4fa9f7e87b70cf98fe647c0de80
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
Differential Revision: D15086710
Pulled By: maysamyabandeh
fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
Summary:
Currently one thread in RocksDB keeps a WAL file open while another thread
deletes it. Although the first thread never writes to the WAL again, it still
tries to close it in the end. This is fine on POSIX, but can be problematic on
other platforms, e.g. HDFS, etc.. It will either cause a lot of warning messages or
throw exceptions. The solution is to let the second thread close the WAL before deleting it.
RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233
Differential Revision: D15032670
Pulled By: riversand963
fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754
Summary:
We have a DB with ~4k column families and ~70k files. On shutdown, destroying the 4k ColumnFamilyHandle-s takes over 2 minutes. Most of this time is spent in VersionSet::AddLiveFiles() called from FindObsoleteFiles() from ~ColumnFamilyHandleImpl(). It's just iterating over the list of files in memory. This seems completely unnecessary as no obsolete files are actually found since the CFs are not even dropped. This PR fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5238
Differential Revision: D15056342
Pulled By: siying
fbshipit-source-id: 2aa342ef3770b4aa384ce81f8768e485480e4f08
Summary: PR https://github.com/facebook/rocksdb/pull/4899 implemented the general framework for RocksDB secondary instances. This PR adds the support for WAL tailing in `OpenAsSecondary`, which means after the `OpenAsSecondary` call, the secondary is now able to see primary's writes that are yet to be flushed. The secondary can see primary's writes in the WAL up to the moment of `OpenAsSecondary` call starts.
Differential Revision: D15059905
Pulled By: miasantreble
fbshipit-source-id: 44f71f548a30b38179a7940165e138f622de1f10
Summary:
In some cases, we want to known the smallest and largest sequence numbers of sstable files, to help us get more details.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5231
Differential Revision: D15038087
Pulled By: siying
fbshipit-source-id: c473c1ca07b53efe2f1884fa1ecdc8686f455ed8
Summary:
It's hard to get DBIter to directly use InternalIterator::NextAndGetResult() because the code change would be complicated. Instead, use IteratorWrapper, where Next() is already using NextAndGetResult(). Performance number is hard to measure because it is small and ther is variation. I run readseq many times, and there seems to be 1% gain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5214
Differential Revision: D15003635
Pulled By: siying
fbshipit-source-id: 17af1965c409c2fe90cd85037fbd2c5a1364f82a
Summary:
Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174
Differential Revision: D14884185
Pulled By: al13n321
fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points` using an std::stack.
Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
allocated `SavePoints` struct. The destructor of `WriteBatch` simply
deletes this pointer. However, the copy constructor of WriteBatch
just copied that pointer, meaning that copying a WriteBatch with
active savepoints will very likely have crashed before. Now a proper
copy of the savepoints is made in the copy constructor, and not just
a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
the underlying container. A deque is a bit over the top here, as we
only need access to the most recent savepoint (i.e. stack.top()) but
never any elements at the front. std::deque is rather expensive to
initialize in common environments. For example, the STL implementation
shipped with GNU g++ will perform a heap allocation of more than 500
bytes to create an empty deque object. Although the `save_points_`
container is created lazily by RocksDB, moving from a deque to a plain
`std::vector` is much more memory-efficient. So `save_points_` is now
a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
making ownership more explicit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192
Differential Revision: D15024074
Pulled By: maysamyabandeh
fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
Summary:
My compiler doesn't inline DBIter::Next() to arena wrapped iterator, even if it is a direct forward. Adding this annotation makes it inlined. It might not always work but inlinging this function to arena wrapped iterator always feels like the right decision.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5217
Differential Revision: D15004086
Pulled By: siying
fbshipit-source-id: a4cffd79c6fb092669a3a90633c9aa5e494f8a66
Summary:
We found an issue in Periodic Compactions (introduced in #5166) where files were not being picked up for compactions as all the SST files created with older versions of RocksDB have `file_creation_time` as 0. (Note that `file_creation_time` is a new table property introduced in #5166).
To address this, Periodic compactions now fall back to looking at the `creation_time` table property or the file's modification time (as given by the Env) when `file_creation_time` table property is found to be 0.
Here how the file's modification time (and, in turn, the file age) is computed now:
1. Use `file_creation_time` table property if it is > 0.
1. If not, then use `creation_time` table property if it is > 0.
1. If not, then use file's mtime stat metadata given by the underlying Env.
Don't consider the file at all for compaction if the modification time cannot be correctly determined based on the above conditions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5184
Differential Revision: D14907795
Pulled By: sagar0
fbshipit-source-id: 4bb2f3631f9a3e04470c674a1d13544584e1e56c
Summary:
Reorganize the code so that no function call into ReadRangeDelAggregator is needed if there is no tomb range stone.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5202
Differential Revision: D14968155
Pulled By: siying
fbshipit-source-id: 0bd61911293c7a27b4e1b8d57c66d0c4ad6a6a5f
Summary:
Several small changes for Next():
1. Reducing branching by always update local_stats_.next_count_++ even if statistics is null. This should be faster than a branching.
2. Replacing ResetInternalKeysSkippedCounter() in Next() because the valid_ check is not needed in this case.
3. iter_->Valid() should always be true for non merge case. Remove this check.
4. Adding an inline annotation. It ends up with not picked up by my compiler, but it shouldn't hurt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5200
Differential Revision: D15000391
Pulled By: siying
fbshipit-source-id: be97f61c708968234fb8e5cf272b5c2ac07dc4dd
Summary:
In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197
Differential Revision: D14945977
Pulled By: siying
fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
Summary:
`GetOverlappingInputsRangeBinarySearch` firstly use binary search
to find a index in the given range `[begin, end]`. But after find
the index, then use linear search to find the `start_index` and
`end_index`. So the search process degraded to linear time.
Here optmize the search process with below changes:
- use `std::lower_bound` and `std::upper_bound` to get
`lg(n)` search complexity.
- use uniformed lambda for search process.
- simplify process for `within_interval` true or false.
- remove function `ExtendFileRangeWithinInterval`
and `ExtendFileRangeOverlappingInterval`.
Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4987
Differential Revision: D14984192
Pulled By: riversand963
fbshipit-source-id: fae4b8e59a21b7e350718d60cdc94dd55ac81e89
Summary:
this PR fixes the following compile warning:
```
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::Seek(const rocksdb::Slice&)’:
db/memtable.cc:321:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
Slice user_key(ExtractUserKey(k));
^
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::SeekForPrev(const rocksdb::Slice&)’:
db/memtable.cc:338:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
Slice user_key(ExtractUserKey(k));
^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5204
Differential Revision: D14970160
Pulled By: miasantreble
fbshipit-source-id: 388eb089f90c4528cc6d615dd4607fb53ceac705
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138
Differential Revision: D14940106
Pulled By: miasantreble
fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
Summary:
Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() use min_shard_size = 512L * 1024L to determine number of shards, so 1MB will excceeds the size of the whole shard and make the cache excceeds the budget.
Change it to 256KB accordingly.
There shouldn't be obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequently enough.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175
Differential Revision: D14954289
Pulled By: siying
fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
Summary:
This is second attempt for #5101. Original commit message:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.
This patch come with two fixes:
Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter.
Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142
Differential Revision: D14907113
Pulled By: siying
fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
Summary:
Right now, two separate pieces of code are used to create WAL files in DBImpl::Open function of db_impl_open.cc and DBImpl::SwitchMemtable function of db_impl_write.cc. This code change simply creates 1 function called DBImpl::CreateWAL in db_impl_open.cc which is used to replace existing WAL creation logic in DBImpl::Open and DBImpl::SwitchMemtable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5188
Differential Revision: D14942832
Pulled By: vjnadimpalli
fbshipit-source-id: d49230e04c36176015c8c1b422575872f92157fb
Summary:
Found this when test driving the new MultiGet. If you pass unsorted result with sorted_result = false you'll trigger the ASSERT incorrect even though we'll sort down below.
I've also added simple test cover sorted_result=true/false scenario copied from MultiGetSimple.
anand1976
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5195
Differential Revision: D14935475
Pulled By: yizhang82
fbshipit-source-id: 1d2af5e3a003847d965066a16e3b19da68acf170
Summary:
Before using prefix extractor `InDomain()` should be check. All uses in memtable.cc didn't check `InDomain()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190
Differential Revision: D14923773
Pulled By: miasantreble
fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04
Summary:
When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete.
The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache.
A unit test is added to reproduce the race condition with duplicate keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147
Differential Revision: D14758815
Pulled By: maysamyabandeh
fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
Summary:
Since Statistics::measureTime() is deprecated, StatisticsImpl::measureTime() is not implemented. We realized that users might have a wrapped Statistics implementation in which measureTime() is implemented as forwarded to StatisticsImpl, and causes assert failure. In order to make the change less intrusive, we implement StatisticsImpl::measureTime(). We will revisit whether we need to remove it after several releases.
Also, add a test to make sure that a Statistics implementation using the old interface still works.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5181
Differential Revision: D14907089
Pulled By: siying
fbshipit-source-id: 29b6202fd04e30ed6f6adcaeb1000e87f10d1e1a
Summary:
When a new SST file is created via flush or compaction, we dump out the table properties, however only a few table properties are logged. The change here is to log all the table properties
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5168
Differential Revision: D14876928
Pulled By: vjnadimpalli
fbshipit-source-id: 1aca42ad00f9f650761d39e187f8beeb8700149b
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
Summary:
Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165
Differential Revision: D14880709
Pulled By: siying
fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
Summary:
Create new function NPHash64() and GetSliceNPHash64(), which are currently
implemented using murmurhash.
Replace the current direct call of murmurhash() to use the new functions
if the hash results are not used in on-disk format.
This will make it easier to try out or switch to alternative functions
in the uses where data format compatibility doesn't need to be considered.
This part shouldn't have any performance impact.
Also, the sharded cache hash function is changed to the new format, because
it falls into this categoery. It doesn't show visible performance impact
in db_bench results. CPU showed by perf is increased from about 0.2% to 0.4%
in an extreme benchmark setting (4KB blocks, no-compression, everything
cached in block cache). We've known that the current hash function used,
our own Hash() has serious hash quality problem. It can generate a lots of
conflicts with similar input. In this use case, it means extra lock contention
for reads from the same file. This slight CPU regression is worthy to me
to counter the potential bad performance with hot keys. And hopefully this
will get further improved in the future with a better hash function.
cache_test's condition is relaxed a little bit to. The new hash is slightly
more skewed in this use case, but I manually checked the data and see
the hash results are still in a reasonable range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155
Differential Revision: D14834821
Pulled By: siying
fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5