Summary:
We potentially need this information for tracing, profiling and diagnosis.
Closes https://github.com/facebook/rocksdb/pull/4026
Differential Revision: D8555214
Pulled By: riversand963
fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2
Summary:
b555ed30a4 makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound.
For example, if an SST file has a data block containing following keys : {a, z}
The user sets the upper bound to be "x", and it executed following queries:
Seek("b")
Seek("c")
Seek("d")
Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again.
To prevent this regression case, we keep the current data block iterator if it is upper bound.
Closes https://github.com/facebook/rocksdb/pull/4004
Differential Revision: D8463192
Pulled By: siying
fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81
Summary:
The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843.
This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy.
Closes https://github.com/facebook/rocksdb/pull/3996
Differential Revision: D8416186
Pulled By: miasantreble
fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28
Summary:
Don't call the OnTableFileCreated listener callback when a 0 size SST
file gets created by Flush. Doing so causes an assertion failure in db_stress. It is also not correct behavior as we call env->DeleteFile() for such files right before the notification.
Closes https://github.com/facebook/rocksdb/pull/4003
Differential Revision: D8461385
Pulled By: anand1976
fbshipit-source-id: ae92d4f921c2e2cff981ad58f4929ed8b609f35d
Summary:
Add a new DB property to DB::GetProperty(), which returns the option.statistics. Test is updated to pass.
Closes https://github.com/facebook/rocksdb/pull/3966
Differential Revision: D8311139
Pulled By: zhichao-cao
fbshipit-source-id: ea78f4727358c807b0e5a0ea62e09defb10ad9ac
Summary:
Verify table will load SST into `TableCache`
it occupy memory & `TableCache`‘s capacity ...
but no logic use them
it's unnecessary ...
so , we verify them after all sub compact finished
Closes https://github.com/facebook/rocksdb/pull/3979
Differential Revision: D8389946
Pulled By: ajkr
fbshipit-source-id: 54bd4f474f9e7b3accf39c3068b1f36a27ec4c49
Summary:
The SST file sizes changed slightly after the improvement of PR #3970
which reduces the size of the properties block. Before PR #3970 a size
ratio compaction included all of the first four flushed files but it
only includes two files after. We increase the size_ratio universal
compaction option to make that compaction include all four files again.
Closes https://github.com/facebook/rocksdb/pull/3995
Differential Revision: D8426925
Pulled By: fgwu
fbshipit-source-id: 1429c38672e9f4fb4d4881fd4b06db45c4861d62
Summary:
A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong:
Seek(key);
if (!Valid()) SeekToLast();
Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases.
This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev().
Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case.
Closes https://github.com/facebook/rocksdb/pull/3989
Differential Revision: D8385422
Pulled By: siying
fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4
Summary:
format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters.
Closes https://github.com/facebook/rocksdb/pull/3958
Differential Revision: D8294615
Pulled By: maysamyabandeh
fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe
Summary:
CompactFiles checked whether the existing files conflicted with the chosen compaction. But it missed checking whether future files would conflict, i.e., when another compaction was simultaneously writing new files to the same range at the same output level.
Closes https://github.com/facebook/rocksdb/pull/3926
Differential Revision: D8218996
Pulled By: ajkr
fbshipit-source-id: 21cb00a6fed4c8c62d3ed2ff810962e6bdc2fdfb
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying .
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954
Differential Revision: D8272041
Pulled By: miasantreble
fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
Summary:
format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942
Differential Revision: D8238413
Pulled By: maysamyabandeh
fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
Summary:
Previous commit https://github.com/facebook/rocksdb/pull/3935 unhide a few test options which includes kDirectIO. However it's not supported by RocksDB lite. Need to hide this option from the lite build.
Closes https://github.com/facebook/rocksdb/pull/3943
Differential Revision: D8242757
Pulled By: miasantreble
fbshipit-source-id: 1edfad3a5d01a46bfb7eedee765981ebe02c500a
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).
This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881
Differential Revision: D8076288
Pulled By: ajkr
fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
Summary:
DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests.
Closes https://github.com/facebook/rocksdb/pull/3935
Differential Revision: D8230748
Pulled By: maysamyabandeh
fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db
Summary:
DBImpl::FindObsoleteFiles() may call GetChildren() multiple times if different CFs are on the same path. Fix it.
Closes https://github.com/facebook/rocksdb/pull/3885
Differential Revision: D8084634
Pulled By: siying
fbshipit-source-id: b471fbc251f6a05e9243304dc14c0831060cc0b0
Summary:
In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind.
Closes https://github.com/facebook/rocksdb/pull/3924
Differential Revision: D8210184
Pulled By: siying
fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a
Summary:
This is still WIP, but I'm hoping for early feedback on the overall approach.
This patch implements deletion triggered compaction, which till now only
worked for leveled, for universal style. SST files are marked for
compaction by the CompactOnDeletionCollertor table property. This is
expected to be used when free disk space is low and the user wants to
reclaim space by deleting a bunch of keys. The deletions are expected to
be dense. In such a situation, we want to avoid a full compaction due to
its space overhead.
The strategy used in this case is similar to leveled. We pick one file
from the set of files marked for compaction. We then expand the inputs
to a clean cut on the same level, and then pick overlapping files from
the next non-mepty level. Picking files from the next level can cause
the key range to expand, and we opportunistically expand inputs in the
source level to include files wholly in this key range.
The main side effect of this is that it breaks the property of no time
range overlap between levels. This shouldn't break any functionality.
Closes https://github.com/facebook/rocksdb/pull/3860
Differential Revision: D8124397
Pulled By: anand1976
fbshipit-source-id: bfa2a9dd6817930e991b35d3a8e7e61304ed3dcf
Summary:
Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
Closes https://github.com/facebook/rocksdb/pull/3894
Differential Revision: D8118775
Pulled By: maysamyabandeh
fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
Summary:
Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609).
There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change.
To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed.
Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised.
Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion.
Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler.
Test plan
```
$ make -j16
$ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
```
Expected behavior:
All tests should pass.
Closes https://github.com/facebook/rocksdb/pull/3898
Differential Revision: D8149421
Pulled By: riversand963
fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
Summary:
Add flush_before_backup to rocksdb_backup_engine_create_new_backup. make c api able to control the flush before backup behavior.
Closes https://github.com/facebook/rocksdb/pull/3897
Differential Revision: D8157676
Pulled By: ajkr
fbshipit-source-id: 88998c62f89f087bf8672398fd7ddafabbada505
Summary:
Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache.
Closes https://github.com/facebook/rocksdb/pull/3877
Differential Revision: D8100895
Pulled By: yiwu-arbug
fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da
Summary:
Explicitly specify the underlying type of enums help developers understand the physical storage.
Closes https://github.com/facebook/rocksdb/pull/3892
Differential Revision: D8107027
Pulled By: riversand963
fbshipit-source-id: a00efecbba46df4a3c8eed0994a2d4972ad1a1d3
Summary:
DBTest.GroupCommitTest would often fail when run under valgrind because its sleeps were insufficient to guarantee a group commit had multiple entries. Instead we can use sync point to force a leader to wait until a non-leader thread has enqueued its work, thus guaranteeing a leader can do group commit work for multiple threads.
Closes https://github.com/facebook/rocksdb/pull/3883
Differential Revision: D8079429
Pulled By: ajkr
fbshipit-source-id: 61dc50fad29d2c85547842f681288de60fa29049
Summary:
By using WritableFileWriter rather than WritableFile directly, we can buffer multiple Append() calls to one write() file system call, which will be expensive to underlying Env without its own write buffering.
Closes https://github.com/facebook/rocksdb/pull/3882
Differential Revision: D8080673
Pulled By: siying
fbshipit-source-id: e0db900cb3c178166aa738f3985db65e3ae2cf1b
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601
Differential Revision: D7253114
Pulled By: miasantreble
fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
Summary:
Change `keys_` from `set<string>` to `vector<set<string>>` so that each column
family's keys are stored in one set.
ajkr When you have a chance, can you PTAL? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3871
Differential Revision: D8056447
Pulled By: riversand963
fbshipit-source-id: 650d0f9cad02b1bc005fc329ad76edbf053e6386
Summary:
`RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion.
Closes https://github.com/facebook/rocksdb/pull/3875
Differential Revision: D8063667
Pulled By: ajkr
fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41
Summary:
Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the the first key is larger than the second.
Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty.
Closes https://github.com/facebook/rocksdb/pull/3836
Differential Revision: D7959762
Pulled By: siying
fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683
Summary:
this will fix the failing clang_check test
Closes https://github.com/facebook/rocksdb/pull/3868
Differential Revision: D8050880
Pulled By: miasantreble
fbshipit-source-id: 749932e2e4025f835c961c068d601e522a126da6
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
log_ contract specifies that it should not be modified unless both mutex_ and log_write_mutex_ are held. log_.erase however does that with only holding mutex_. This causes a race condition with two_write_queues since logs_.back is read with holding only log_write_mutex_ (which is correct according to logs_ contract) but logs_.erase is called concurrently. This is probably the cause of logs_.back returning nullptr in https://github.com/facebook/rocksdb/issues/3852 although I could not reproduce it.
Fixes https://github.com/facebook/rocksdb/issues/3852
Closes https://github.com/facebook/rocksdb/pull/3859
Differential Revision: D8026103
Pulled By: maysamyabandeh
fbshipit-source-id: ee394e00fe4aa520d884c5ef87981e9d6b5ccb28
Summary:
TSAN reports a false alarm for lock-order-inversion in DBWriteTest.IOErrorOnWALWritePropagateToWriteThreadFollower but Open and FlushWAL are not run concurrently. Suppressing the error by skipping FlushWAL in the test until TSAN is fixed.
The alternative would be to use
```
TSAN_OPTIONS="suppressions=tsan-suppressions.txt" ./db_write_test
```
but it does not seem straightforward to integrate it to our test infra.
Closes https://github.com/facebook/rocksdb/pull/3854
Differential Revision: D8000202
Pulled By: maysamyabandeh
fbshipit-source-id: fde33483d963a7ad84d3145123821f64960a4802
Summary:
This feature was introduced for universal compaction in cc01985d. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level.
Closes https://github.com/facebook/rocksdb/pull/3835
Differential Revision: D7957179
Pulled By: ajkr
fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4
Summary:
Currently manual_wal_flush if set in the options will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also update the related unit tests.
This PR is built on top of https://github.com/facebook/rocksdb/pull/3756
Closes https://github.com/facebook/rocksdb/pull/3824
Differential Revision: D7909153
Pulled By: maysamyabandeh
fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.
This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829
Differential Revision: D7915443
Pulled By: ajkr
fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
Summary:
`ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the `mmap`'d memory region. This PR:
- prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files
- adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer
Closes https://github.com/facebook/rocksdb/pull/3813
Differential Revision: D7891513
Pulled By: ajkr
fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d
Summary:
tsan flavor of this test occasionally times out in our test infra. The patch split the test to two, each working on half of the option range.
Before:
[ OK ] FaultTest/FaultInjectionTest.FaultTest/0 (5918 ms)
[ OK ] FaultTest/FaultInjectionTest.FaultTest/1 (5336 ms)
After:
[ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/0 (2930 ms)
[ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/1 (2676 ms)
[ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/2 (2759 ms)
[ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/3 (2546 ms)
Closes https://github.com/facebook/rocksdb/pull/3819
Differential Revision: D7894975
Pulled By: maysamyabandeh
fbshipit-source-id: 809f1411cbcc27f8aa71a6b29a16b039f51b67c9
Summary:
The origin commit #3635 will hurt performance for users who aren't using range deletions, because unneeded std::set operations, so it was reverted by commit 44653c7b7a. (see #3672)
To fix this, move the set to and add a check in , i.e., file will be added only if is non-nullptr.
The db_bench command which find the performance regression:
> ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 > --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 > --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 > -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none
Before and after the modification, I re-run this command on the machine, the results of are as follows:
**fillrandom**
Table | P50 | P75 | P99 | P99.9 | P99.99 |
---- | --- | --- | --- | ----- | ------ |
before commit | 5.92 | 8.57 | 19.63 | 980.97 | 12196.00 |
after commit | 5.91 | 8.55 | 19.34 | 965.56 | 13513.56 |
**seekrandomwhilewriting**
Table | P50 | P75 | P99 | P99.9 | P99.99 |
---- | --- | --- | --- | ----- | ------ |
before commit | 1418.62 | 1867.01 | 3823.28 | 4980.99 | 9240.00 |
after commit | 1450.54 | 1880.61 | 3962.87 | 5429.60 | 7542.86 |
Closes https://github.com/facebook/rocksdb/pull/3800
Differential Revision: D7874245
Pulled By: ajkr
fbshipit-source-id: 2e8bec781b3f7399246babd66395c88619534a17
Summary:
Delete archive directory before WAL folder
since archive may be contained as a subfolder.
Also improve loop readability.
Closes https://github.com/facebook/rocksdb/pull/3797
Differential Revision: D7866378
Pulled By: riversand963
fbshipit-source-id: 0c45d97677ce6fbefa3f8d602ef5e2a2a925e6f5
Summary:
ManualCompactionTest.Test occasionally times out in tsan flavor of our test infra. The patch reduces the number of keys to make the test run faster. The change does not seem to negatively impact the coverage of the test.
Closes https://github.com/facebook/rocksdb/pull/3802
Differential Revision: D7865596
Pulled By: maysamyabandeh
fbshipit-source-id: b4f60e32c3ae1677e25506f71c766e33fa985785
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
Summary:
When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless.
Closes https://github.com/facebook/rocksdb/pull/3688
Differential Revision: D7537741
Pulled By: miasantreble
fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6
Summary:
A minor change: I wrapped TransactionLogIterator for the C API.
I needed that for the golang binding.
Closes https://github.com/facebook/rocksdb/pull/3304
Differential Revision: D6628736
Pulled By: miasantreble
fbshipit-source-id: 3374f3c64b1d7b225696b8767090917761e2f30a
Summary:
Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
I add a `max_subcompactions` options to `CompactionOptions` so that we can choose a proper concurrency dynamically.
Closes https://github.com/facebook/rocksdb/pull/3775
Differential Revision: D7792357
Pulled By: ajkr
fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c
Summary:
We use `queued_for_flush_` to indicate a column family has been added to the
flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used
`pending_compaction_` which can also be ambiguous.
Closes https://github.com/facebook/rocksdb/pull/3781
Differential Revision: D7790063
Pulled By: riversand963
fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804
Summary:
With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::ScheedulePendingFlush
```
if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) {
...
}
```
`Pending` is ambiguous, and I feel `queued_for_flush` is a better name,
especially for the sake of readability.
Closes https://github.com/facebook/rocksdb/pull/3777
Differential Revision: D7783066
Pulled By: riversand963
fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229
Summary:
sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want.
Closes https://github.com/facebook/rocksdb/pull/3767
Differential Revision: D7760136
Pulled By: siying
fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f
Summary:
Currently, the `rocksdb_options_set_ratelimiter` in `c.cc` will change the input to nil, which make it is
not possible to use the shared rate limiter create by `rocksdb_ratelimiter_create` in different rocksdb option.
In this pr, I changed it to shared ptr.
Closes https://github.com/facebook/rocksdb/pull/3758
Differential Revision: D7749740
Pulled By: ajkr
fbshipit-source-id: c6121f8ca75402afdb4b295ce63c2338d253a1b5
Summary:
There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.
These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
Closes https://github.com/facebook/rocksdb/pull/3602
Differential Revision: D7251562
Pulled By: al13n321
fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d
Summary:
This reverts commit 73f21a7b21.
It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error:
"Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"
This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
Closes https://github.com/facebook/rocksdb/pull/3762
Differential Revision: D7730035
Pulled By: siying
fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
Summary:
1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
number of keys successfully read
2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
DBImpl::GetImpl(), but not MultiGet
Closes https://github.com/facebook/rocksdb/pull/3730
Differential Revision: D7677364
Pulled By: anand1976
fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5
Summary:
The WriteBatch::Iterate will try with a larger sequence number if the memtable reports a duplicate. This status is specified with TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when WAL is created via WritePrepared since it always appends a batch separator if a natural one does not exist. However when reading a WAL generated by WriteCommitted this batch separator might not exist. Although WritePrepared is not supposed to be able to read the WAL generated by WriteCommitted we should avoid confusing scenarios in which the behavior becomes unpredictable. The path fixes that by allowing TryAgain even for the last entry of the write batch.
Closes https://github.com/facebook/rocksdb/pull/3747
Differential Revision: D7708391
Pulled By: maysamyabandeh
fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d
Summary:
Fix the following gcc-8 warnings:
- conflicting C language linkage declaration [-Werror]
- writing to an object with no trivial copy-assignment [-Werror=class-memaccess]
- array subscript -1 is below array bounds [-Werror=array-bounds]
Solves https://github.com/facebook/rocksdb/issues/3716
Closes https://github.com/facebook/rocksdb/pull/3736
Differential Revision: D7684161
Pulled By: yiwu-arbug
fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a
Summary:
Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error status are not checked. This could lead to potential corruption as later calls will overwrite the error status.
Closes https://github.com/facebook/rocksdb/pull/3740
Differential Revision: D7678848
Pulled By: miasantreble
fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77
Summary:
The reason for this initialization is that LLVM UBSAN check will fail due to
uninitialized bool. [StackOverflow post](https://stackoverflow.com/questions/31420154/runtime-error-load-of-value-127-which-is-not-a-valid-value-for-type-bool).
UBSAN log:
> ===== Running external_sst_file_basic_test
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from ExternalSSTFileBasicTest
[ RUN ] ExternalSSTFileBasicTest.Basic
[ OK ] ExternalSSTFileBasicTest.Basic (6 ms)
[ RUN ] ExternalSSTFileBasicTest.NoCopy
db/external_sst_file_ingestion_job.h:23:8: runtime error: load of value 253, which is not a valid value for type 'bool'
miasantreble I've tested this locally using the following command.
```
TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 OPT=-g make J=1 -j8 ubsan_check
```
ajkr This PR is related to your review comment in [PR](https://github.com/facebook/rocksdb/pull/3713/). It turns out that, with UBSAN enabled, we must provide a default value for boolean member variables.
Closes https://github.com/facebook/rocksdb/pull/3728
Differential Revision: D7642476
Pulled By: riversand963
fbshipit-source-id: 4c09a4b8d271151cb99ae7393db9e4ad9f29762e
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718
Reviewed By: maysamyabandeh
Differential Revision: D7621192
Pulled By: miasantreble
fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
Summary:
When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up read.
After this change, we can collect informations of range deletions and store them into user properties which can guide our manual compaction.
Closes https://github.com/facebook/rocksdb/pull/3695
Differential Revision: D7570322
Pulled By: ajkr
fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369
Summary:
RocksDB supports ingestion of external ssts. If ingestion_options.move_files is true, when performing ingestion, RocksDB first tries to link external ssts. If external SST file resides on a different FS, or the underlying FS does not support hard link, then RocksDB performs actual file copy. However, no matter which choice is made, current code increase bytes-written when updating compaction stats, which is inaccurate when RocksDB does NOT copy file.
Rename a sync point.
Closes https://github.com/facebook/rocksdb/pull/3713
Differential Revision: D7604151
Pulled By: riversand963
fbshipit-source-id: dd0c0d9b9a69c7d9ffceafc3d9c23371aa413586
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662
Differential Revision: D7426121
Pulled By: Dayvedde
fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
Summary:
Add `compaction_reason` as part of event log for event `compaction started`.
Add counters for each `CompactionReason`.
Closes https://github.com/facebook/rocksdb/pull/3679
Differential Revision: D7550348
Pulled By: riversand963
fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d
Summary:
In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion:
```
db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
```
We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
Closes https://github.com/facebook/rocksdb/pull/3700
Differential Revision: D7572596
Pulled By: ajkr
fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1
Summary:
1. Remove redundant text.
2. Make terminology consistent across all comments and doc of RocksDB. Also do
our best to conform to conventions. Specifically, use 'callback' instead of
'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)).
Closes https://github.com/facebook/rocksdb/pull/3693
Differential Revision: D7560396
Pulled By: riversand963
fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8
Summary:
In this case, we add input files of compaction, not outputs.
Closes https://github.com/facebook/rocksdb/pull/3686
Differential Revision: D7556781
Pulled By: ajkr
fbshipit-source-id: ae135bb6eda60db8f275a9ba2d21c18aaadef5b7
Summary:
- inflate the argument passed as `max_compact_bytes_per_del_file` by a bit (10%). The intent of this argument is prevent L0 files from being intra-L0 compacted multiple times. Without compression, some intra-L0 compactions exceed this limit (and thus aren't executed), even though none of their files have gone through intra-L0 before.
- fix `FindIntraL0Compaction` as it was rejecting some valid intra-L0 compactions. In particular, `compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len), whereas `new_compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len+1). The former is more correct for checking whether we've found an eligible span.
Closes https://github.com/facebook/rocksdb/pull/3684
Differential Revision: D7530396
Pulled By: ajkr
fbshipit-source-id: cad4f50902bdc428ac9ff6fffb13eb288648d85e
Summary:
Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex.
Closes https://github.com/facebook/rocksdb/pull/3683
Differential Revision: D7529393
Pulled By: maysamyabandeh
fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102
Differential Revision: D6951697
Pulled By: ajkr
fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
Summary:
Primitive types constness does not affect the signature of the
method and has no influence on whether the overriding method would
actually have that const bool instead of just bool. In addition,
it is rarely useful but does produce a compatibility warnings
in VS 2015 compiler.
Closes https://github.com/facebook/rocksdb/pull/3663
Differential Revision: D7475739
Pulled By: ajkr
fbshipit-source-id: fb275378b5acc397399420ae6abb4b6bfe5bd32f
Summary:
currently rocksdb lite build fails due to the following errors:
> db/db_sst_test.cc:29:51: error: ‘FlushJobInfo’ does not name a type
virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
^
db/db_sst_test.cc:29:16: error: ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’ marked ‘override’, but does not override
virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
^
db/db_sst_test.cc:24:7: error: ‘class rocksdb::FlushedFileCollector’ has virtual functions and accessible non-virtual destructor [-Werror=non-virtual-dtor]
class FlushedFileCollector : public EventListener {
^
db/db_sst_test.cc: In member function ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’:
db/db_sst_test.cc:31:35: error: request for member ‘file_path’ in ‘info’, which is of non-class type ‘const int’
flushed_files_.push_back(info.file_path);
^
cc1plus: all warnings being treated as errors
make: *** [db/db_sst_test.o] Error 1
Closes https://github.com/facebook/rocksdb/pull/3676
Differential Revision: D7493006
Pulled By: miasantreble
fbshipit-source-id: 77dff0a5b23e27db51be9b9798e3744e6fdec64f
Summary:
Ttl-triggered and snapshot-release-triggered compactions should not be considered as manual compactions. This is a bug.
Closes https://github.com/facebook/rocksdb/pull/3678
Differential Revision: D7498151
Pulled By: sagar0
fbshipit-source-id: a2d5bed05268a4dc93d54ea97a9ae44b366df15d
Summary:
Level Compaction with TTL.
As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
This functionality can be controlled by the newly introduced column family option -- ttl.
TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591
Differential Revision: D7275442
Pulled By: sagar0
fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
Summary:
The is an optimization to reduce lookup in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time that the snapshot was taken and if the sequence number of the read data is lower than that number it assumes the data as committed.
To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
Closes https://github.com/facebook/rocksdb/pull/3649
Differential Revision: D7388630
Pulled By: maysamyabandeh
fbshipit-source-id: b79506238c17467d590763582960d4d90181c600
Summary:
Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true.
Closes https://github.com/facebook/rocksdb/pull/3670
Differential Revision: D7457683
Pulled By: amytai
fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5
Summary:
This patch record the deleted WAL numbers in the manifest to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Closes https://github.com/facebook/rocksdb/pull/3488
Differential Revision: D6967893
Pulled By: maysamyabandeh
fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
Summary:
When using two_write_queue, the published seq and the last allocated sequence could be ahead of the LastSequence, even if both write queues are stopped as in WriteRecoverableState. The patch fixes a bug in WriteRecoverableState in which LastSequence was used as a reference but the result was applied to last fetched sequence and last published seq.
Closes https://github.com/facebook/rocksdb/pull/3665
Differential Revision: D7446099
Pulled By: maysamyabandeh
fbshipit-source-id: 1449bed9aed8e9db6af85946efd347cb8efd3c0b
Summary:
Currently if the CommitTimeWriteBatch is set to be used only as a state that is required only for recovery , the user cannot see that in DB until it is restarted. This while the state is already inserted into the DB after the memtable flush. It would be useful for debugging if make this state visible to the user after the flush by committing it. The patch does it by a invoking a callback that does the commit on the recoverable state.
Closes https://github.com/facebook/rocksdb/pull/3661
Differential Revision: D7424577
Pulled By: maysamyabandeh
fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c
Summary:
Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files, but this PR targets sst.
Also add a unit test to verify that this PR fixes the issue.
The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows
```
timeline user_thread background_compaction thread
t1 | FindObsoleteFiles(full_scan=false)
t2 | FindObsoleteFiles(full_scan=true)
t3 | PurgeObsoleteFiles
t4 | PurgeObsoleteFiles
V
```
When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`.
To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files.
ajkr could you take a look and comment? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3638
Differential Revision: D7384372
Pulled By: riversand963
fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3
Summary:
Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if two_write_queues_ option is set. The patch fixes the problem by i) skip log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, ii) protects log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not.
Fixes#3599
Closes https://github.com/facebook/rocksdb/pull/3656
Differential Revision: D7405608
Pulled By: maysamyabandeh
fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934
Summary:
Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches.
Closes https://github.com/facebook/rocksdb/pull/3651
Differential Revision: D7388635
Pulled By: maysamyabandeh
fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4
Summary:
RangeDelAggregator will remember the files whose range tombstones have been added,
so the caller can check whether the file has been added before call AddTombstones.
Closes https://github.com/facebook/rocksdb/pull/3635
Differential Revision: D7354604
Pulled By: ajkr
fbshipit-source-id: 9b9f7ec130556028df417e650711554b46d8d107
Summary:
Summary
========
`InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that.
Rough performance comparison
=====
Big keys, no compression.
```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom : 4.222 micros/op 236836 ops/sec; 80.4 MB/s
```
```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom : 4.064 micros/op 246059 ops/sec; 83.5 MB/s
```
TODO
======
In ~~a separated~~ this PR:
- [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely.
- [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key.
Closes https://github.com/facebook/rocksdb/pull/3516
Differential Revision: D7059300
Pulled By: ajkr
fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb