Summary:
Preliminary user-timestamp support for delete.
If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100.
Implementation
A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp.
The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, should we still reuse `kTypeDeletion`, we may drop the tombstone with a more recent timestamp, causing deleted keys to re-appear.
Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253
Reviewed By: ltamasi
Differential Revision: D20995328
Pulled By: riversand963
fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
Summary:
Towards making compaction logic compatible with user timestamp.
When computing boundaries and overlapping ranges for inputs of compaction, We need to compare SSTs by user key without timestamp.
Test plan (devserver):
```
make check
```
Several individual tests:
```
./version_set_test --gtest_filter=VersionStorageInfoTimestampTest.GetOverlappingInputs
./db_with_timestamp_compaction_test
./db_with_timestamp_basic_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6645
Reviewed By: ltamasi
Differential Revision: D20960012
Pulled By: riversand963
fbshipit-source-id: ad377fa9eb481bf7a8a3e1824aaade48cdc653a4
Summary:
(Based on Yanqin's idea) Add a new field in readoptions as lower timestamp bound for iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serves as a bounded time window. The iterator returns all versions of a record falling in the window.
SeekRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit e860f8840):
seekrandom : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found)
This PR:
seekrandom : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found)
db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544
Reviewed By: ltamasi
Differential Revision: D20844069
Pulled By: riversand963
fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.
Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```
Test plan (dev server)
```
$make check
```
Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
Differential Revision: D19438227
Pulled By: riversand963
fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
Summary:
Cleanup some code without any real change in functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440
Differential Revision: D20015891
Pulled By: riversand963
fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314
Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.
Differential Revision: D19458717
fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
Summary:
Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex.
Also patch existing test for the mentioned code path.
Signed-off-by: tabokie <xy.tao@outlook.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051
Differential Revision: D18596274
Pulled By: ltamasi
fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278
Summary:
Doing some code reordering in DBIter::Seek() and DBIter::SeekForPrev().
The logic largely remains the same, except slight difference when handling some stats when valid_ = false, where they are not supposed to be used anyway.
Also remove prefix_start_key_, which sometimes point a part of seek target, some times prefix_start_buf_, which is confusing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5794
Test Plan: Run all tests.
Differential Revision: D17375257
fbshipit-source-id: 7339a23898cecd3a8475bf72340fcd6f82b933c5
Summary:
Move definition and implementation for ArenaWrappedDBIter into its own .h/.cc files. Also, change inlining of functions to better comply with the Google C++ style guide.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5801
Test Plan: make check
Differential Revision: D17371012
Pulled By: anand1976
fbshipit-source-id: c1361abc2851575111e357a63d88be3b3d6cb341
Summary:
Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009
Differential Revision: D17288733
fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
Summary:
There are a number of fixes in this PR (with most bugs found via the added stress tests):
1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snaphsot validation would not be enforced, breaking some assumptions in Prev() iteration.
3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
4. Add stress test to test reading own writes.
5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
7. Add some more comments everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573
Differential Revision: D16276089
Pulled By: lth
fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
Summary:
PR #5111 reduced the number of key comparisons when iterating with
upper/lower bounds; however, this caused a regression for MyRocks.
Reverting to the previous behavior in BlockBasedTableIterator as a hotfix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5428
Differential Revision: D15721038
Pulled By: ltamasi
fbshipit-source-id: 5450106442f1763bccd17f6cfd648697f2ae8b6c
Summary:
util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376
Differential Revision: D15550938
Pulled By: siying
fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387
Differential Revision: D15579036
Pulled By: siying
fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375
Differential Revision: D15550935
Pulled By: siying
fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
Summary:
Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111
Differential Revision: D15330061
Pulled By: siying
fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024
Summary:
Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden to the user. However, if previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number.
We could extend it from seqnum 0 case to simply prev_seqno >= current_seqno. However, it is less robust with bug or unexpected situations, while the gain is relatively low. We can always extend it later when needed.
In a readseq benchmark with full formed LSM-tree, number of key comparisons called is reduced from 2.981 to 2.165. readseq against a fully compacted DB, no key comparison is called. Performance in this benchmark didn't show obvious improvement, which is expected because key comparisons only takes small percentage of CPU. But it may show up to be more effective if users have an expensive customized comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244
Differential Revision: D15067257
Pulled By: siying
fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9
Summary:
It's hard to get DBIter to directly use InternalIterator::NextAndGetResult() because the code change would be complicated. Instead, use IteratorWrapper, where Next() is already using NextAndGetResult(). Performance number is hard to measure because it is small and ther is variation. I run readseq many times, and there seems to be 1% gain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5214
Differential Revision: D15003635
Pulled By: siying
fbshipit-source-id: 17af1965c409c2fe90cd85037fbd2c5a1364f82a
Summary:
My compiler doesn't inline DBIter::Next() to arena wrapped iterator, even if it is a direct forward. Adding this annotation makes it inlined. It might not always work but inlinging this function to arena wrapped iterator always feels like the right decision.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5217
Differential Revision: D15004086
Pulled By: siying
fbshipit-source-id: a4cffd79c6fb092669a3a90633c9aa5e494f8a66
Summary:
Several small changes for Next():
1. Reducing branching by always update local_stats_.next_count_++ even if statistics is null. This should be faster than a branching.
2. Replacing ResetInternalKeysSkippedCounter() in Next() because the valid_ check is not needed in this case.
3. iter_->Valid() should always be true for non merge case. Remove this check.
4. Adding an inline annotation. It ends up with not picked up by my compiler, but it shouldn't hurt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5200
Differential Revision: D15000391
Pulled By: siying
fbshipit-source-id: be97f61c708968234fb8e5cf272b5c2ac07dc4dd
Summary:
In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197
Differential Revision: D14945977
Pulled By: siying
fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.
The following benchmark shows +0.26% higher throughput in seekrandom benchmark.
Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG 10 runs] : 20355 ops/sec; 225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec; 225.9 MB/sec
./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG 10 runs] : 20409 ops/sec; 225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec; 226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049
Differential Revision: D14366459
Pulled By: maysamyabandeh
fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
Summary:
Currently `perf_context.user_key_comparison_count` is bump only in `InternalKeyComparator`. For places user comparator is used directly the counter is not bump. Fixing the majority of it.
Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098
Differential Revision: D14603753
Pulled By: siying
fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
Summary:
Introduce CPU timers for iterator seek and next operations. Seek
counter includes SeekToFirst, SeekToLast and SeekForPrev, w/ the
caveat that SeekToLast timer doesn't include some post processing
time if upper bound is defined.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5076
Differential Revision: D14525218
Pulled By: fredfsh
fbshipit-source-id: 03ba25df3b22b06c072621e4de0eacfa1445f0d9
Summary:
The patch reorders DBIter fields to put 1-byte fields together and let the compiler optimize the memory usage by using less 64-bit allocations for bools and enums.
This might have a negative side effect of putting the variables that are accessed together into different cache lines and hence increasing the cache misses. Not sure what benchmark would verify that thought. I ran simple, single-threaded seekrandom benchmarks but the variance in the results is too much to be conclusive.
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5078
Differential Revision: D14562676
Pulled By: maysamyabandeh
fbshipit-source-id: 2284655d46e079b6e9a860e94be5defb6f482167
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778
Differential Revision: D13495930
Pulled By: abhimadan
fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758
Differential Revision: D13439254
Pulled By: abhimadan
fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.
These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
Differential Revision: D13106570
Pulled By: abhimadan
fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
Differential Revision: D13146964
Pulled By: abhimadan
fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
Summary:
Currently, `Statistics` can record tick by `recordTick()` whose second parameter is an `uint64_t`.
That means tick can only increase.
If we want to reduce tick, we have to work around like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`.
That's kind of a hack.
So, this PR divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase only.
Fixes#3013 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498
Differential Revision: D10395010
Pulled By: sagar0
fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2
Summary:
Even during `DBIter::Prev()`, there is a case where we need to use `RangeDelPositioningMode::kForwardTraversal`. In particular, when we hit too many internal keys for a single user key, we use seek to find the newest internal key. If it's a merge operand, we then scan forwards, collecting the merge operands. This forward scan should be using `RangeDelPositioningMode::kForwardTraversal`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4481
Differential Revision: D10319507
Pulled By: ajkr
fbshipit-source-id: b5ce7352461f3a7696b28a5136ae0076f2bde51f
Summary:
In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file.
Replay of Iterator is not supported in the current implementation.
Tested with trace_analyzer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228
Differential Revision: D9201381
Pulled By: zhichao-cao
fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab
Summary:
This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.
I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.
With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.
ajkr please take a look when you have a minute!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014
Differential Revision: D8773253
Pulled By: ajkr
fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.
There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955
Differential Revision: D8286688
Pulled By: lth
fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601
Differential Revision: D7253114
Pulled By: miasantreble
fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes https://github.com/facebook/rocksdb/pull/3525
Differential Revision: D7033694
Pulled By: sagar0
fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
Summary:
ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws
Closes https://github.com/facebook/rocksdb/pull/3415
Differential Revision: D6817414
Pulled By: al13n321
fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24
Summary:
In DBIter, Prev() calls FindValueForCurrentKey() to search the current value backward. If it finds that there are too many stale value being skipped, it falls back to FindValueForCurrentKeyUsingSeek(), seeking directly to the key with snapshot sequence. After introducing read_callback, however, the key it seeks to might not be visible, according to read_callback. It thus needs to keep searching forward until the first visible value.
Closes https://github.com/facebook/rocksdb/pull/3382
Differential Revision: D6756148
Pulled By: yiwu-arbug
fbshipit-source-id: 064e39b1eec5e083af1c10142600f26d1d2697be
Summary:
A proper implementation of Iterator::Refresh() for WritePreparedTxnDB would require release and acquire another snapshot. Since MyRocks don't make use of Iterator::Refresh(), we just simply mark it as not supported.
Closes https://github.com/facebook/rocksdb/pull/3290
Differential Revision: D6599931
Pulled By: yiwu-arbug
fbshipit-source-id: 4e1632d967316431424f6e458254ecf9a97567cf
Summary:
Since #1665, on merge error, iterator will be set to corrupted status, but it doesn't invalidate the iterator. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3226
Differential Revision: D6499094
Pulled By: yiwu-arbug
fbshipit-source-id: 80222930f949e31f90a6feaa37ddc3529b510d2c
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205
Differential Revision: D6438959
Pulled By: maysamyabandeh
fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
Summary:
When Seek a key less than `lower_bound`, should return `lower_bound`.
ajkr PTAL
Closes https://github.com/facebook/rocksdb/pull/3199
Differential Revision: D6421126
Pulled By: ajkr
fbshipit-source-id: a06c825830573e0040630704f6bcb3f7f48626f7