Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778
Differential Revision: D13495930
Pulled By: abhimadan
fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758
Differential Revision: D13439254
Pulled By: abhimadan
fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.
These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
Differential Revision: D13106570
Pulled By: abhimadan
fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
Differential Revision: D13146964
Pulled By: abhimadan
fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
Summary:
`FindFile()` and `FindFileInRange()` actually works as the same
of `std::lower_bound()`. Use `std::lower_bound()` to reduce the
repeated code.
- change `FindFile()` and `FindFileInRange()` to use `std::lower_bound()`
Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4372
Differential Revision: D9919677
Pulled By: ajkr
fbshipit-source-id: f74aaa30e2f80e410e299c5a5bca4eaf2a7a26de
Summary:
Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is so to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingFiles uses user-keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.
Note that this conflicts with
https://github.com/facebook/rocksdb/pull/2769 and cases
`DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to
fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050
Differential Revision: D8844423
Pulled By: ajkr
fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601
Differential Revision: D7253114
Pulled By: miasantreble
fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
A new LevelIterator was recently created. Rename the old one to make unity build happy. It's also not a good idea to have two classes in the same name anyway.
Closes https://github.com/facebook/rocksdb/pull/3499
Differential Revision: D6979325
Pulled By: siying
fbshipit-source-id: 3a032d93fe205650a08e92e5262594731ec726bb
Summary:
Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization.
The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators.
The two-level-iterator now is only used by partitioned index, so it is simplified.
Closes https://github.com/facebook/rocksdb/pull/3406
Differential Revision: D6809041
Pulled By: siying
fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa
Summary:
ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws
Closes https://github.com/facebook/rocksdb/pull/3415
Differential Revision: D6817414
Pulled By: al13n321
fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24
Summary:
Fixes the following ASAN error:
```
==2108042==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fc50ae9b868 at pc 0x7fc5112aff55 bp 0x7fff9eb9dc10 sp 0x7fff9eb9dc08
=== How to use this, how to get the raw stack trace, and more: fburl.com/ASAN ===
READ of size 8 at 0x7fc50ae9b868 thread T0
SCARINESS: 23 (8-byte-read-stack-use-after-scope)
#0 rocksdb/dbformat.h:164 rocksdb::InternalKeyComparator::user_comparator() const
#1 librocksdb_src_rocksdb_lib.so+0x1429a7d rocksdb::RangeDelAggregator::InitRep(std::vector<...> const&)
#2 librocksdb_src_rocksdb_lib.so+0x142ceae rocksdb::RangeDelAggregator::AddTombstones(std::unique_ptr<...>)
#3 librocksdb_src_rocksdb_lib.so+0x1382d88 rocksdb::ForwardIterator::RebuildIterators(bool)
#4 librocksdb_src_rocksdb_lib.so+0x1382362 rocksdb::ForwardIterator::ForwardIterator(rocksdb::DBImpl*, rocksdb::ReadOptions const&, rocksdb::ColumnFamilyData*, rocksdb::SuperVersion*)
#5 librocksdb_src_rocksdb_lib.so+0x11f433f rocksdb::DBImpl::NewIterator(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*)
#6 rocksdb/src/include/rocksdb/db.h:382 rocksdb::DB::NewIterator(rocksdb::ReadOptions const&)
#7 rocksdb/db_range_del_test.cc:807 rocksdb::DBRangeDelTest_TailingIteratorRangeTombstoneUnsupported_Test::TestBody()
#18 rocksdb/db_range_del_test.cc:1006 main
Address 0x7fc50ae9b868 is located in stack of thread T0 at offset 104 in frame
#0 librocksdb_src_rocksdb_lib.so+0x13825af rocksdb::ForwardIterator::RebuildIterators(bool)
```
Closes https://github.com/facebook/rocksdb/pull/3300
Differential Revision: D6612989
Pulled By: ajkr
fbshipit-source-id: e7ea2ed914c1b80a8a29d71d92440a6bd9cbcc80
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks:
[a b c d][f g h]
Then the index key for the first block will be 'e'.
then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already.
This diff can reduce RA in most cases.
Closes https://github.com/facebook/rocksdb/pull/2239
Differential Revision: D4990693
Pulled By: lightmark
fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff
Summary:
to void future bug that caused by the mix of userkey/internalkey
Closes https://github.com/facebook/rocksdb/pull/2084
Differential Revision: D4825889
Pulled By: lightmark
fbshipit-source-id: 28411db
Summary:
merger.h was always a confusing name for me, simply give the file a better name
Closes https://github.com/facebook/rocksdb/pull/1836
Differential Revision: D4505357
Pulled By: IslamAbdelRahman
fbshipit-source-id: 07b28d8
Summary:
change the iterator status to NotSupported as soon as a range tombstone
is encountered by a ForwardIterator.
Closes https://github.com/facebook/rocksdb/pull/1593
Differential Revision: D4246294
Pulled By: ajkr
fbshipit-source-id: aef9f49
Summary:
Previously we used TableCache::NewIterator() for multiple purposes (data
block iterator and range deletion iterator), and returned non-ok status in
the data block iterator. In one case where the caller only used the range
deletion block iterator (9e7cf3469b/db/version_set.cc (L965-L973)),
we didn't check/free the data block iterator containing non-ok status, which
caused a valgrind error.
So, this diff decouples creation of data block and range deletion block iterators,
and updates the callers accordingly. Both functions can return non-ok status
in an InternalIterator. Since the non-ok status is returned in an iterator that the
callers will definitely use, it should be more usable/less error-prone.
Closes https://github.com/facebook/rocksdb/pull/1513
Differential Revision: D4181423
Pulled By: ajkr
fbshipit-source-id: 835b8f5
Summary:
In ForwardIterator::SeekInternal(), we may end up passing empty Slice representing an internal key to InternalKeyComparator::Compare.
and when we try to extract the user key from this empty Slice, we will create a slice with size = 0 - 8 ( which will overflow and cause us to read invalid memory as well )
Scenarios to reproduce these issues are in the unit tests
Closes https://github.com/facebook/rocksdb/pull/1467
Differential Revision: D4136660
Pulled By: lightmark
fbshipit-source-id: 151e128
Summary: As offline discussion with Siying, revert this since it has bug with seek.
Test Plan: make check -j64
Reviewers: yiwu, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65559
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary:
This diff update ForwardIterator to support pinning keys and values, which will allow DBIter to take advantage of that and eliminate memcpy when executing merge operators
This diff is stacked on D61305
Test Plan:
existing tests (updated them to test tailing iterator)
new test
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60009
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.
Test Plan: I am adding unit tests.
Reviewers: andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61503
Summary: I realized I again is wrong about the naming convention. Let me change it to the correct one.
Test Plan: Run unit tests.
Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D55041
Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number.
Test Plan: Add two unit tests for it.
Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54921
Summary:
List of changes:
1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size.
2) Remove unnecessary checks before calling delete operator.
3) Increase code correctness by using size_t type when getting vector's size.
4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage.
5) Fix various lint errors pointed out by 'arc lint'.
Test Plan:
Code review and build:
git diff
make clean
make -j 32 commit-prereq
arc lint
Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51849
Summary:
db_tailing_iter_test was failing on some platforms because of
an incorrect allocation and use. This diff fixes the issue.
Test Plan:
db_tailing_iter_test
Run valgrind for db_tailing_iter_test
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50835
Summary:
Under a tailing workload, there were increased block cache
misses when a memtable was flushed because we were rebuilding iterators
in that case since the version set changed. This was exacerbated in the
case of iterate_upper_bound, since file iterators which were over the
iterate_upper_bound would have been deleted and are now brought back as
part of the Rebuild, only to be deleted again. We now renew the iterators
and only build iterators for files which are added and delete file
iterators for files which are deleted.
Refer to https://reviews.facebook.net/D50463 for previous version
Test Plan: DBTestTailingIterator.TailingIteratorTrimSeekToNext
Reviewers: anthony, IslamAbdelRahman, igor, tnovak, yhchiang, sdong
Reviewed By: sdong
Subscribers: yhchiang, march, dhruba, leveldb, lovro
Differential Revision: https://reviews.facebook.net/D50679
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
It looks like in some cases an assert in SeekInternal failed when computing the
hints for the next level because user_key was the same as the largest key and
not strictly smaller. Relaxing the assert to expect smaller or equal keys.
Test Plan: make clean all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46443
Summary: When computing the hint for GetNextLevelIndex(), ForwardIterator was doing a redundant comparison. This patch fixes the comparison (using https://github.com/facebook/rocksdb/blob/master/db/version_set.cc#L158 as a reference) and moves it inside an assert because we expect `level_files[f_idx]` to contain the next key after Seek(), so user_key should always be smaller than the largest key.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: tnovak, sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46227
Summary:
This diff fixes a case when the forward iterator misses a new
insert when the mutable iterator is not current. The test is also
improved and the check for deleted iterators is made more informative.
Test Plan: DBTailingIteratorTest.*Trim
Reviewers: tnovak, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46167
Summary:
I noticed that memtable iterator usually crosses the `iterate_upper_bound`
threshold when tailing. Changes introduced in D43833 made `NeedToSeekImmutable`
always return true in such case, even when `Seek()` only needs to rewind the
memtable iterator. In a test I ran, this caused the "tailing efficiency"
(ratio of calls to `Seek()` that only affect the memtable versus all seeks)
to drop almost to zero.
This diff attempts to fix the regression by using a different flag to indicate
that `current_` is over the limit instead of resetting `valid_` in
`UpdateCurrent()`.
Test Plan: `DBTestTailingIterator.TailingIteratorUpperBound`
Reviewers: sdong, rven
Reviewed By: rven
Subscribers: dhruba, march
Differential Revision: https://reviews.facebook.net/D45909
Summary:
The immutable memtable iterators are allocated from an arena and there
is no benefit from deleting these. Also the immutable memtables
themselves will continue to be in memory until the version set
containing it is alive. We will not remove immutable memtable iterators
over the upper bound. We now add immutable iterators to the test.
Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext
Reviewers: tnovak, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45597
Summary:
We have earlier added a feature to delete file iterators when the
current key is over the iterate upper bound. We now add a whitebox test
to check if the file iterators were actually deleted.
Test Plan: Add check for a range which has deleted iterators.
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45321
Summary:
After deleting file iterators which are over the iterate upper
bound, we also need to check for null pointers in
ResetIncompletIterators.
Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext
Reviewers: tnovak, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45525
Summary: Removing two unused variables that prevented compilation.
Test Plan: make all
Reviewers: rven, sdong, yhchiang, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D44991
Summary:
This diff improves the memory utilization for tailing iterators RocksDB,
by freeing file iterators which are over the upper bound.
It is an updating on Siying's original diff for improving the memory usage for
tailing iterators. The changes for the seek and next path are now complete
and a test has been added to exercise these paths while deleting file iterators
which are above the upper bound.
Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext
Reviewers: march, tnovak, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43833
Summary:
Reseek mutable_iter if it is invalid in Next and immutable_iter
is invalid.
Test Plan: DBTestTailingIterator.TailingIteratorSeekToNext
Reviewers: tnovak, march, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D44865
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193