Summary:
Current implementation of `current_over_upper_bound_` fails to take into consideration that keys might be invalid in either base iterator or delta iterator. Calling key() in such scenario will lead to assertion failure and runtime errors.
This PR addresses the bug by adding check for valid keys before calling `IsOverUpperBound()`, also added test coverage for iterate_upper_bound usage in BaseDeltaIterator
Also recommit https://github.com/facebook/rocksdb/pull/4656 (It was reverted earlier due to bugs)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4702
Differential Revision: D13146643
Pulled By: miasantreble
fbshipit-source-id: 6d136929da12d0f2e2a5cea474a8038ec5cdf1d0
Summary:
Full block (use_block_based_builder=false) Bloom filter has clear CPU saving benefits but with limitation of using temp memory when building an SST file proportional to the SST file size. We reduced the chance of having large SST files with multi-level universal compaction. Now we change to a default with better performance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4735
Differential Revision: D13266674
Pulled By: siying
fbshipit-source-id: 7594a4c3e32568a5a2adce22bb0e46553e55c602
Summary:
Currently tests are failing on master with the following message:
> util/jemalloc_nodump_allocator.cc:132:8: error: unused parameter ‘options’ [-Werror=unused-parameter]
Status NewJemallocNodumpAllocator(
This PR attempts to fix the issue
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4738
Differential Revision: D13278804
Pulled By: miasantreble
fbshipit-source-id: 64a6204aa685bd85d8b5080655cafef9980fac2f
Summary:
The fix in #4727 for double snapshot release was incomplete since it does not properly remove the duplicate entires in the snapshot list after finding that a snapshot is still valid. The patch does that and also improves the unit test to show the issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4734
Differential Revision: D13266260
Pulled By: maysamyabandeh
fbshipit-source-id: 351e2c40cca45a87b757774c11af74182314911e
Summary:
Add option to limit tcache usage by allocation size. This is to reduce total tcache size in case there are many user threads accessing the allocator and incur non-trivial memory usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4736
Differential Revision: D13269305
Pulled By: yiwu-arbug
fbshipit-source-id: 95a9b7fc67facd66837c849137e30e137112e19d
Summary:
**Summary:**
Simplified the code layout by moving FIFOCompactionPicker to a separate file.
**Why?:**
While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724
Differential Revision: D13227914
Pulled By: sagar0
fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3
Summary:
There is a race condition in DBFlushTest.SyncFail, as illustrated below.
```
time thread1 bg_flush_thread
| Flush(wait=false, cfd)
| refs_before=cfd->current()->TEST_refs() PickMemtable calls cfd->current()->Ref()
V
```
The race condition between thread1 getting the ref count of cfd's current
version and bg_flush_thread incrementing the cfd's current version makes it
possible for later assertion on refs_before to fail. Therefore, we add test
sync points to enforce the order and assert on the ref count before and after
PickMemtable is called in bg_flush_thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4633
Differential Revision: D12967131
Pulled By: riversand963
fbshipit-source-id: a99d2bacb7869ec5d8d03b24ef2babc0e6ae1a3b
Summary:
there is chance that
* the caller tries to repair the db when holding the db_lock, in
that case the env implementation might not set the `lock`
parameter of Repairer::Run().
* the caller somehow never calls Repairer::Run().
either way, the desctructor of Repair will compare the uninitialized
db_lock_ with nullptr, and tries to unlock it. there is good chance
that the db_lock_ is not nullptr, then boom.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4683
Differential Revision: D13260287
Pulled By: riversand963
fbshipit-source-id: 878a119d2e9f10a0fa17ee62cf3fb24b33d49fa5
Summary:
Add a dummy main() in sst_file_reader_test for ROCKSDB_LITE to fix link failure in regression
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4725
Differential Revision: D13252885
Pulled By: anand1976
fbshipit-source-id: 0e22b964815e2bf01aff7d03ed4ae59d44fa86f1
Summary:
Currently the garbage collection of items in old_commit_map_ was done upon ::ReleaseSnapshot. The assumption behind this method was that the sequence number of snapshots are unique, which is incorrect. In the very rare cases that two consecutive snapshot have the same sequence number this could lead the release of the first snapshot affect the old_commit_map_ that is necessary to service the reads of the second snapshot. The bug would be triggered only if i) two snapshot have the same seq, ii) both of them are very old (older than the last ~4m transactions), and iii) there is commit entry overlapping with the snapshot seq number.
It is fixed by doing the cleanup of old_commit_map_ in UpdateSnapshot: the new list of snapshots are compared with the old one and the missing sequence numbers are concluded released. If two snapshots have the same seq number, after the release of one of them, the seq number still appears in the snapshot least and thus not cleaned up prematurely.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4727
Differential Revision: D13246495
Pulled By: maysamyabandeh
fbshipit-source-id: 93b87a5042afd8060889df245526d3f5d29de9fe
Summary:
Fix block based table reader not using memory_allocator when allocating index blocks and compression dictionary blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4678
Differential Revision: D13054594
Pulled By: yiwu-arbug
fbshipit-source-id: 379f25bcc665395662511c4f873f4b7b55104ce2
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.
These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
Differential Revision: D13106570
Pulled By: abhimadan
fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
Summary:
If user do not end the trace manually, the tracing will continue which can potential use up all the storage space and cause problem. In this PR, the max trace file size is added to the TraceOptions and user can set the value if they need or the default is 64GB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610
Differential Revision: D12893400
Pulled By: zhichao-cao
fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953
Summary:
A user friendly sst file reader is useful when we want to access sst
files outside of RocksDB. For example, we can generate an sst file
with SstFileWriter and send it to other places, then use SstFileReader
to read the file and process the entries in other ways.
Also rename the original SstFileReader to SstFileDumper because of
name conflict, and seems SstFileDumper is more appropriate for tools.
TODO: there is only a very simple test now, because I want to get some feedback first.
If the changes look good, I will add more tests soon.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717
Differential Revision: D13212686
Pulled By: ajkr
fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
Summary:
DeleteRange is now ready for production use. Change the header comment to reflect this, and update HISTORY.md with the feature's status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4709
Differential Revision: D13209055
Pulled By: abhimadan
fbshipit-source-id: 65423eb1a4927cf593c38254cd87c322f73ae137
Summary:
Adding sanity check test for mapping of `Histograms` and `HistogramsNameMap`
```
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from StatisticsTest
[ RUN ] StatisticsTest.SanityTickers
[ OK ] StatisticsTest.SanityTickers (0 ms)
[ RUN ] StatisticsTest.SanityHistograms
[ OK ] StatisticsTest.SanityHistograms (0 ms)
[----------] 2 tests from StatisticsTest (0 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (0 ms total)
[ PASSED ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4720
Differential Revision: D13217061
Pulled By: ajkr
fbshipit-source-id: 6427f4e684c36b2f3c3440808b74fee86a364683
Summary:
Added back `NO_ITERATORS` and moved `NO_ITERATOR_CREATED` to the end of `toCppTickers`.
This is a leftover fix which is needed in addition to a138e351bc to correctly convert java tickers to c++ tickers. a138e351bc only updated `toJavaTickerType` but both `toJavaTickerType` and `toCppTickers` need to be changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4719
Differential Revision: D13208847
Pulled By: sagar0
fbshipit-source-id: 53a42f3d6ffe04034acfde972d73040b92b4c1af
Summary:
The error message of databases/rocksdb-lite (FreeBSD port) is as follows:
```
tools/db_bench_tool.cc:1976:16: error: private field 'trace_options_' is not used [-Werror,-Wunused-private-field]
TraceOptions trace_options_;
^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4715
Differential Revision: D13207902
Pulled By: ajkr
fbshipit-source-id: be3c612eba656aeddb77e35e2f201dd25dc92f7e
Summary:
Make CompactionOptionsFIFO's ttl and allow_compaction options to be available in RocksJava.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4609
Differential Revision: D12849503
Pulled By: sagar0
fbshipit-source-id: 47baa97918d252370f234c36c1af15ff2dad7658
Summary:
Previously, every range tombstone iterator was seeked on every
ShouldDelete call, which quickly degraded performance for long range
scans. This PR improves performance by tracking iterator positions and
only advancing iterators when necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4677
Differential Revision: D13205373
Pulled By: abhimadan
fbshipit-source-id: 80c199dace1e19362a4c61c686bf01913eae87cb
Summary:
We haven't been populating `NO_FILE_CLOSES` since v1.5.8 even though it was never marked as deprecated. Start populating it again. Conveniently `DeleteTableReader` has an unused `void*` argument that we can use...
Blame: 63f216ee0aCloses#4700.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4703
Differential Revision: D13146769
Pulled By: ajkr
fbshipit-source-id: ad8d6fb0493e701f60a165a3bca1787d255be008
Summary:
…ons (#4676)"
This reverts commit b32d087dbb.
`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697
Differential Revision: D13133490
Pulled By: yiwu-arbug
fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
Differential Revision: D13146964
Pulled By: abhimadan
fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
Summary:
Since a range tombstone seen at one level will cover all keys
in the range at lower levels, there was a short-circuiting check in Get
that reported a key was not found at most one file after the range
tombstone was discovered. However, this was incorrect for merge
operands, since a deletion might only cover some merge operands,
which implies that the key should be found. This PR fixes this logic in
the Version portion of Get, and removes the logic from the MemTable
portion of Get, since the perforamnce benefit provided there is minimal.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698
Differential Revision: D13142484
Pulled By: abhimadan
fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10
Summary:
- Added back the `NO_ITERATORS` that was removed in 5945e16dfc.
- Marked it as deprecated since it is no longer populated, but kept for API compatibility.
- Made sure the new tickers, `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETED`, are appended at the end of the enum, in case people are relying on the int values.
The change where `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETED` were introduced is unreleased so I believe it is ok to change their ordering.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4701
Differential Revision: D13142887
Pulled By: ajkr
fbshipit-source-id: 29a336ce5b46632ce50ad42ccc4a29013f71d6d6
Summary:
Found a callsite for `moo_translate` in the Scuba warnings and realized we have a few calls to `define()` left in fbcode.
- I ran the `DefineCodemod` script against fbcode
- Fixed broken tests, and ensured that tests that are explicitly testing the behaviour of `define()` were not changed.
bypass-lint
Reviewed By: kmeht
Differential Revision: D12968447
fbshipit-source-id: d8fd3649a2ce9868b8938d293e1bebf1a6d2fad8
Summary:
WriteBufferManger is not invoked when allocating memory for memtable if the limit is not set even if a cache is passed. It is inconsistent from the comment syas. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695
Differential Revision: D13112722
Pulled By: siying
fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4
Summary:
This is a quick fix for the uninitialized bugs in `LiveFileMetaData` and `SstFileMetaData` that were uncovered in #4686.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4693
Differential Revision: D13113189
Pulled By: ajkr
fbshipit-source-id: 18e798d031d2a59d0b55fc010c135e0126f4042d
Summary:
This fixes an assertion.
An atomic flush can have multiple flush jobs. Some of them may fail. If any of
them fails, we need to rollback all of them.
For the flush jobs that do fail, we already call `RollbackMemTableFlush` in
`FlushJob::Run`. The tricky part is for flush jobs that have completed
successfully. We need to call `RollbackMemTableFlush` for them as well.
The newly added DBAtomicFlushTest.AtomicFlushRollbackSomeJobs will SigAbort
without the corresponding change in AtomicFlushMemTablesToOutputFiles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4641
Differential Revision: D12943649
Pulled By: riversand963
fbshipit-source-id: c66a4a664a1e0938e938fd41edc5a70c34cdd868
Summary:
Rather than storing a `vector<RangeTombstone>`, we now store a
`vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A
`RangeTombstoneStack` contains the start and end keys of a range tombstone
fragment, and indices into the seqnum vector to indicate which sequence
numbers the fragment is located at. The diagram below illustrates an
example:
```
tombstones_: [a, b) [c, e) [h, k)
| \ / \ / |
| \ / \ / |
v v v v
tombstone_seqs_: [ 5 3 10 7 2 8 6 ]
```
This format allows binary searching the tombstone list to use less key
comparisons, which helps in cases where there are many overlapping
tombstones. Also, this format makes it easier to add DBIter-like
semantics to `FragmentedRangeTombstoneIterator` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632
Differential Revision: D13053103
Pulled By: abhimadan
fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0
Summary:
DBTest.SanitizeNumThreads Sometimes fails. The test waited for 10ms timeout and expect all threads scheduled to be executed. This can be a source of flakiness. Make a check every 1ms and up to 10s.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4659
Differential Revision: D13074174
Pulled By: siying
fbshipit-source-id: b1d5ff87a326a4fc9eab8d1cc307bbb940dfe70c
Summary:
The new flag makes it possible to constrain iterator traversal
by the upper/lower bound the iterator is expected to pass. This allows
seekrandom results to be more easily comparable between DBs with and
without deletions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4660
Differential Revision: D13053111
Pulled By: abhimadan
fbshipit-source-id: 33e250f2e2d210b54c7726399da30a33f723c33c
Summary:
When iterator becomes invalid, there are two possibilities.
First, all data in the column family have been scanned and there is nothing
more to scan.
Second, an underlying error has occurred, causing `status()` to be !ok.
Therefore, we need to check for both cases when `!iter->Valid()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4648
Differential Revision: D12959601
Pulled By: riversand963
fbshipit-source-id: 49c9382c9ea9e78f2e2b6f3708f0670b822ca8dd
Summary:
`GenSubcompactionBoundaries` calls `VersionSet::ApproximateSize` which gets BlockBasedTableReader for every file and seeks in its index block to find `key`'s offset. If the table or index block aren't in memory already, this involves I/O. This can be improved by releasing DB mutex when calling ApproximateSize.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4630
Differential Revision: D13052653
Pulled By: miasantreble
fbshipit-source-id: cae31d46d10d0860fa8a26b8d5154b2d17d1685f
Summary:
We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed.
The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there.
Some code is slightly refactored so that the cachable information can flow better.
Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650
Differential Revision: D12969070
Pulled By: siying
fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed
Summary:
Currently transaction iterator does not apply `ReadOptions.iterate_upper_bound` when iterating. This PR attempts to fix the problem by having `BaseDeltaIterator` enforcing the upper bound check when iterator state is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4656
Differential Revision: D13039257
Pulled By: miasantreble
fbshipit-source-id: 909eb9f6b4597a4d80418fb139f32ec82c6ec1d1
Summary:
It called the autovector::push_back simply in autovector::emplace_back.
This was not efficient, and then optimazed this function through the
perfect forwarding.
This was the src and result of the benchmark(using the google'benchmark library, the type of elem in
autovector was std::string, and call emplace_back with the "char *" type):
https://gist.github.com/monadbobo/93448b89a42737b08cbada81de75c5cd
PS: The benchmark's result of previous PR was not accurate, and so I update the test case and result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4606
Differential Revision: D13046813
Pulled By: sagar0
fbshipit-source-id: 19cde1bcadafe899aa454b703acb35737a1cc02d
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.
It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676
Differential Revision: D13047662
Pulled By: yiwu-arbug
fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
Summary:
We used to have a bug, which caused every block to be read twice, and none of our tests caught it. Add a very simply unit test to make sure that when reading a data block, we only issue one pread against the SST file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4657
Differential Revision: D13005260
Pulled By: siying
fbshipit-source-id: 03167b554ad2451192b1707415536d7d05e9026c
Summary:
As pointed out in #4059, we miss use string as buffer for file read. Changing to use char array instead.
Closing #4059
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4662
Differential Revision: D13012998
Pulled By: yiwu-arbug
fbshipit-source-id: 41234ba17c0bccea65bd647e362a0e979152bd1e