Summary:
Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.
For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR.
Tests:
Update existing unit tests to verify the correct timeout value is being passed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751
Reviewed By: riversand963
Differential Revision: D21285631
Pulled By: anand1976
fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
Summary:
An assertion that a char == a CompressionType (unsigned char)
originally cast from a char can fail if the original value is negative,
due to numeric promotion. The assertion should pass even if the value
is invalid CompressionType, because the callee
UncompressBlockContentsForCompressionType checks for that and reports
status appropriately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6780
Test Plan:
Temporarily change kZSTD = 0x88 and see tests fail. Make this
change (in addition), and tests pass.
Reviewed By: siying
Differential Revision: D21328498
Pulled By: pdillinger
fbshipit-source-id: 61caf8d815581ce49261ecb7ab0f396e9ac4bb92
Summary:
Added functions for parsing, serializing, and comparing elements to OptionTypeInfo. These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map. Using these functions, every type can be handled via the map rather than special cased.
By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422
Test Plan: pass make check
Reviewed By: siying
Differential Revision: D21269005
Pulled By: zhichao-cao
fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077
Summary:
Since read threads do not coordinate on loading data into block
cache, two threads between Lookup and Insert can end up loading and
inserting the same data. This is particularly concerning with
cache_index_and_filter_blocks since those are hot and more likely to
be race targets if ejected from (or not pre-populated in) the cache.
Particularly with moves toward disaggregated / network storage, the cost
of redundant retrieval might be high, and we should at least have some
hard statistics from which we can estimate impact.
Example with full filter thrashing "cliff":
$ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
...
$ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
rocksdb.block.cache.add COUNT : 14181
rocksdb.block.cache.add.failures COUNT : 0
rocksdb.block.cache.add.redundant COUNT : 476
rocksdb.block.cache.data.add COUNT : 12749
rocksdb.block.cache.data.add.redundant COUNT : 18
rocksdb.block.cache.filter.add COUNT : 1003
rocksdb.block.cache.filter.add.redundant COUNT : 217
rocksdb.block.cache.index.add COUNT : 429
rocksdb.block.cache.index.add.redundant COUNT : 241
$ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
rocksdb.block.cache.add COUNT : 1182223
rocksdb.block.cache.add.failures COUNT : 0
rocksdb.block.cache.add.redundant COUNT : 302728
rocksdb.block.cache.data.add COUNT : 31425
rocksdb.block.cache.data.add.redundant COUNT : 12
rocksdb.block.cache.filter.add COUNT : 795455
rocksdb.block.cache.filter.add.redundant COUNT : 130238
rocksdb.block.cache.index.add COUNT : 355343
rocksdb.block.cache.index.add.redundant COUNT : 172478
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681
Test Plan: Some manual testing (above) and unit test covering key metrics is included
Reviewed By: ltamasi
Differential Revision: D21134113
Pulled By: pdillinger
fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
Summary:
In https://github.com/facebook/rocksdb/pull/6455, we modified the interface of `RandomAccessFileReader::Read` to be able to get rid of memcpy in direct IO mode.
This PR applies the new interface to `BlockFetcher` when reading blocks from SST files in direct IO mode.
Without this PR, in direct IO mode, when fetching and uncompressing compressed blocks, `BlockFetcher` will first copy the raw compressed block into `BlockFetcher::compressed_buf_` or `BlockFetcher::stack_buf_` inside `RandomAccessFileReader::Read` depending on the block size. then during uncompressing, it will copy the uncompressed block into `BlockFetcher::heap_buf_`.
In this PR, we get rid of the first memcpy and directly uncompress the block from `direct_io_buf_` to `heap_buf_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6689
Test Plan: A new unit test `block_fetcher_test` is added.
Reviewed By: anand1976
Differential Revision: D21006729
Pulled By: cheng-chang
fbshipit-source-id: 2370b92c24075692423b81277415feb2aed5d980
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389
Reviewed By: siying
Differential Revision: D21163707
Pulled By: zhichao-cao
fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
Summary:
When index block is empty or an error happens while reading it,
`Invalidate()` is called rather than `Initialize()`. So `Seek()` must
not refer to member variables that are only initialized in
`Initialize()` until it is sure `Initialize()` has been called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6736
Reviewed By: siying
Differential Revision: D21139641
Pulled By: ajkr
fbshipit-source-id: 71c58cc1adbd795dc3729dd5023bf7df1515ff32
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:
* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697
Test Plan: make check and CI
Reviewed By: cheng-chang
Differential Revision: D21020318
Pulled By: pdillinger
fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
Summary:
BlockBasedTableBuilder::ParallelCompressionRep::first_block can be read in
Flush() and written in BGWorkWriteRawBlock() concurrently. This commit fixes
the issue by reading first_block out before pushing the block to compression
and write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6640
Test Plan: Run all tests concurrently with TSAN.
Reviewed By: cheng-chang
Differential Revision: D20851370
fbshipit-source-id: 6f039222e8319d31e15f1b45e05c106527253f72
Summary:
In index blocks since `format_version=3`, user keys are written
rather than internal keys. When reading such blocks, the comparator is
obtained via `InternalKeyComparator::user_comparator()`. That function
must not return an unwrapped result as the wrapper class provides
accounting logic to populate `PerfContext::user_key_comparison_count`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6650
Test Plan:
ran db_bench and verified
`PerfContext::user_key_comparison_count` became larger.
Reviewed By: cheng-chang
Differential Revision: D20866325
Pulled By: ajkr
fbshipit-source-id: ad755d46bda31157dacc5b66e532279f19ad538c
Summary:
This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterator as well once its proven stable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538
Test Plan:
crash_test
make check
Reviewed By: riversand963
Differential Revision: D20714347
Pulled By: anand1976
fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7
Summary:
On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key.
The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669
Test Plan:
New unit test
Signed-off-by: Yi Wu <yiwu@pingcap.com>
Reviewed By: cheng-chang
Differential Revision: D20931808
Pulled By: ajkr
fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
Summary:
It was incorrectly counting time even for blocks that didn't need decompression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6658
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D20883522
Pulled By: anand1976
fbshipit-source-id: 33c9c4683f54cad150ab260a69e3ef8aa9aff76a
Summary:
https://github.com/facebook/rocksdb/pull/6262 causes CLANG analyze to complain. Add assertion to suppress the warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6641
Test Plan: Run "clang analyze" and make sure it passes.
Reviewed By: anand1976
Differential Revision: D20841722
fbshipit-source-id: 5fa6e0c5cfe7a822214c9b898a408df59d4fd2cd
Summary:
This is a predecessor to the Configurable PR. This change moves the OptionTypeInfo maps closer to where they will be used.
When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198
Reviewed By: siying
Differential Revision: D20778108
Pulled By: zhichao-cao
fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa
Summary:
With https://github.com/facebook/rocksdb/issues/6262, TSAN complains about data race of some variables. Those variables are used to estimate file size and are accessed in writer and background threads. Since file size estimation doesn't have to be 100% accurate, we make some variables atomic and use relaxed memory order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6636
Test Plan: Run all tests with TSAN.
Reviewed By: anand1976
Differential Revision: D20820635
fbshipit-source-id: 1ea45ff38be15e33674ffe06b7d42fc9fe161ea5
Summary:
With https://github.com/facebook/rocksdb/issues/6262, UBSAN fails with "division by zero":
[ RUN ] Timestamp/DBBasicTestWithTimestampCompressionSettings.PutAndGetWithCompaction/3
internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:1066:39: runtime error: division by zero
#0 0x7ffb3117b071 in rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:1066
https://github.com/facebook/rocksdb/issues/1 0x7ffb311775e1 in rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:848
https://github.com/facebook/rocksdb/issues/2 0x7ffb311771a2 in rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:832
This is caused by not returning immediately after CompressAndVerifyBlock call
in WriteBlock when rep_->status == kBuffered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6633
Test Plan: Run all existing test.
Reviewed By: anand1976
Differential Revision: D20808366
fbshipit-source-id: 09f24b7c0fbaf4c7a8fc48cac61fa6fcb9b85811
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262
Reviewed By: ajkr
Differential Revision: D20651306
fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
Summary:
These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615
Test Plan: Add a unit test to make sure that SetOptions() can change the options.
Reviewed By: riversand963
Differential Revision: D20755951
fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff
Summary:
The checksum generator should be released if file_writer fails to reset the pointer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6611
Test Plan: pass make asan_check
Reviewed By: riversand963
Differential Revision: D20742964
Pulled By: zhichao-cao
fbshipit-source-id: cde41be2edb3d1e56083c2b93e1510fb32556146
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600
Test Plan: tested with make asan_check
Reviewed By: riversand963
Differential Revision: D20717670
Pulled By: zhichao-cao
fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.
This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602
Test Plan: make check
Reviewed By: siying, zhichao-cao
Differential Revision: D20683216
Pulled By: cheng-chang
fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Summary:
As the first step of reintroducing eviction statistics for the block
cache, the patch switches from using simple function pointers as deleters
to function objects implementing an interface. This will enable using
deleters that have state, like a smart pointer to the statistics object
that is to be updated when an entry is removed from the cache. For now,
the patch adds a deleter template class `SimpleDeleter`, which simply
casts the `value` pointer to its original type and calls `delete` or
`delete[]` on it as appropriate. Note: to prevent object lifecycle
issues, deleters must outlive the cache entries referring to them;
`SimpleDeleter` ensures this by using the ("leaky") Meyers singleton
pattern.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545
Test Plan: `make asan_check`
Reviewed By: siying
Differential Revision: D20475823
Pulled By: ltamasi
fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a
Summary:
Add timestamp support for MultiGet().
timestamp from readoptions is honored, and timestamps can be returned along with values.
MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 17bef7d3a):
multireadrandom : 104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found)
This PR:
multireadrandom : 104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found)
.\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483
Reviewed By: anand1976
Differential Revision: D20498373
Pulled By: riversand963
fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627
Summary:
By supporting direct IO in RandomAccessFileReader::MultiRead, the benefits of parallel IO (IO uring) and direct IO can be combined.
In direct IO mode, read requests are aligned and merged together before being issued to RandomAccessFile::MultiRead, so blocks in the original requests might share the same underlying buffer, the shared buffers are returned in `aligned_bufs`, which is a new parameter of the `MultiRead` API.
For example, suppose alignment requirement for direct IO is 4KB, one request is (offset: 1KB, len: 1KB), another request is (offset: 3KB, len: 1KB), then since they all belong to page (offset: 0, len: 4KB), `MultiRead` only reads the page with direct IO into a buffer on heap, and returns 2 Slices referencing regions in that same buffer. See `random_access_file_reader_test.cc` for more examples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6446
Test Plan: Added a new test `random_access_file_reader_test.cc`.
Reviewed By: anand1976
Differential Revision: D20097518
Pulled By: cheng-chang
fbshipit-source-id: ca48a8faf9c3af146465c102ef6b266a363e78d1
Summary:
https://github.com/facebook/rocksdb/pull/6531 removed some code in partitioned index seek logic. By mistake the logic of storing previous index offset is removed, while the logic of using it is preserved, so that the code might use wrong value to determine reseeking condition.
This will trigger a bug, if following a Seek() not going to the last block, SeekToLast() is called, and then Seek() is called which should position the cursor to the block before SeekToLast().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6551
Test Plan: Add a unit test that reproduces the bug. In the same unit test, also some reseek cases are covered to avoid regression.
Reviewed By: pdillinger
Differential Revision: D20493990
fbshipit-source-id: 3919aa4861c0481ec96844e053048da1a934b91d
Summary:
Define a private member variable only in debug mode. Without fix, build will fail
```
In file included from table/block_based/partitioned_index_iterator.cc:9:
./table/block_based/partitioned_index_iterator.h:125:32: error: private field 'icomp_' is not used [-Werror,-Wunused-private-field]
const InternalKeyComparator& icomp_;
```
Test plan (dev server)
1. make check
2. Make sure fixed in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6547
Reviewed By: siying
Differential Revision: D20480027
Pulled By: pdillinger
fbshipit-source-id: 288bc94280e240c3136335b6c73eb1ccb0db459d
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
Summary:
block_based_table_reader.cc is a giant file, which makes it hard for users to navigate the code. Divide the files to multiple files.
Some class templates cannot be moved to .cc file. They are moved to .h files. It is still better than including them all in block_based_table_reader.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6527
Test Plan: "make all check" and "make release". Also build using cmake.
Differential Revision: D20428455
fbshipit-source-id: ca713c698469f07f35bc0c271358c0874ed4eb28
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.
Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```
Test plan (dev server)
```
$make check
```
Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
Differential Revision: D19438227
Pulled By: riversand963
fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
Summary:
In direct IO mode, RandomAccessFileReader::Read allocates an internal aligned buffer, and then copies the result into the scratch buffer. If the result is only temporarily used inside a function, there is no need to do the memcpy and just let the result Slice refer to the internally allocated buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6455
Test Plan: make check
Differential Revision: D20106753
Pulled By: cheng-chang
fbshipit-source-id: 44f505843837bba47a56e3fa2c4dd3bd76486b58
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.
ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 72ee067b9):
101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found)
This PR:
100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found)
./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409
Differential Revision: D20200086
Pulled By: riversand963
fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
Summary:
If file is not mmaped, CuckooTableReader should not try to read table properties from the file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6453
Test Plan: Added a new unit test
Differential Revision: D20103334
Pulled By: cheng-chang
fbshipit-source-id: 48539f14d93f6c1ebe12c3df5a14719e9d7b8726
Summary:
Original author: jeffrey-xiao
If we are writing a global seqno for an ingested file, the range
tombstone metablock gets accessed and put into the cache during
ingestion preparation. At the time, the global seqno of the ingested
file has not yet been determined, so the cached block will not have a
global seqno. When the file is ingested and we read its range tombstone
metablock, it will be returned from the cache with no global seqno. In
that case, we use the actual seqnos stored in the range tombstones,
which are all zero, so the tombstones cover nothing.
This commit removes global_seqno_ variable from Block. When iterating
over a block, the global seqno for the block is determined by the
iterator instead of storing this mutable attribute in Block.
Additionally, this commit adds a regression test to check that keys are
deleted when ingesting a file with a global seqno and range deletion
tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429
Differential Revision: D19961563
Pulled By: ajkr
fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
Summary:
Cleanup some code without any real change in functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440
Differential Revision: D20015891
Pulled By: riversand963
fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
`index` can be replaced by `iter`, saving the computation of `index++`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6406
Test Plan: make check
Differential Revision: D19905056
Pulled By: cheng-chang
fbshipit-source-id: add4638959c0d2e4e77a11f3fa04ffabaf0de790
Summary:
Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403
Test Plan:
Unit tests
Tried running stress tests but couldn't repro the stress failure
Differential Revision: D19846721
Pulled By: anand1976
fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4
Summary:
This reverts commit d70011bccc. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401
Differential Revision: D19826623
Pulled By: anand1976
fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.
Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.
Differential Revision: D19171461
Pulled By: zhichao-cao
fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
Summary:
Currently, any IO errors and checksum mismatches while reading data
blocks, are being ignored by the batched MultiGet. Its only looking at
the GetContext state. Fix that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387
Test Plan: Add unit tests
Differential Revision: D19799819
Pulled By: anand1976
fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314
Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.
Differential Revision: D19458717
fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
Summary:
https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash.
Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked
Also, a null pointer check is added so that a bug like this won't cause segfault in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328
Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass.
Differential Revision: D19586751
fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42
Summary:
Remove the redundant PartitionedFilterBlockBuilder::num_added_ and ::NumAdded since the parent class, FullFilterBlockBuilder, already provides them.
Also rename filters_in_partition_ and filters_per_partition_ to keys_added_to_partition_ and keys_per_partition_ to improve readability.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6299
Test Plan: make check
Differential Revision: D19413278
Pulled By: pdillinger
fbshipit-source-id: 04926ee7874477d659cb2b6ae03f2d995fb747e5
Summary:
With many millions of keys, the old Bloom filter implementation
for the block-based table (format_version <= 4) would have excessive FP
rate due to the limitations of feeding the Bloom filter with a 32-bit hash.
This change computes an estimated inflated FP rate due to this effect
and warns in the log whenever an SST filter is constructed (almost
certainly a "full" not "partitioned" filter) that exceeds 1.5x FP rate
due to this effect. The detailed condition is only checked if 3 million
keys or more have been added to a filter, as this should be a lower
bound for common bits/key settings (< 20).
Recommended remedies include smaller SST file size, using
format_version >= 5 (for new Bloom filter), or using partitioned
filters.
This does not change behavior other than generating warnings for some
constructed filters using the old implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6317
Test Plan:
Example with warning, 15M keys @ 15 bits / key: (working_mem_size_mb is just to stop after building one filter if it's large)
$ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=15000000 2>&1 | grep 'FP rate'
[WARN] [/block_based/filter_policy.cc:292] Using legacy SST/BBT Bloom filter with excessive key count (15.0M @ 15bpk), causing estimated 1.8x higher filter FP rate. Consider using new Bloom with format_version>=5, smaller SST file size, or partitioned filters.
Predicted FP rate %: 0.766702
Average FP rate %: 0.66846
Example without warning (150K keys):
$ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=150000 2>&1 | grep 'FP rate'
Predicted FP rate %: 0.422857
Average FP rate %: 0.379301
$
With more samples at 15 bits/key:
150K keys -> no warning; actual: 0.379% FP rate (baseline)
1M keys -> no warning; actual: 0.396% FP rate, 1.045x
9M keys -> no warning; actual: 0.563% FP rate, 1.485x
10M keys -> warning (1.5x); actual: 0.564% FP rate, 1.488x
15M keys -> warning (1.8x); actual: 0.668% FP rate, 1.76x
25M keys -> warning (2.4x); actual: 0.880% FP rate, 2.32x
At 10 bits/key:
150K keys -> no warning; actual: 1.17% FP rate (baseline)
1M keys -> no warning; actual: 1.16% FP rate
10M keys -> no warning; actual: 1.32% FP rate, 1.13x
25M keys -> no warning; actual: 1.63% FP rate, 1.39x
35M keys -> warning (1.6x); actual: 1.81% FP rate, 1.55x
At 5 bits/key:
150K keys -> no warning; actual: 9.32% FP rate (baseline)
25M keys -> no warning; actual: 9.62% FP rate, 1.03x
200M keys -> no warning; actual: 12.2% FP rate, 1.31x
250M keys -> warning (1.5x); actual: 12.8% FP rate, 1.37x
300M keys -> warning (1.6x); actual: 13.4% FP rate, 1.43x
The reason for the modest inaccuracy at low bits/key is that the assumption of independence between a collision between 32-hash values feeding the filter and an FP in the filter is not quite true for implementations using "simple" logic to compute indices from the stock hash result. There's math on this in my dissertation, but I don't think it's worth the effort just for these extreme cases (> 100 million keys and low-ish bits/key).
Differential Revision: D19471715
Pulled By: pdillinger
fbshipit-source-id: f80c96893a09bf1152630ff0b964e5cdd7e35c68
Summary:
Help users that would benefit most from new Bloom filter
implementation by logging a warning that recommends the using
format_version >= 5.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6312
Test Plan:
$ (for BPK in 10 13 14 19 20 50; do ./filter_bench -quick -impl=0 -bits_per_key=$BPK -m_queries=1 2>&1; done) | grep 'its/key'
Bits/key actual: 10.0647
Bits/key actual: 13.0593
[WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (14) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5.
Bits/key actual: 14.0581
[WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (19) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5.
Bits/key actual: 19.0542
[WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5.
Bits/key actual: 20.0584
[WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (50) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5.
Bits/key actual: 50.0577
Differential Revision: D19457191
Pulled By: pdillinger
fbshipit-source-id: 073d94cde5c70e03a160f953e1100c15ea83eda4
Summary:
Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305
Test Plan: A problematic DB used to return NotFound incorrectly, and now able to return correct result. Will try to construct a unit test too.0
Differential Revision: D19438925
fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217
Summary:
Recent fix to Prefix Hash https://github.com/facebook/rocksdb/pull/6292 caused a bug that the newly created NotFound status in hash index is never reset. This causes reseek or implict reseek to return wrong results sometimes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302
Test Plan:
Add a unit test that would fail. Not fix.
crash test with hash test would fail in several seconds. With the fix, it will run about several minutes before failing with another failure.
Differential Revision: D19424572
fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566
Summary:
When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target.
Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297
Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix.
Differential Revision: D19404695
fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3
Summary:
kHashSearch index type is incompatible with index_block_restart_interval larger than 1. The patch asserts that and also resets index_block_restart_interval value if it is incompatible with kHashSearch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6294
Differential Revision: D19394229
Pulled By: maysamyabandeh
fbshipit-source-id: 8a12712ab25e81094a7f71ecd43f773dd4fb6acd
Summary:
Reword the error message when keys are not added in strict ascending order.
Specifically, original error message is not clear when application tries to
call SstFileWriter::Merge() with duplicate keys.
Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6261
Differential Revision: D19290398
Pulled By: riversand963
fbshipit-source-id: 4dc30a701414e6894db2eb024e3734470c22b371
Summary:
Right now BlockBasedTable::ApproximateSize() uses default setting about whether to use total order seek. There is no reason for that. There is no reason to do any filtering for approximate size boundary key, and it may introduce bugs. Disable it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6222
Test Plan: Run existing tests
Differential Revision: D19184787
fbshipit-source-id: 64180660bd2800914fff75104172b61c06f0b1c9
Summary:
This reverts commit 54f9092b0c.
It making our daily stress tests fail. Revert it until the issues are fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6220
Differential Revision: D19179881
Pulled By: maysamyabandeh
fbshipit-source-id: 99de0eaf776567fa81110b9ad2608234a16083ce
Summary:
Beside extending index_type to kHashSearch, it clarifies in the code base that this feature is incompatible with index_block_restart_interval > 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6210
Test Plan:
```
make -j32 crash_test
Differential Revision: D19166567
Pulled By: maysamyabandeh
fbshipit-source-id: 3aaf75a70a8b462d372d43aac69dbd10df303ec7
Summary:
Right now, sometimes file prefetching is still on when mmap is enabled. This causes bug of reading wrong data. In this commit, we remove all those possible paths.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6206
Test Plan: make crash_test with compaction_readahead_size, which used to fail. RUn all existing tests.
Differential Revision: D19149429
fbshipit-source-id: 9e18ea8c566e416aac9647bdd05afe596634791b
Summary:
buf_offset does not need to get the value from req.len for othe final block. It can cause test fail for clan_analyze. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6204
Test Plan: pass make asan_check
Differential Revision: D19145335
Pulled By: zhichao-cao
fbshipit-source-id: 8f6e74565746381b5c5ef598b97d746517b36e5b
Summary:
In the current MultiGet, if the KV-pairs do not belong to the data blocks in the block cache, multiple blocks are read from a SST. It will trigger one block read for each block request and read them in parallel. In some cases, if some data blocks are adjacent in the SST, the reads for these blocks can be combined to a single large read, which can reduce the system calls and reduce the read latency if possible.
Considering to fill the block cache, if multiple data blocks are in the same memory buffer, we need to copy them to the heap separately. Therefore, only in the case that 1) data block compression is enabled, and 2) compressed block cache is null, we can do combined read. Otherwise, extra memory copy is needed, which may cause extra overhead. In the current case, data blocks will be uncompressed to a new memory space.
Also, in the case that 1) data block compression is enabled, and 2) compressed block cache is null, it is possible the data block is actually not compressed. In the current logic, these data blocks will not be added to the uncompressed_cache. So if memory buffer is shared and the data block is not compressed, the data block are copied to the head and fill the cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6089
Test Plan: Added test case to ParallelIO.MultiGet. Pass make asan_check
Differential Revision: D18734668
Pulled By: zhichao-cao
fbshipit-source-id: 67c5615ed373e51e42635fd74b36f8f3a66d5da4
Summary:
The filter bits builder collects all the hashes to add in memory before adding them (because the number of keys is not known until we've walked over all the keys). Existing code uses a std::vector for this, which can mean up to 2x than necessary space allocated (and not freed) and up to ~2x write amplification in memory. Using std::deque uses close to minimal space (for large filters, the only time it matters), no write amplification, frees memory while building, and no need for large contiguous memory area. The only cost is more calls to allocator, which does not appear to matter, at least in benchmark test.
For now, this change only applies to the new (format_version=5) Bloom filter implementation, to ease before-and-after comparison downstream.
Temporary memory use during build is about the only way the new Bloom filter could regress vs. the old (because of upgrade to 64-bit hash) and that should only matter for full filters. This change should largely mitigate that potential regression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6175
Test Plan:
Using filter_bench with -new_builder option and 6M keys per filter is like large full filter (improvement). 10k keys and no -new_builder is like partitioned filters (about the same). (Corresponding configurations run simultaneously on devserver.)
std::vector impl (before)
$ /usr/bin/time -v ./filter_bench -impl=2 -quick -new_builder -working_mem_size_mb=1000 -
average_keys_per_filter=6000000
Build avg ns/key: 52.2027
Maximum resident set size (kbytes): 1105016
$ /usr/bin/time -v ./filter_bench -impl=2 -quick -working_mem_size_mb=1000 -
average_keys_per_filter=10000
Build avg ns/key: 30.5694
Maximum resident set size (kbytes): 1208152
std::deque impl (after)
$ /usr/bin/time -v ./filter_bench -impl=2 -quick -new_builder -working_mem_size_mb=1000 -
average_keys_per_filter=6000000
Build avg ns/key: 39.0697
Maximum resident set size (kbytes): 1087196
$ /usr/bin/time -v ./filter_bench -impl=2 -quick -working_mem_size_mb=1000 -
average_keys_per_filter=10000
Build avg ns/key: 30.9348
Maximum resident set size (kbytes): 1207980
Differential Revision: D19053431
Pulled By: pdillinger
fbshipit-source-id: 2888e748723a19d9ea40403934f13cbb8483430c
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
Formatter somehow complains some recent lines changed. Apply them to make the formatter happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6138
Test Plan: See CI passes.
Differential Revision: D18895950
fbshipit-source-id: 7d1696cf3e3a682bc10a30cdca748a23c6565255
Summary:
Add overrides needed in FilterPolicy wrapper to fix
rocksdb_filterpolicy_create_bloom_full (see issue https://github.com/facebook/rocksdb/issues/6129). Re-enabled
assertion in BloomFilterPolicy::CreateFilter that was being violated.
Expanded c_test to identify Bloom filter implementations by FP counts.
(Without the fix, updated test will trigger assertion and fail otherwise
without the assertion.)
Fixes https://github.com/facebook/rocksdb/issues/6129
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6132
Test Plan: updated c_test, also run under valgrind.
Differential Revision: D18864911
Pulled By: pdillinger
fbshipit-source-id: 08e81d7b5368b08e501cd402ef5583f2650c19fa
Summary:
BlockBasedTableBuilder uses ExtractUserKey in EnterUnbuffered. This would
cause index filter building error, since user-provided timestamp is supported
by ExtractUserKeyAndStripTimestamp, and it's used in Add. This commit changes
ExtractUserKey to ExtractUserKeyAndStripTimestamp.
A test case is also added by modifying DBBasicTestWithTimestampWithParam_
PutAndGet test in db_basic_test to cover ExtractUserKeyAndStripTimestamp usage
in both kBuffered and kUnbuffered state of BlockBasedTableBuilder.
Before the ExtractUserKeyAndStripTimstamp fix:
```
$ ./db_basic_test --gtest_filter="*PutAndGet*"
Note: Google Test filter = *PutAndGet*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam
[ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
[ FAILED ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false (1177 ms)
[ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1
[ OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1056 ms)
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2233 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (2233 ms total)
[ PASSED ] 1 test.
[ FAILED ] 1 test, listed below:
[ FAILED ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false
1 FAILED TEST
```
After the ExtractUserKeyAndStripTimstamp fix:
```
$ ./db_basic_test --gtest_filter="*PutAndGet*"
Note: Google Test filter = *PutAndGet*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam
[ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0
[ OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0 (1417 ms)
[ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1
[ OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1041 ms)
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2458 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (2458 ms total)
[ PASSED ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6100
Differential Revision: D18769654
Pulled By: riversand963
fbshipit-source-id: 76c2cf2c9a5e0d85db95d98e812e6af0c2a15c6b
Summary:
A longstanding bug in our C interface can trigger this
assertion; see issue https://github.com/facebook/rocksdb/issues/6129. Disabling the assertion for now
(for 6.6.0) and will re-enable on fix of that bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6128
Differential Revision: D18854899
Pulled By: pdillinger
fbshipit-source-id: 9eb5294b9f11b208dc1a8cc148aaa31e47ff892b
Summary:
This change enables custom implementations of FilterPolicy to
wrap a variety of NewBloomFilterPolicy and select among them based on
contextual information such as table level and compaction style.
* Moves FilterBuildingContext to public API and elaborates it with more
useful data. (It would be nice to put more general options-like data,
but at the time this object is constructed, we are using internal APIs
ImmutableCFOptions and MutableCFOptions and don't have easy access to
ColumnFamilyOptions that I can tell.)
* Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to
GetBuilderWithContext, because it's now public.
* Plumbs through the table's "level_at_creation" for filter building
context.
* Simplified some tests by adding GetBuilder() to
MockBlockBasedTableTester.
* Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including
sample wrapper class LevelAndStyleCustomFilterPolicy.
* Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits
where it does not reset perf context.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088
Test Plan: make check, valgrind on db_bloom_filter_test
Differential Revision: D18697817
Pulled By: pdillinger
fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4
Summary:
There's no technological impediment to allowing the Bloom
filter bits/key to be non-integer (fractional/decimal) values, and it
provides finer control over the memory vs. accuracy trade-off. This is
especially handy in using the format_version=5 Bloom filter in place
of the old one, because bits_per_key=9.55 provides the same accuracy as
the old bits_per_key=10.
This change not only requires refining the logic for choosing the best
num_probes for a given bits/key setting, it revealed a flaw in that logic.
As bits/key gets higher, the best num_probes for a cache-local Bloom
filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a
standard Bloom filter. For example, at 16 bits per key, the best
num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%).
This change fixes and refines that logic (for the format_version=5
Bloom filter only, just in case) based on empirical tests to find
accuracy inflection points between each num_probes.
Although bits_per_key is now specified as a double, the new Bloom
filter converts/rounds this to "millibits / key" for predictable/precise
internal computations. Just in case of unforeseen compatibility
issues, we round to the nearest whole number bits / key for the
legacy Bloom filter, so as not to unlock new behaviors for it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092
Test Plan: unit tests included
Differential Revision: D18711313
Pulled By: pdillinger
fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.
Speed
The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
Single filter net ns/op: 26.2825
Random filter net ns/op: 150.459
Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
Single filter net ns/op: 63.2978
Random filter net ns/op: 188.038
Average FP rate %: 1.13823
Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.
The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
Accuracy
The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.
Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120
Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
Compatibility
Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007
Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).
Differential Revision: D18294749
Pulled By: pdillinger
fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
Summary:
Bug in PR https://github.com/facebook/rocksdb/issues/5941 when char is unsigned that should only affect
assertion on unused/invalid filter metadata.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6024
Test Plan: on ARM: ./bloom_test && ./db_bloom_filter_test && ./block_based_filter_block_test && ./full_filter_block_test && ./partitioned_filter_block_test
Differential Revision: D18461206
Pulled By: pdillinger
fbshipit-source-id: 68a7c813a0b5791c05265edc03cdf52c78880e9a
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.
As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816
Test Plan:
make check
make asan_check
asan_crash_test
Differential Revision: D18408676
Pulled By: anand1976
fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
Summary:
The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014
Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after.
Differential Revision: D18412753
Pulled By: anand1976
fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72
Summary:
This PR fixes https://github.com/facebook/rocksdb/issues/5975. In ```BlockBasedTable::RetrieveMultipleBlocks()```, we were calling ```MaybeReadBlocksAndLoadToCache()```, which is a no-op if neither uncompressed nor compressed block cache are configured.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5991
Test Plan:
1. Add unit tests that fail with the old code and pass with the new
2. make check and asan_check
Cc spetrunia
Differential Revision: D18272744
Pulled By: anand1976
fbshipit-source-id: e62fa6090d1a6adf84fcd51dfd6859b03c6aebfe
Summary:
According to
https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format,
the block read by BlockBasedTable::ReadMetaBlock is actually the meta index
block. Therefore, it is better to rename the function to ReadMetaIndexBlock.
This PR also applies some format change to existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6009
Test Plan: make check
Differential Revision: D18333238
Pulled By: riversand963
fbshipit-source-id: 2c4340a29b3edba53d19c132cbfd04caf6242aed
Summary:
For upcoming new SST filter implementations, we will use a new
64-bit hash function (XXH3 preview, slightly modified). This change
updates hash.{h,cc} for that change, adds unit tests, and out-of-lines
the implementations to keep hash.h as clean/small as possible.
In developing the unit tests, I discovered that the XXH3 preview always
returns zero for the empty string. Zero is problematic for some
algorithms (including an upcoming SST filter implementation) if it
occurs more often than at the "natural" rate, so it should not be
returned from trivial values using trivial seeds. I modified our fork
of XXH3 to return a modest hash of the seed for the empty string.
With hash function details out-of-lines in hash.h, it makes sense to
enable XXH_INLINE_ALL, so that direct calls to XXH64/XXH32/XXH3p
are inlined. To fix array-bounds warnings on some inline calls, I
injected some casts to uintptr_t in xxhash.cc. (Issue reported to Yann.)
Revised: Reverted using XXH_INLINE_ALL for now. Some Facebook
checks are unhappy about #include on xxhash.cc file. I would
fix that by rename to xxhash_cc.h, but to best preserve history I want
to do that in a separate commit (PR) from the uintptr casts.
Also updated filter_bench for this change, improving the performance
predictability of dry run hashing and adding support for 64-bit hash
(for upcoming new SST filter implementations, minor dead code in the
tool for now).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5984
Differential Revision: D18246567
Pulled By: pdillinger
fbshipit-source-id: 6162fbf6381d63c8cc611dd7ec70e1ddc883fbb8
Summary:
This change sets up for alternate implementations underlying
BloomFilterPolicy:
* Refactor BloomFilterPolicy and expose in internal .h file so that it's easy to iterate over / select implementations for testing, regardless of what the best public interface will look like. Most notably updated db_bloom_filter_test to use this.
* Hide FullFilterBitsBuilder from unit tests (alternate derived classes planned); expose the part important for testing (CalculateSpace), as abstract class BuiltinFilterBitsBuilder. (Also cleaned up internally exposed interface to CalculateSpace.)
* Rename BloomTest -> BlockBasedBloomTest for clarity (despite ongoing confusion between block-based table and block-based filter)
* Assert that block-based filter construction interface is only used on BloomFilterPolicy appropriately constructed. (A couple of tests updated to add ", true".)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5967
Test Plan: make check
Differential Revision: D18138704
Pulled By: pdillinger
fbshipit-source-id: 55ef9273423b0696309e251f50b8c1b5e9ec7597
Summary:
Some filtering tests were unfriendly to new implementations of
FilterBitsBuilder because of dynamic_cast to FullFilterBitsBuilder. Most
of those have now been cleaned up, worked around, or at least changed
from crash on dynamic_cast failure to individual test failure.
Also put some clarifying comments on filter-related APIs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5960
Test Plan: make check
Differential Revision: D18121223
Pulled By: pdillinger
fbshipit-source-id: e83827d9d5d96315d96f8e25a99cd70f497d802c
Summary:
- Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09).
- Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions.
- Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range.
- Use preview version of XXH3 instead of MurmurHash64A for NPHash64
-- Had to update cache_test to increase probability of passing for any given hash function.
- Use fastrange64 instead of % with uses of NPHash64
-- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision.
- Set default seed for NPHash64 because specifying a seed rarely makes sense for it.
- Removed unnecessary include xxhash.h in a popular .h file
- Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated.
Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909
Differential Revision: D18125196
Pulled By: pdillinger
fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d
Summary:
The parts that are used to implement FilterPolicy /
NewBloomFilterPolicy and not used other than for the block-based table
should be consolidated under table/block_based/filter_policy*.
This change is step 2 of 2:
mv util/bloom.cc table/block_based/filter_policy.cc
This gets its own PR so that git has the best chance of following the
rename for blame purposes. Note that low-level shared implementation
details of Bloom filters remain in util/bloom_impl.h, and
util/bloom_test.cc remains where it is for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5966
Test Plan: make check
Differential Revision: D18124930
Pulled By: pdillinger
fbshipit-source-id: 823bc09025b3395f092ef46a46aa5ba92a914d84
Summary:
The parts that are used to implement FilterPolicy /
NewBloomFilterPolicy and not used other than for the block-based table
should be consolidated under table/block_based/filter_policy*. I don't
foresee sharing these APIs with e.g. the Plain Table because they don't
expose hashes for reuse in indexing.
This change is step 1 of 2:
(a) mv table/full_filter_bits_builder.h to
table/block_based/filter_policy_internal.h which I expect to expand
soon to internally reveal more implementation details for testing.
(b) consolidate eventual contents of table/block_based/filter_policy.cc
in util/bloom.cc, which has the most elaborate revision history
(see step 2 ...)
Step 2 soon to follow:
mv util/bloom.cc table/block_based/filter_policy.cc
This gets its own PR so that git has the best chance of following the
rename for blame purposes. Note that low-level shared implementation
details of Bloom filters are in util/bloom_impl.h.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5963
Test Plan: make check
Differential Revision: D18121199
Pulled By: pdillinger
fbshipit-source-id: 8f21732c3d8909777e3240e4ac3123d73140326a
Summary:
Several error paths in opening of a plain table would leak memory. PR https://github.com/facebook/rocksdb/issues/5940 opened the leak to one more error path, which happens to have been (mistakenly) exercised by CuckooTableDBTest.AdaptiveTable. That test has been fixed, and the exercising of
plain table error cases (more than before) has been added as BadOptions1 and BadOptions2
to PlainTableDBTest. This effectively moved the memory leak to plain_table_db_test.
Also here is a cheap fix for the memory leak, without (yet?) changing the signature of
ReadTableProperties. This fixes ASAN on unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5951
Test Plan: make COMPILE_WITH_ASAN=1 check
Differential Revision: D18051940
Pulled By: pdillinger
fbshipit-source-id: e2952930c09a2b46c4f1ff09818c5090426929de
Summary:
A recent change introduced readahead inside VerifyChecksum(). However it is not compatible with mmap mode and generated wrong checksum verification failure. Fix it by not enabling readahead in mmap
mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5945
Test Plan: Add a unit test that used to fail.
Differential Revision: D18021443
fbshipit-source-id: 6f2eb600f81b26edb02222563a4006869d576bff
Summary:
Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that
only the filter block contents are stored in the block cache (as opposed to the
earlier design where the cache stored the filter block reader itself, leading to
potentially dangling pointers and concurrency bugs). However, this change
introduced a performance hit since with the new code, the metadata fields are
re-parsed upon every access. This patch reunites the block contents with the
filter bits reader to eliminate this overhead; since this is still a self-contained
pure data object, it is safe to store it in the cache. (Note: this is similar to how
the zstd digest is handled.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936
Test Plan:
make asan_check
filter_bench results for the old code:
```
$ ./filter_bench -quick
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 26.7153
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
Dry run (46b) ns/op: 33.4258
Single filter ns/op: 42.5974
Random filter ns/op: 217.861
----------------------------
Outside queries...
Dry run (25d) ns/op: 32.4217
Single filter ns/op: 50.9855
Random filter ns/op: 219.167
Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -quick -use_full_block_reader
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 26.5172
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
Dry run (46b) ns/op: 32.3556
Single filter ns/op: 83.2239
Random filter ns/op: 370.676
----------------------------
Outside queries...
Dry run (25d) ns/op: 32.2265
Single filter ns/op: 93.5651
Random filter ns/op: 408.393
Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)
```
With the new code:
```
$ ./filter_bench -quick
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 25.4285
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
Dry run (46b) ns/op: 31.0594
Single filter ns/op: 43.8974
Random filter ns/op: 226.075
----------------------------
Outside queries...
Dry run (25d) ns/op: 31.0295
Single filter ns/op: 50.3824
Random filter ns/op: 226.805
Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -quick -use_full_block_reader
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 26.5308
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
Dry run (46b) ns/op: 33.2968
Single filter ns/op: 58.6163
Random filter ns/op: 291.434
----------------------------
Outside queries...
Dry run (25d) ns/op: 32.1839
Single filter ns/op: 66.9039
Random filter ns/op: 292.828
Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)
```
Differential Revision: D17991712
Pulled By: ltamasi
fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29
Summary:
Plain table SSTs could crash sst_dump because of a bug in
PlainTableReader that can leave table_properties_ as null. Even if it
was intended not to keep the table properties in some cases, they were
leaked on the offending code path.
Steps to reproduce:
$ db_bench --benchmarks=fillrandom --num=2000000 --use_plain_table --prefix-size=12
$ sst_dump --file=0000xx.sst --show_properties
from [] to []
Process /dev/shm/dbbench/000014.sst
Sst file format: plain table
Raw user collected properties
------------------------------
Segmentation fault (core dumped)
Also added missing unit testing of plain table full_scan_mode, and
an assertion in NewIterator to check for regression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5940
Test Plan: new unit test, manual, make check
Differential Revision: D18018145
Pulled By: pdillinger
fbshipit-source-id: 4310c755e824c4cd6f3f86a3abc20dfa417c5e07
Summary:
Partition Filters make use of a top-level index to find the partition that might have the bloom hash of the key. The index is with internal key format (before format version 3). Each partition contains the i) blooms of the keys in that range ii) bloom of prefixes of keys in that range, iii) the bloom of the prefix of the last key in the previous partition.
When ::SeekForPrev(key), we first perform a prefix bloom test on the SST file. The partition however is identified using the full internal key, rather than the prefix key. The reason is to be compatible with the internal key format of the top-level index. This creates a corner case. Example:
- SST k, Partition N: P1K1, P1K2
- SST k, top-level index: P1K2
- SST k+1, Partition 1: P2K1, P3K1
- SST k+1 top-level index: P3K1
When SeekForPrev(P1K3), it should point us to P1K2. However SST k top-level index would reject P1K3 since it is out of range.
One possible fix would be to search with the prefix P1 (instead of full internal key P1K3) however the details of properly comparing prefix with full internal key might get complicated. The fix we apply in this PR is to look into the last partition anyway even if the key is out of range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5907
Differential Revision: D17889918
Pulled By: maysamyabandeh
fbshipit-source-id: 169fd7b3c71dbc08808eae5a8340611ebe5bdc1e
Summary:
Since we do not evict a file's blocks from block cache before that file
is deleted, we require a file's cache ID prefix is both unique and
non-reusable. However, the Windows functionality we were relying on only
guaranteed uniqueness. That meant a newly created file could be assigned
the same cache ID prefix as a deleted file. If the newly created file
had block offsets matching the deleted file, full cache keys could be
exactly the same, resulting in obsolete data blocks returned from cache
when trying to read from the new file.
We noticed this when running on FAT32 where compaction was writing out
of order keys due to reading obsolete blocks from its input files. The
functionality is documented as behaving the same on NTFS, although I
wasn't able to repro it there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844
Test Plan:
we had a reliable repro of out-of-order keys on FAT32 that
was fixed by this change
Differential Revision: D17752442
fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00
Summary:
RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818
Test Plan:
Existing tests
Performance Test:
Ran the below command which fills up the memtable and makes sure there are no flushes and then call multiget. Ran it on master and on the new change and see atleast 1% performance improvement across all the test runs I did. Sometimes the improvement was upto 5%.
TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10
Differential Revision: D17578869
Pulled By: vjnadimpalli
fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192
Summary:
Example: using the tool before and after PR https://github.com/facebook/rocksdb/issues/5784 shows that
the refactoring, presumed performance-neutral, actually sped up SST
filters by about 3% to 8% (repeatable result):
Before:
- Dry run ns/op: 22.4725
- Single filter ns/op: 51.1078
- Random filter ns/op: 120.133
After:
+ Dry run ns/op: 22.2301
+ Single filter run ns/op: 47.4313
+ Random filter ns/op: 115.9
Only tests filters for the block-based table (full filters and
partitioned filters - same implementation; not block-based filters),
which seems to be the recommended format/implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5825
Differential Revision: D17804987
Pulled By: pdillinger
fbshipit-source-id: 0f18a9c254c57f7866030d03e7fa4ba503bac3c5