rocksdb

Author	SHA1	Message	Date
Zhongyi Xie	eea1cad850	avoid updating index type during iterator creation (#5288 ) Summary: Right now there is a potential race condition where two threads are created to iterate through the DB (https://gist.github.com/miasantreble/88f5798a397ee7cb8e7baff9db2d9e85). The problem is that in `BlockBasedTable::NewIndexIterator`, if both threads failed to find index_reader from block cache, they will call `CreateIndexReader->UpdateIndexType()` which creates a race to update `index_type` in the shared rep_ object. By checking the code, we realize the index type is always populated by `PrefetchIndexAndFilterBlocks` during the table `Open` call, so there is no need to update index type every time during iterator creation. This PR attempts to fix the race condition by removing the unnecessary call to `UpdateIndexType` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5288 Differential Revision: D15252509 Pulled By: miasantreble fbshipit-source-id: 6e3258652121d5c76d267f7ac457e15c5e84756e	2019-05-07 20:20:40 -07:00
Siying Dong	4479dff208	Reduce binary search when reseek into the same data block (#5256 ) Summary: Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if they end up with the same file/block. Improve it as following: 1. in LevelIterator, reseek first try to check the boundary of the current file. If it falls into the same file, skip the binary search to find the file 2. in block based table iterator, reseek skip to reseek the iterator block if the seek key is larger than the current key and lower than the index key (boundary of the current block and the next block). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256 Differential Revision: D15105072 Pulled By: siying fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80	2019-05-01 14:26:30 -07:00
Andrew Kryczka	b02d0c238d	Init compression dict handle before reading meta-blocks (#5267 ) Summary: At least one of the meta-block loading functions (`ReadRangeDelBlock`) uses the same block reading function (`NewDataBlockIterator`) as data block reads, which means it uses the dictionary handle. However, the dictionary handle was uninitialized while reading meta-blocks, causing readers to receive an error. This situation was only noticed when `cache_index_and_filter_blocks=true`. This PR initializes the handle to null while reading meta-blocks to prevent the error. It also adds support to `db_stress` / `db_crashtest.py` for `cache_index_and_filter_blocks`. Fixes #5263. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5267 Differential Revision: D15149264 Pulled By: maysamyabandeh fbshipit-source-id: 991d38a306c62db5976778bfb050fa3cd4a0671b	2019-04-30 09:50:49 -07:00
Sagar Vemuri	3548e4220d	Improve explicit user readahead performance (#5246 ) Summary: Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`. 1. Stop creating new table readers when the user explicitly sets readahead size. 2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases). 3. Add `readahead_size` to db_bench. Benchmarks: https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737 For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%. For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%. Test Plan: Updated `DBIteratorTest.ReadAhead` test to make sure that: - no new table readers are created for iterators on setting ReadOptions.readahead_size - At least "readahead" number of bytes are actually getting read on each iterator read. TODO later: - Use similar logic for compactions as well. - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246 Differential Revision: D15107946 Pulled By: sagar0 fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea	2019-04-26 21:24:10 -07:00
Yuchi Chen	78a6e07c83	Fix compilation errors for 32bits/LITE/ios build. (#5220 ) Summary: When I build RocksDB for 32bits/LITE/iOS environment, some errors like the following. ` table/block_based_table_reader.cc:971:44: error: implicit conversion loses integer precision: 'uint64_t' (aka 'unsigned long long') to 'size_t' (aka 'unsigned long') [-Werror,-Wshorten-64-to-32] size_t block_size = props_block_handle.size(); ~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~^~~~~~ ./util/file_reader_writer.h:177:8: error: private field 'env_' is not used [-Werror,-Wunused-private-field] Env* env_; ^ ` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5220 Differential Revision: D15023481 Pulled By: siying fbshipit-source-id: 1b5d121d3016f2b0a8a9a2cc1bd638479357f9f7	2019-04-22 16:02:16 -07:00
Siying Dong	992dfc7811	Introduce InternalIteratorBase::NextAndGetResult() (#5197 ) Summary: In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU. Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197 Differential Revision: D14945977 Pulled By: siying fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88	2019-04-18 11:12:39 -07:00
yiwu-arbug	f1239d5f10	Avoid per-key upper bound check in BlockBasedTableIterator (#5142 ) Summary: This is second attempt for #5101. Original commit message: `BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons. This patch come with two fixes: Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter. Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142 Differential Revision: D14907113 Pulled By: siying fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb	2019-04-16 11:37:47 -07:00
anand76	fefd4b98c5	Introduce a new MultiGet batching implementation (#5011 ) Summary: This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching. Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to - 1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch() 2. Bloom filter cachelines can be prefetched, hiding the cache miss latency The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress. Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32). Batch Sizes 1 \| 2 \| 4 \| 8 \| 16 \| 32 Random pattern (Stride length 0) 4.158 \| 4.109 \| 4.026 \| 4.05 \| 4.1 \| 4.074 - Get 4.438 \| 4.302 \| 4.165 \| 4.122 \| 4.096 \| 4.075 - MultiGet (no batching) 4.461 \| 4.256 \| 4.277 \| 4.11 \| 4.182 \| 4.14 - MultiGet (w/ batching) Good locality (Stride length 16) 4.048 \| 3.659 \| 3.248 \| 2.99 \| 2.84 \| 2.753 4.429 \| 3.728 \| 3.406 \| 3.053 \| 2.911 \| 2.781 4.452 \| 3.45 \| 2.833 \| 2.451 \| 2.233 \| 2.135 Good locality (Stride length 256) 4.066 \| 3.786 \| 3.581 \| 3.447 \| 3.415 \| 3.232 4.406 \| 4.005 \| 3.644 \| 3.49 \| 3.381 \| 3.268 4.393 \| 3.649 \| 3.186 \| 2.882 \| 2.676 \| 2.62 Medium locality (Stride length 4096) 4.012 \| 3.922 \| 3.768 \| 3.61 \| 3.582 \| 3.555 4.364 \| 4.057 \| 3.791 \| 3.65 \| 3.57 \| 3.465 4.479 \| 3.758 \| 3.316 \| 3.077 \| 2.959 \| 2.891 dbbench command used (on a DB with 4 levels, 12 million keys)- TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011 Differential Revision: D14348703 Pulled By: anand1976 fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b	2019-04-11 14:28:26 -07:00
Levi Tamasi	59ef2ba559	Evict the uncompression dictionary from the block cache upon table close (#5150 ) Summary: The uncompression dictionary object has a Statistics pointer that might dangle if the database closed. This patch evicts the dictionary from the block cache when a table is closed, similarly to how index and filter readers are handled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5150 Differential Revision: D14782422 Pulled By: ltamasi fbshipit-source-id: 0cec9336c742c479aa92206e04521767f1aa9622	2019-04-04 16:21:12 -07:00
Siying Dong	ebcc8ae1d3	Revert "Avoid per-key upper bound check in BlockBasedTableIterator (#5101 )" (#5132 ) Summary: This reverts commit `f29dc1b906`. In BlockBasedTableIterator, index_iter_->key() is sometimes a user key, so it is wrong to call ExtractUserKey() against it. This is a bug introduced by #5101. Temporarily revert the diff to keep the branch clean. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5132 Differential Revision: D14718584 Pulled By: siying fbshipit-source-id: 0ac55dc9b5dbc18c7809092146bdf7eb9364b9ad	2019-04-02 10:00:38 -07:00
Yi Wu	f29dc1b906	Avoid per-key upper bound check in BlockBasedTableIterator (#5101 ) Summary: `BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5101 Differential Revision: D14678707 Pulled By: siying fbshipit-source-id: 2372446116753c7892ea4cec7b4b49ef87ba463e	2019-03-29 13:11:46 -07:00
Siying Dong	89ab1381f8	Apply automatic formatting to some files (#5114 ) Summary: Following files were run through automatic formatter: db/db_impl.cc db/db_impl.h db/db_impl_compaction_flush.cc db/db_impl_debug.cc db/db_impl_files.cc db/db_impl_readonly.h db/db_impl_write.cc db/dbformat.cc db/dbformat.h table/block.cc table/block.h table/block_based_filter_block.cc table/block_based_filter_block.h table/block_based_filter_block_test.cc table/block_based_table_builder.cc table/block_based_table_reader.cc table/block_based_table_reader.h table/block_builder.cc table/block_builder.h table/block_fetcher.cc table/block_prefix_index.cc table/block_prefix_index.h table/block_test.cc table/format.cc table/format.h I could easily run all the files, but I don't want people to feel that I'm doing it for lines of code changes :) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114 Differential Revision: D14633040 Pulled By: siying fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52	2019-03-27 16:24:45 -07:00
Yi Wu	d69241586e	Fix perf_context.user_key_comparison_count for range scan (#5098 ) Summary: Currently `perf_context.user_key_comparison_count` is bump only in `InternalKeyComparator`. For places user comparator is used directly the counter is not bump. Fixing the majority of it. Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098 Differential Revision: D14603753 Pulled By: siying fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae	2019-03-27 10:34:27 -07:00
Siying Dong	2b4d5ceb47	Remove some "using std::..." from header files. (#5113 ) Summary: The code convention we are following, Google C++ Style, discourage alias in header files, especially public headers: https://google.github.io/styleguide/cppguide.html#Aliases Remove some of them. Might removed some from .cc files as well to be consistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113 Differential Revision: D14633030 Pulled By: siying fbshipit-source-id: b990edc919d5de60295992284f980195e501d424	2019-03-27 10:28:21 -07:00
Yi Wu	75133b1b6b	Fix SstFileReader not able to open ingested file (#5097 ) Summary: Since `SstFileReader` don't know largest seqno of a file, it will fail this check when it open a file with global seqno: `ca89ac2ba9/table/block_based_table_reader.cc (L730)` Changes: * Pass largest_seqno=kMaxSequenceNumber from `SstFileReader` and allow it to bypass the above check. * `BlockBasedTable::VerifyChecksum` also double check if checksum will match when excluding global seqno (this is to make the new test in sst_table_reader_test pass). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5097 Differential Revision: D14607434 Pulled By: riversand963 fbshipit-source-id: 9008599227c5fccbf9b73fee46b3bf4a1523f023	2019-03-26 10:25:18 -07:00
Yi Wu	7ca9eb7542	Fix BlockBasedTableIterator construction missing index_key_is_full parameter Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5104 Differential Revision: D14619000 Pulled By: maysamyabandeh fbshipit-source-id: c2895794a3f31b826c149dcb698c1952dacc2332	2019-03-26 10:13:01 -07:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Yanqin Jin	2d049ab7e8	Checksum properties block for block-based table (#4956 ) Summary: Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956 Differential Revision: D14012741 Pulled By: riversand963 fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186	2019-02-11 11:50:01 -08:00
Andrew Kryczka	8ec3e72551	Cache dictionary used for decompressing data blocks (#4881 ) Summary: - If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`. - If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle. - New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count" Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881 Differential Revision: D13663801 Pulled By: ajkr fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f	2019-01-23 18:15:47 -08:00
Andrew Kryczka	27054d837b	Call NewDataBlockIterator with correct arguments in DB::Get (#4913 ) Summary: The pointer `get_context` was passed as the value for the boolean argument `index_key_is_full`. Luckily the pointer was always non-null so evaluated to true which is the correct value for the boolean argument. But we were missing out on batch updates to stats since we were not passing anything for the `GetContext*` argument and it defaults to `nullptr`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4913 Differential Revision: D13791449 Pulled By: ajkr fbshipit-source-id: dbe40bf406c64d34cb5298604145d18b9e0ca9be	2019-01-23 15:39:05 -08:00
Andrew Kryczka	01013ae766	Digest ZSTD compression dictionary once when writing SST file (#4849 ) Summary: This is essentially a re-submission of #4251 with a few improvements: - Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict` - Eliminated `Init` functions. Instead do all initialization work in constructors. - Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849 Differential Revision: D13606039 Pulled By: ajkr fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447	2019-01-18 19:12:57 -08:00
Alexander Zinoviev	80bf8975fd	Add a new per level counter for block cache hit (#4796 ) Summary: Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796 Differential Revision: D13513688 Pulled By: zinoale fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281	2018-12-21 13:20:05 -08:00
Andrew Kryczka	e0be1bc4f1	fix DeleteRange memory leak for mmap and block cache (#4810 ) Summary: Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`. Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810 Differential Revision: D13534296 Pulled By: ajkr fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599	2018-12-20 21:59:49 -08:00
Abhishek Madan	cad248f5c6	Prepare FragmentedRangeTombstoneIterator for use in compaction (#4740 ) Summary: To support the flush/compaction use cases of RangeDelAggregator in v2, FragmentedRangeTombstoneIterator now supports dropping tombstones that cannot be read in the compaction output file. Furthermore, FragmentedRangeTombstoneIterator supports the "snapshot striping" use case by allowing an iterator to be split by a list of snapshots. RangeDelAggregatorV2 will use these changes in a follow-up change. In the process of making these changes, other miscellaneous cleanups were also done in these files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4740 Differential Revision: D13287382 Pulled By: abhimadan fbshipit-source-id: f5aeb03e1b3058049b80c02a558ee48f723fa48c	2018-12-11 12:10:48 -08:00
Anand Ananthabhotla	1b01d23be2	Add PerfContext counters for index/filter block cache stats (#4540 ) Summary: Add counters to track block cache index/filter hits and misses. We currently count aggregate hits and misses, which includes index/filter/data blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4540 Differential Revision: D10459652 Pulled By: anand1976 fbshipit-source-id: 0c59eee7f12f5103dcb6686f0e7995babe63d425	2018-12-07 15:07:56 -08:00
Sagar Vemuri	0463f61837	Refactor BlockBasedTable::Open (#4636 ) Summary: Refactored and simplified `BlockBasedTable::Open` to be similar to `BlockBasedTableBuilder::Finish` as both these functions complement each other. Also added `BlockBasedTableBuilder::WriteFooter` along the way. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4636 Differential Revision: D12933319 Pulled By: sagar0 fbshipit-source-id: 1ff1d02f6d80a63b5ba720a1fc75e71c7344137b	2018-12-07 13:18:44 -08:00
Yi Wu	512a5e3ef8	Fix BlockBasedTable not always using memory allocator if available (#4678 ) Summary: Fix block based table reader not using memory_allocator when allocating index blocks and compression dictionary blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4678 Differential Revision: D13054594 Pulled By: yiwu-arbug fbshipit-source-id: 379f25bcc665395662511c4f873f4b7b55104ce2	2018-11-28 18:01:24 -08:00
Abhishek Madan	8fe1e06ca0	Clean up FragmentedRangeTombstoneList (#4692 ) Summary: Removed `one_time_use` flag, which removed the need for some tests, and changed all `NewRangeTombstoneIterator` methods to return `FragmentedRangeTombstoneIterators`. These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones` and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692 Differential Revision: D13106570 Pulled By: abhimadan fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845	2018-11-28 15:29:02 -08:00
Yi Wu	05d9d82181	Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697 ) Summary: …ons (#4676)" This reverts commit `b32d087dbb`. `MemoryAllocator` needs to be with `Cache`, since cache entry can outlive DB and block based table. The cache needs to hold reference to memory allocator when deleting cache entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697 Differential Revision: D13133490 Pulled By: yiwu-arbug fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a	2018-11-21 11:29:57 -08:00
Abhishek Madan	6bee36a786	Modify FragmentedRangeTombstoneList member layout (#4632 ) Summary: Rather than storing a `vector<RangeTombstone>`, we now store a `vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A `RangeTombstoneStack` contains the start and end keys of a range tombstone fragment, and indices into the seqnum vector to indicate which sequence numbers the fragment is located at. The diagram below illustrates an example: ``` tombstones_: [a, b) [c, e) [h, k) \| \ / \ / \| \| \ / \ / \| v v v v tombstone_seqs_: [ 5 3 10 7 2 8 6 ] ``` This format allows binary searching the tombstone list to use less key comparisons, which helps in cases where there are many overlapping tombstones. Also, this format makes it easier to add DBIter-like semantics to `FragmentedRangeTombstoneIterator` in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632 Differential Revision: D13053103 Pulled By: abhimadan fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0	2018-11-14 17:52:17 -08:00
Siying Dong	b82e57d425	Remove two variables from BlockContents class and don't use class Block for compressed block (#4650 ) Summary: We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed. The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there. Some code is slightly refactored so that the cachable information can flow better. Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650 Differential Revision: D12969070 Pulled By: siying fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed	2018-11-13 17:02:55 -08:00
Yi Wu	b32d087dbb	Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676 ) Summary: Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy. It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676 Differential Revision: D13047662 Pulled By: yiwu-arbug fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd	2018-11-13 13:48:38 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Yi Wu	f560c8f5c8	s/CacheAllocator/MemoryAllocator/g (#4590 ) Summary: Rename the interface, as it is mean to be a generic interface for memory allocation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590 Differential Revision: D10866340 Pulled By: yiwu-arbug fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16	2018-10-26 14:30:30 -07:00
Abhishek Madan	7528130e38	Cache fragmented range tombstones in BlockBasedTableReader (#4493 ) Summary: This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses. On the same DB used in #4449, running `readrandom` results in the following: ``` readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found) ``` Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results): ``` Tombstones? \| avg micros/op \| stddev micros/op \| avg ops/s \| stddev ops/s ----------------- \| ------------- \| ---------------- \| ------------ \| ------------ None \| 0.6186 \| 0.04637 \| 1,625,252.90 \| 124,679.41 500 Expanded \| 0.6019 \| 0.03628 \| 1,666,670.40 \| 101,142.65 500 Unexpanded \| 0.6435 \| 0.03994 \| 1,559,979.40 \| 104,090.52 1k Expanded \| 0.6034 \| 0.04349 \| 1,665,128.10 \| 125,144.57 1k Unexpanded \| 0.6261 \| 0.03093 \| 1,600,457.50 \| 79,024.94 5k Expanded \| 0.6163 \| 0.05926 \| 1,636,668.80 \| 154,888.85 5k Unexpanded \| 0.6402 \| 0.04002 \| 1,567,804.70 \| 100,965.55 10k Expanded \| 0.6036 \| 0.05105 \| 1,667,237.70 \| 142,830.36 10k Unexpanded \| 0.6128 \| 0.02598 \| 1,634,633.40 \| 72,161.82 25k Expanded \| 0.6198 \| 0.04542 \| 1,620,980.50 \| 116,662.93 25k Unexpanded \| 0.5478 \| 0.0362 \| 1,833,059.10 \| 121,233.81 50k Expanded \| 0.5104 \| 0.04347 \| 1,973,107.90 \| 184,073.49 50k Unexpanded \| 0.4528 \| 0.03387 \| 2,219,034.50 \| 170,984.32 ``` After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493 Differential Revision: D10842844 Pulled By: abhimadan fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509	2018-10-25 19:26:44 -07:00
Zhongyi Xie	21bf7421ca	use per-level perf context for bloom filter related counters (#4581 ) Summary: PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels. This PR takes advantage of the feature to populate a few counters related to bloom filters Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581 Differential Revision: D10518010 Pulled By: miasantreble fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8	2018-10-24 12:21:38 -07:00
Zhongyi Xie	d6ec288703	Add PerfContextByLevel to provide per level perf context information (#4226 ) Summary: Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level. This will be helpful when analyzing point and range query performance as well as tuning bloom filter Also replaced __thread with thread_local keyword for perf_context Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226 Differential Revision: D10369509 Pulled By: miasantreble fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82	2018-10-17 11:19:40 -07:00
Zhongyi Xie	ce1fc5af09	fix unused param `allocator` in compression.h (#4453 ) Summary: this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453 Differential Revision: D10202626 Pulled By: miasantreble fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355	2018-10-04 13:24:22 -07:00
Igor Canadi	1cf5deb8fd	Introduce CacheAllocator, a custom allocator for cache blocks (#4437 ) Summary: This is a conceptually simple change, but it touches many files to pass the allocator through function calls. We introduce CacheAllocator, which can be used by clients to configure custom allocator for cache blocks. Our motivation is to hook this up with folly's `JemallocNodumpAllocator` (`f43ce6d686/folly/experimental/JemallocNodumpAllocator.h`), but there are many other possible use cases. Additionally, this commit cleans up memory allocation in `util/compression.h`, making sure that all allocations are wrapped in a unique_ptr as soon as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437 Differential Revision: D10132814 Pulled By: yiwu-arbug fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564	2018-10-02 17:24:58 -07:00
Andrew Kryczka	2c14662213	Revert "Digest ZSTD compression dictionary once per SST file (#4251 )" (#4347 ) Summary: Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting. Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347 Differential Revision: D9668365 Pulled By: ajkr fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd	2018-09-06 09:58:34 -07:00
cngzhnp	64324e329e	Support pragma once in all header files and cleanup some warnings (#4339 ) Summary: As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited. Besides this, try to fix some warnings about loss of data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339 Differential Revision: D9654990 Pulled By: ajkr fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848	2018-09-05 18:13:31 -07:00
Andrew Kryczka	6c40806e51	Digest ZSTD compression dictionary once per SST file (#4251 ) Summary: In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file. ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary. There are a couple other changes included in this PR: - Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it. - Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251 Differential Revision: D9257078 Pulled By: ajkr fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438	2018-08-23 19:28:18 -07:00
Siying Dong	dc064f302e	Suppress two CLANG Analyze warning (#4291 ) Summary: Suppress two CLANG analyze warnings. They don't seem to be real bugs Pull Request resolved: https://github.com/facebook/rocksdb/pull/4291 Differential Revision: D9407333 Pulled By: siying fbshipit-source-id: 2ed63d88fa0b217fdccb1572d7508467c2203dc8	2018-08-20 16:57:38 -07:00
Fenggang Wu	19ec44fd39	Improve point-lookup performance using a data block hash index (#4174 ) Summary: Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.` The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174 Differential Revision: D8965914 Pulled By: fgwu fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198	2018-08-15 14:30:03 -07:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	2018-07-27 16:12:23 -07:00
Siying Dong	2a81633da2	Fix bug when seeking backward against an out-of-bound iterator (#4187 ) Summary: `92ee3350e0` introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187 Differential Revision: D8996600 Pulled By: siying fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1	2018-07-25 17:14:01 -07:00
Zhongyi Xie	f95a5b2464	Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490 ) Summary: Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker` Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490 Differential Revision: D6969154 Pulled By: miasantreble fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029	2018-07-20 16:58:13 -07:00
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	2018-07-20 14:43:37 -07:00
Andrew Kryczka	b5613227a9	Smaller tail readahead when not reading index/filters (#4159 ) Summary: In all cases during `BlockBasedTable::Open`, we issue at least three read requests to the file's tail: (1) footer, (2) metaindex block, and (3) properties block. Depending on the config, we may also read other metablocks like filter and index. This PR issues smaller readahead when we expect to do only the three necessary reads mentioned above. Then, 4KB should be enough (ignoring the case where there are lots of user-defined properties). We can keep doing 512KB readahead when additional reads are expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4159 Differential Revision: D8924002 Pulled By: ajkr fbshipit-source-id: cfc713275de4d05ce11f18571f1d72e27ccd3356	2018-07-19 16:13:22 -07:00

1 2 3 4 5

250 Commits