Commit Graph

239 Commits

Author SHA1 Message Date
Siying Dong
89ab1381f8 Apply automatic formatting to some files (#5114)
Summary:
Following files were run through automatic formatter:
db/db_impl.cc
db/db_impl.h
db/db_impl_compaction_flush.cc
db/db_impl_debug.cc
db/db_impl_files.cc
db/db_impl_readonly.h
db/db_impl_write.cc
db/dbformat.cc
db/dbformat.h
table/block.cc
table/block.h
table/block_based_filter_block.cc
table/block_based_filter_block.h
table/block_based_filter_block_test.cc
table/block_based_table_builder.cc
table/block_based_table_reader.cc
table/block_based_table_reader.h
table/block_builder.cc
table/block_builder.h
table/block_fetcher.cc
table/block_prefix_index.cc
table/block_prefix_index.h
table/block_test.cc
table/format.cc
table/format.h

I could easily run all the files, but I don't want people to feel that
I'm doing it for lines of code changes :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114

Differential Revision: D14633040

Pulled By: siying

fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52
2019-03-27 16:24:45 -07:00
Yi Wu
d69241586e Fix perf_context.user_key_comparison_count for range scan (#5098)
Summary:
Currently `perf_context.user_key_comparison_count` is bump only in `InternalKeyComparator`. For places user comparator is used directly the counter is not bump. Fixing the majority of it.

Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098

Differential Revision: D14603753

Pulled By: siying

fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
2019-03-27 10:34:27 -07:00
Siying Dong
2b4d5ceb47 Remove some "using std::..." from header files. (#5113)
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113

Differential Revision: D14633030

Pulled By: siying

fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
2019-03-27 10:28:21 -07:00
Yi Wu
75133b1b6b Fix SstFileReader not able to open ingested file (#5097)
Summary:
Since `SstFileReader` don't know largest seqno of a file, it will fail this check when it open a file with global seqno: ca89ac2ba9/table/block_based_table_reader.cc (L730)
Changes:
* Pass largest_seqno=kMaxSequenceNumber from `SstFileReader` and allow it to bypass the above check.
* `BlockBasedTable::VerifyChecksum` also double check if checksum will match when excluding global seqno (this is to make the new test in sst_table_reader_test pass).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5097

Differential Revision: D14607434

Pulled By: riversand963

fbshipit-source-id: 9008599227c5fccbf9b73fee46b3bf4a1523f023
2019-03-26 10:25:18 -07:00
Yi Wu
7ca9eb7542 Fix BlockBasedTableIterator construction missing index_key_is_full parameter
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5104

Differential Revision: D14619000

Pulled By: maysamyabandeh

fbshipit-source-id: c2895794a3f31b826c149dcb698c1952dacc2332
2019-03-26 10:13:01 -07:00
Michael Liu
ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Yanqin Jin
2d049ab7e8 Checksum properties block for block-based table (#4956)
Summary:
Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956

Differential Revision: D14012741

Pulled By: riversand963

fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186
2019-02-11 11:50:01 -08:00
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
Andrew Kryczka
27054d837b Call NewDataBlockIterator with correct arguments in DB::Get (#4913)
Summary:
The pointer `get_context` was passed as the value for the boolean argument `index_key_is_full`. Luckily the pointer was always non-null so evaluated to true which is the correct value for the boolean argument. But we were missing out on batch updates to stats since we were not passing anything for the `GetContext*` argument and it defaults to `nullptr`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4913

Differential Revision: D13791449

Pulled By: ajkr

fbshipit-source-id: dbe40bf406c64d34cb5298604145d18b9e0ca9be
2019-01-23 15:39:05 -08:00
Andrew Kryczka
01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
Alexander Zinoviev
80bf8975fd Add a new per level counter for block cache hit (#4796)
Summary:
Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796

Differential Revision: D13513688

Pulled By: zinoale

fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281
2018-12-21 13:20:05 -08:00
Andrew Kryczka
e0be1bc4f1 fix DeleteRange memory leak for mmap and block cache (#4810)
Summary:
Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`.

Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810

Differential Revision: D13534296

Pulled By: ajkr

fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599
2018-12-20 21:59:49 -08:00
Abhishek Madan
cad248f5c6 Prepare FragmentedRangeTombstoneIterator for use in compaction (#4740)
Summary:
To support the flush/compaction use cases of RangeDelAggregator
in v2, FragmentedRangeTombstoneIterator now supports dropping tombstones
that cannot be read in the compaction output file. Furthermore,
FragmentedRangeTombstoneIterator supports the "snapshot striping" use
case by allowing an iterator to be split by a list of snapshots.
RangeDelAggregatorV2 will use these changes in a follow-up change.

In the process of making these changes, other miscellaneous cleanups
were also done in these files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4740

Differential Revision: D13287382

Pulled By: abhimadan

fbshipit-source-id: f5aeb03e1b3058049b80c02a558ee48f723fa48c
2018-12-11 12:10:48 -08:00
Anand Ananthabhotla
1b01d23be2 Add PerfContext counters for index/filter block cache stats (#4540)
Summary:
Add counters to track block cache index/filter hits and misses. We currently count aggregate hits and misses, which includes index/filter/data blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4540

Differential Revision: D10459652

Pulled By: anand1976

fbshipit-source-id: 0c59eee7f12f5103dcb6686f0e7995babe63d425
2018-12-07 15:07:56 -08:00
Sagar Vemuri
0463f61837 Refactor BlockBasedTable::Open (#4636)
Summary:
Refactored and simplified `BlockBasedTable::Open` to be similar to `BlockBasedTableBuilder::Finish` as both these functions complement each other. Also added `BlockBasedTableBuilder::WriteFooter` along the way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4636

Differential Revision: D12933319

Pulled By: sagar0

fbshipit-source-id: 1ff1d02f6d80a63b5ba720a1fc75e71c7344137b
2018-12-07 13:18:44 -08:00
Yi Wu
512a5e3ef8 Fix BlockBasedTable not always using memory allocator if available (#4678)
Summary:
Fix block based table reader not using memory_allocator when allocating index blocks and compression dictionary blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4678

Differential Revision: D13054594

Pulled By: yiwu-arbug

fbshipit-source-id: 379f25bcc665395662511c4f873f4b7b55104ce2
2018-11-28 18:01:24 -08:00
Abhishek Madan
8fe1e06ca0 Clean up FragmentedRangeTombstoneList (#4692)
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.

These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692

Differential Revision: D13106570

Pulled By: abhimadan

fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
2018-11-28 15:29:02 -08:00
Yi Wu
05d9d82181 Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697)
Summary:
…ons (#4676)"

This reverts commit b32d087dbb.

`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697

Differential Revision: D13133490

Pulled By: yiwu-arbug

fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
2018-11-21 11:29:57 -08:00
Abhishek Madan
6bee36a786 Modify FragmentedRangeTombstoneList member layout (#4632)
Summary:
Rather than storing a `vector<RangeTombstone>`, we now store a
`vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A
`RangeTombstoneStack` contains the start and end keys of a range tombstone
fragment, and indices into the seqnum vector to indicate which sequence
numbers the fragment is located at. The diagram below illustrates an
example:

```
tombstones_:     [a, b) [c, e) [h, k)
                   | \   /  \   /  |
                   |  \ /    \ /   |
                   v   v      v    v
tombstone_seqs_: [ 5 3 10 7 2 8 6  ]
```

This format allows binary searching the tombstone list to use less key
comparisons, which helps in cases where there are many overlapping
tombstones. Also, this format makes it easier to add DBIter-like
semantics to `FragmentedRangeTombstoneIterator` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632

Differential Revision: D13053103

Pulled By: abhimadan

fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0
2018-11-14 17:52:17 -08:00
Siying Dong
b82e57d425 Remove two variables from BlockContents class and don't use class Block for compressed block (#4650)
Summary:
We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed.

The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there.

Some code is slightly refactored so that the cachable information can flow better.

Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650

Differential Revision: D12969070

Pulled By: siying

fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed
2018-11-13 17:02:55 -08:00
Yi Wu
b32d087dbb Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676)
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.

It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676

Differential Revision: D13047662

Pulled By: yiwu-arbug

fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
2018-11-13 13:48:38 -08:00
Sagar Vemuri
dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Yi Wu
f560c8f5c8 s/CacheAllocator/MemoryAllocator/g (#4590)
Summary:
Rename the interface, as it is mean to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590

Differential Revision: D10866340

Pulled By: yiwu-arbug

fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
2018-10-26 14:30:30 -07:00
Abhishek Madan
7528130e38 Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.

On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
```

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
   Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32
```

After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493

Differential Revision: D10842844

Pulled By: abhimadan

fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
2018-10-25 19:26:44 -07:00
Zhongyi Xie
21bf7421ca use per-level perf context for bloom filter related counters (#4581)
Summary:
PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels.
This PR takes advantage of the feature to populate a few counters related to bloom filters
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581

Differential Revision: D10518010

Pulled By: miasantreble

fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8
2018-10-24 12:21:38 -07:00
Zhongyi Xie
d6ec288703 Add PerfContextByLevel to provide per level perf context information (#4226)
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226

Differential Revision: D10369509

Pulled By: miasantreble

fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
2018-10-17 11:19:40 -07:00
Zhongyi Xie
ce1fc5af09 fix unused param allocator in compression.h (#4453)
Summary:
this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453

Differential Revision: D10202626

Pulled By: miasantreble

fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355
2018-10-04 13:24:22 -07:00
Igor Canadi
1cf5deb8fd Introduce CacheAllocator, a custom allocator for cache blocks (#4437)
Summary:
This is a conceptually simple change, but it touches many files to
pass the allocator through function calls.

We introduce CacheAllocator, which can be used by clients to configure
custom allocator for cache blocks. Our motivation is to hook this up
with folly's `JemallocNodumpAllocator`
(f43ce6d686/folly/experimental/JemallocNodumpAllocator.h),
but there are many other possible use cases.

Additionally, this commit cleans up memory allocation in
`util/compression.h`, making sure that all allocations are wrapped in a
unique_ptr as soon as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437

Differential Revision: D10132814

Pulled By: yiwu-arbug

fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564
2018-10-02 17:24:58 -07:00
Andrew Kryczka
2c14662213 Revert "Digest ZSTD compression dictionary once per SST file (#4251)" (#4347)
Summary:
Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting.

Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347

Differential Revision: D9668365

Pulled By: ajkr

fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd
2018-09-06 09:58:34 -07:00
cngzhnp
64324e329e Support pragma once in all header files and cleanup some warnings (#4339)
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.

Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339

Differential Revision: D9654990

Pulled By: ajkr

fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
2018-09-05 18:13:31 -07:00
Andrew Kryczka
6c40806e51 Digest ZSTD compression dictionary once per SST file (#4251)
Summary:
In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file.

ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary.

There are a couple other changes included in this PR:

- Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it.
- Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251

Differential Revision: D9257078

Pulled By: ajkr

fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438
2018-08-23 19:28:18 -07:00
Siying Dong
dc064f302e Suppress two CLANG Analyze warning (#4291)
Summary:
Suppress two CLANG analyze warnings. They don't seem to be real bugs
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4291

Differential Revision: D9407333

Pulled By: siying

fbshipit-source-id: 2ed63d88fa0b217fdccb1572d7508467c2203dc8
2018-08-20 16:57:38 -07:00
Fenggang Wu
19ec44fd39 Improve point-lookup performance using a data block hash index (#4174)
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`

The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174

Differential Revision: D8965914

Pulled By: fgwu

fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
2018-08-15 14:30:03 -07:00
Maysam Yabandeh
caf0f53a74 Index value delta encoding (#3983)
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the  block cache more efficiently. The feature is enabled with using format_version 4.

The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585   rocksdb read-only range=100
Format 3:
19569   rocksdb read-only range=100
Format 4:
19352   rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983

Differential Revision: D8361343

Pulled By: maysamyabandeh

fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
2018-08-09 16:58:40 -07:00
Yanqin Jin
54de56844d Remove random writes from SST file ingestion (#4172)
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172

Differential Revision: D8961465

Pulled By: riversand963

fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
2018-07-27 16:12:23 -07:00
Siying Dong
2a81633da2 Fix bug when seeking backward against an out-of-bound iterator (#4187)
Summary:
92ee3350e0 introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187

Differential Revision: D8996600

Pulled By: siying

fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1
2018-07-25 17:14:01 -07:00
Zhongyi Xie
f95a5b2464 Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490)
Summary:
Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490

Differential Revision: D6969154

Pulled By: miasantreble

fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
2018-07-20 16:58:13 -07:00
Siying Dong
8425c8bd4d BlockBasedTableReader: automatically adjust tail prefetch size (#4156)
Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others.
Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10%
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156

Differential Revision: D8916847

Pulled By: siying

fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
2018-07-20 14:43:37 -07:00
Andrew Kryczka
b5613227a9 Smaller tail readahead when not reading index/filters (#4159)
Summary:
In all cases during `BlockBasedTable::Open`, we issue at least three read requests to the file's tail: (1) footer, (2) metaindex block, and (3) properties block. Depending on the config, we may also read other metablocks like filter and index.

This PR issues smaller readahead when we expect to do only the three necessary reads mentioned above. Then, 4KB should be enough (ignoring the case where there are lots of user-defined properties). We can keep doing 512KB readahead when additional reads are expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4159

Differential Revision: D8924002

Pulled By: ajkr

fbshipit-source-id: cfc713275de4d05ce11f18571f1d72e27ccd3356
2018-07-19 16:13:22 -07:00
Siying Dong
8f06b4fa01 Separate some IndexBlockIter logic from BlockIter (#4136)
Summary:
Some logic only related to IndexBlockIter is separated from BlockIter to IndexBlockIter. This is done by writing an exclusive Seek() and SeekForPrev() for DataBlockIter, and all metadata block iter and tombstone block iter now use data block iter. Dealing with the BinarySeek() sharing problem by passing in the comparator to use.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4136

Reviewed By: maysamyabandeh

Differential Revision: D8859673

Pulled By: siying

fbshipit-source-id: 703e5e6824b82b7cbf4721f3594b94127797ca9e
2018-07-16 10:13:18 -07:00
Nathan VanBenschoten
ef7815b803 Support range deletion tombstones in IngestExternalFile SSTs (#3778)
Summary:
Fixes #3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778

Differential Revision: D8821836

Pulled By: anand1976

fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
2018-07-13 22:43:09 -07:00
Maysam Yabandeh
d4ad32d7bd Refactor BlockIter (#4121)
Summary:
BlockIter is getting crowded including details that specific only to either index or data blocks. The patch moves down such details to DataBlockIter and IndexBlockIter, both inheriting from BlockIter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121

Differential Revision: D8816832

Pulled By: maysamyabandeh

fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197
2018-07-12 17:27:31 -07:00
Nikhil Benesch
5cd8240b86 Test range deletions with more configurations (#4021)
Summary:
Run the basic range deletion tests against the standard set of
configurations. This testing exposed that files with hash indexes and
partitioned indexes were not handling the case where the file contained
only range deletions--i.e., where the index was empty.

Additionally file a TODO about the fact that range deletions are broken
when allow_mmap_reads = true is set.

/cc ajkr nvanbenschoten

Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021

Differential Revision: D8811860

Pulled By: ajkr

fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343
2018-07-11 15:57:49 -07:00
Maysam Yabandeh
2462763b2e Fix mis-spoken assert on prefetch_filter and prefetch_index (#4077)
Summary:
We can have prefetch_index without prefetch_filter but not the other way around. The assert statement is fixed.
Closes https://github.com/facebook/rocksdb/pull/4077

Differential Revision: D8694472

Pulled By: maysamyabandeh

fbshipit-source-id: ccd2804d9d9cdafb1c3e65062c7bc38603e69004
2018-06-29 09:28:12 -07:00
Maysam Yabandeh
29ffbb8a50 Charging block cache more accurately (#4073)
Summary:
Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it.
Closes https://github.com/facebook/rocksdb/pull/4073

Differential Revision: D8686552

Pulled By: maysamyabandeh

fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f
2018-06-29 08:57:20 -07:00
Maysam Yabandeh
235ab9dd32 Pin mmap files in ReadOnlyDB (#4053)
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053

Differential Revision: D8662546

Pulled By: maysamyabandeh

fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
2018-06-27 17:13:34 -07:00
Zhongyi Xie
408205a36b use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899)
Summary:
Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file.
This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at.
Closes https://github.com/facebook/rocksdb/pull/3899

Differential Revision: D8157459

Pulled By: miasantreble

fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
2018-06-26 15:57:26 -07:00
Sagar Vemuri
189f0c27aa Make BlockBasedTableIterator compaction-aware (#4048)
Summary:
Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`.

In 7103559f49, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain.
Closes https://github.com/facebook/rocksdb/pull/4048

Differential Revision: D8601056

Pulled By: sagar0

fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3
2018-06-25 13:19:27 -07:00
Maysam Yabandeh
80ade9ad83 Pin top-level index on partitioned index/filter blocks (#4037)
Summary:
Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037

Differential Revision: D8596218

Pulled By: maysamyabandeh

fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
2018-06-22 15:27:46 -07:00
Sagar Vemuri
7103559f49 Improve direct IO range scan performance with readahead (#3884)
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.

**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.

**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure not to re-read partial chunks of data that were already available in the buffer, from device again.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.

**Constraints:**
- Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.

**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```

Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

```
Before:
seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement. Taken on an average of 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884

Differential Revision: D8082143

Pulled By: sagar0

fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
2018-06-21 11:13:08 -07:00