Commit Graph

136 Commits

Author SHA1 Message Date
Hui Xiao
920386f2b7 Detect (new) Bloom/Ribbon Filter construction corruption (#9342)
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum

This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.

Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.

**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
   - See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
   -  Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342

Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
   - For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true  -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
   -  FastLocalBloom
       - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
       - After change:
          -  `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
    -  Standard128Ribbon
       - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
       - After change:
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.

Reviewed By: pdillinger

Differential Revision: D33746928

Pulled By: hx235

fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
2022-02-01 17:42:35 -08:00
Si Ke
93b1de4f45 Enable db_test running in Centos 32 bit OS and Alpine 32 bit OS (#9294)
Summary:
Closes https://github.com/facebook/rocksdb/issues/9271

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9294

Reviewed By: riversand963, hx235

Differential Revision: D33586002

Pulled By: pdillinger

fbshipit-source-id: 3d1a2fa71023e108613ff03dbd37a5f954fc4920
2022-01-14 11:58:18 -08:00
Peter Dillinger
653c392e47 More refactoring ahead of footer & meta changes (#9240)
Summary:
I'm working on a new format_version=6 to support context
checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
updates to support that change.

Test coverage data and manual inspection agree on dead code in
block_based_table_reader.cc (removed).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240

Test Plan:
tests enhanced to cover more cases etc.

Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.

TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}

(Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
Before w/ wal: 454512
After w/ wal: 444820 (-2.1%)
Before w/o wal: 1004560
After w/o wal: 998897 (-0.6%)

Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.

This regression will be corrected in a follow-up PR.

Reviewed By: ajkr

Differential Revision: D32813769

Pulled By: pdillinger

fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
2021-12-10 08:13:26 -08:00
Hui Xiao
74544d582f Account Bloom/Ribbon filter construction memory in global memory limit (#9073)
Summary:
Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge.

**Context:**
Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`.

The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option.

**Summary:**
- Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction:
   - hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance),
   - banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`),
   - final filter (i.e, `mutable_buf`'s size).
      - Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as  explicitly delete the filter bits builder when done with the final filter in block based table.
- Added option fo run `filter_bench` with this memory reservation feature

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073

Test Plan:
- Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of  `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory`
  - To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around **3000 - 6000 ms** and others take around **1500 - 2000 ms**, in total adding **20000 - 25000 ms** to the test suit running locally
- Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter
- Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over **20** run as below:
   - FastLocalBloom
      - baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`:
         - **Build avg ns/key: 29.56295** (DEBUG_LEVEL=1), **29.98153** (DEBUG_LEVEL=0)
      - new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`:
         - **Build avg ns/key: 30.99046** (DEBUG_LEVEL=1), **30.48867** (DEBUG_LEVEL=0)
      - new feature of RibbonFilter with fallback  (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
         - **Build avg ns/key: 31.146975** (DEBUG_LEVEL=1), **30.08165** (DEBUG_LEVEL=0)

    - Ribbon128
       - baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`:
           - **Build avg ns/key: 129.17585** (DEBUG_LEVEL=1), **130.5225** (DEBUG_LEVEL=0)
       - new feature  (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `:
           - **Build avg ns/key: 131.61645** (DEBUG_LEVEL=1), **132.98075** (DEBUG_LEVEL=0)
       - new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
          - **Build avg ns/key: 52.032965** (DEBUG_LEVEL=1), **52.597825** (DEBUG_LEVEL=0)
          - And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console.

Reviewed By: pdillinger

Differential Revision: D31991348

Pulled By: hx235

fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e
2021-11-18 09:42:20 -08:00
Peter Dillinger
a7d4bea43a Implement XXH3 block checksum type (#9069)
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.

Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument

This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069

Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.

DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.

Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.

 ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)

As you can see, especially once warmed up, xxh3 is fastest.

 ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

Test

    for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done

Results (ops/sec)

    for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done

results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3

Reviewed By: mrambacher

Differential Revision: D31905249

Pulled By: pdillinger

fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
2021-10-28 22:15:17 -07:00
Peter Dillinger
5bf9a7d5ee Clarify caching behavior for index and filter partitions (#9068)
Summary:
Somewhat confusingly, index and filter partition blocks are
never owned by table readers, even with
cache_index_and_filter_blocks=false. They still go into block cache
(possibly pinned by table reader) if there is a block cache. If no block
cache, they are only loaded transiently on demand.

This PR primarily clarifies the options APIs and some internal code
comments.

Also, this closes a hypothetical data corruption vulnerability where
some but not all index partitions are pinned. I haven't been able to
reproduce a case where it can happen (the failure seems to propagate
to abort table open) but it's worth patching nonetheless.

Fixes https://github.com/facebook/rocksdb/issues/8979

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9068

Test Plan:
existing tests :-/  I could cover the new code using sync
points, but then I'd have to very carefully relax my `assert(false)`

Reviewed By: ajkr

Differential Revision: D31898284

Pulled By: pdillinger

fbshipit-source-id: f2511a7d3a36bc04b627935d8e6cfea6422f98be
2021-10-27 17:23:04 -07:00
Akanksha Mahajan
fd2079938d Dynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620)
Summary:
Dynamically configure BlockBasedTableOptions.prepopulate_block_cache using DB::SetOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8620

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D30091319

Pulled By: akankshamahajan15

fbshipit-source-id: fb586d1848a8dd525bba7b2f9eeac34f2fc6d82c
2021-08-05 19:44:51 -07:00
Akanksha Mahajan
8b2f60b668 Cache warming blocks during flush (#8561)
Summary:
Insert warm blocks  (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561

Test Plan: Added unit test

Reviewed By: anand1976

Differential Revision: D29773411

Pulled By: akankshamahajan15

fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66
2021-08-03 12:44:15 -07:00
Akanksha Mahajan
5ba1b6e549 Cache warming data blocks during flush (#8242)
Summary:
This PR prepopulates warm/hot data blocks which are already in memory
into block cache at the time of flush. On a flush, the data block that is
in memory (in memtables) get flushed to the device. If using Direct IO,
additional IO is incurred to read this data back into memory again, which
is avoided by enabling newly added option.

 Right now, this is enabled only for flush for data blocks. We plan to
expand this option to cover compactions in the future and for other types
 of blocks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242

Test Plan: Add new unit test

Reviewed By: anand1976

Differential Revision: D28521703

Pulled By: akankshamahajan15

fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
2021-06-17 21:56:47 -07:00
Peter Dillinger
865a25101d Mark Ribbon filter and optimize_filters_for_memory as production (#8408)
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.

Some small refactoring in db_stress.

Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408

Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:

bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384

ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384

Reviewed By: mrambacher

Differential Revision: D29140805

Pulled By: pdillinger

fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
2021-06-17 12:29:16 -07:00
Peter Dillinger
85becd94c1 Refactor: use TableBuilderOptions to reduce parameter lists (#8240)
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.

Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.

Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).

Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240

Test Plan: ASAN make check

Reviewed By: mrambacher

Differential Revision: D28075891

Pulled By: pdillinger

fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
2021-04-29 07:00:50 -07:00
mrambacher
524b10bd6e Fix spelling in comments in include/rocksdb/ (#8120)
Summary:
Ran a spell check over the comments in the include/rocksdb directory and fixed any mis-spellings.

There are still some variable names that are spelled incorrectly (like SizeApproximationOptions::include_memtabtles, SstFileMetaData::oldest_ancester_time) that were not fixed, as those would break compilation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8120

Reviewed By: zhichao-cao

Differential Revision: D27366034

Pulled By: mrambacher

fbshipit-source-id: 6a3f3674890bb6acc751e9c5887a8fbb6adca5df
2021-03-29 05:05:06 -07:00
Peter Dillinger
0028e3398b Make format_version=5 new default (#8017)
Summary:
Haven't seen any production issues with new Bloom filter and
it's now > 1 year old (added in 6.6.0).

Updated check_format_compatible.sh and HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017

Test Plan: tests updated (or prior bugs fixed)

Reviewed By: ajkr

Differential Revision: D26762197

Pulled By: pdillinger

fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2
2021-03-09 12:42:53 -08:00
Akanksha Mahajan
cd79a00903 Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951)
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file. The readahead starts at 8KB and doubles on every
additional read upto BlockBasedTable::kMaxAutoReadAheadSize which is
256*1024.
This PR adds a new option BlockBasedTableOptions::max_auto_readahead_size which
replaces BlockBasedTable::kMaxAutoReadAheadSize and the new option can be
configured.
If max_auto_readahead_size is set 0 then no implicit auto prefetching will
be done. If max_auto_readahead_size provided is less than
8KB (which is initial readahead size used by rocksdb in case of
auto-readahead), readahead size will remain same as max_auto_readahead_size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951

Test Plan: Add new unit test case.

Reviewed By: anand1976

Differential Revision: D26568085

Pulled By: akankshamahajan15

fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af
2021-02-23 16:54:08 -08:00
mrambacher
c442f6809f Create a Customizable class to load classes and configurations (#6590)
Summary:
The Customizable class is an extension of the Configurable class and allows instances to be created by a name/ID.  Classes that extend customizable can define their Type (e.g. "TableFactory", "Cache") and  a method to instantiate them (TableFactory::CreateFromString).  Customizable objects can be registered with the ObjectRegistry and created dynamically.

Future PRs will make more types of objects extend Customizable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6590

Reviewed By: cheng-chang

Differential Revision: D24841553

Pulled By: zhichao-cao

fbshipit-source-id: d0c2132bd932e971cbfe2c908ca2e5db30c5e155
2020-11-11 15:10:41 -08:00
Andrew Kryczka
75d3b6fdf0 Redesign block cache pinning API (#7520)
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520

Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`

Reviewed By: pdillinger

Differential Revision: D24200034

Pulled By: ajkr

fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
2020-10-11 14:58:24 -07:00
mrambacher
7d472accdc Bring the Configurable options together (#5753)
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753

Reviewed By: ajkr

Differential Revision: D23385030

Pulled By: zhichao-cao

fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
2020-09-14 17:01:01 -07:00
Anand Ananthabhotla
9a5886bd8c Extend Get/MultiGet deadline support to table open (#6982)
Summary:
Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.

The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.

Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982

Test Plan: Add new unit tests in db_basic_test

Reviewed By: riversand963

Differential Revision: D22219515

Pulled By: anand1976

fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
2020-06-29 14:53:17 -07:00
Peter Dillinger
5b2bbacb6f Minimize memory internal fragmentation for Bloom filters (#6427)
Summary:
New experimental option BBTO::optimize_filters_for_memory builds
filters that maximize their use of "usable size" from malloc_usable_size,
which is also used to compute block cache charges.

Rather than always "rounding up," we track state in the
BloomFilterPolicy object to mix essentially "rounding down" and
"rounding up" so that the average FP rate of all generated filters is
the same as without the option. (YMMV as heavily accessed filters might
be unluckily lower accuracy.)

Thus, the option near-minimizes what the block cache considers as
"memory used" for a given target Bloom filter false positive rate and
Bloom filter implementation. There are no forward or backward
compatibility issues with this change, though it only works on the
format_version=5 Bloom filter.

With Jemalloc, we see about 10% reduction in memory footprint (and block
cache charge) for Bloom filters, but 1-2% increase in storage footprint,
due to encoding efficiency losses (FP rate is non-linear with bits/key).

Why not weighted random round up/down rather than state tracking? By
only requiring malloc_usable_size, we don't actually know what the next
larger and next smaller usable sizes for the allocator are. We pick a
requested size, accept and use whatever usable size it has, and use the
difference to inform our next choice. This allows us to narrow in on the
right balance without tracking/predicting usable sizes.

Why not weight history of generated filter false positive rates by
number of keys? This could lead to excess skew in small filters after
generating a large filter.

Results from filter_bench with jemalloc (irrelevant details omitted):

    (normal keys/filter, but high variance)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.6278
    Number of filters: 5516
    Total size (MB): 200.046
    Reported total allocated memory (MB): 220.597
    Reported internal fragmentation: 10.2732%
    Bits/key stored: 10.0097
    Average FP rate %: 0.965228
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.5104
    Number of filters: 5464
    Total size (MB): 200.015
    Reported total allocated memory (MB): 200.322
    Reported internal fragmentation: 0.153709%
    Bits/key stored: 10.1011
    Average FP rate %: 0.966313

    (very few keys / filter, optimization not as effective due to ~59 byte
     internal fragmentation in blocked Bloom filter representation)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.5649
    Number of filters: 162950
    Total size (MB): 200.001
    Reported total allocated memory (MB): 224.624
    Reported internal fragmentation: 12.3117%
    Bits/key stored: 10.2951
    Average FP rate %: 0.821534
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 31.8057
    Number of filters: 159849
    Total size (MB): 200
    Reported total allocated memory (MB): 208.846
    Reported internal fragmentation: 4.42297%
    Bits/key stored: 10.4948
    Average FP rate %: 0.811006

    (high keys/filter)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.7017
    Number of filters: 164
    Total size (MB): 200.352
    Reported total allocated memory (MB): 221.5
    Reported internal fragmentation: 10.5552%
    Bits/key stored: 10.0003
    Average FP rate %: 0.969358
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.7131
    Number of filters: 160
    Total size (MB): 200.928
    Reported total allocated memory (MB): 200.938
    Reported internal fragmentation: 0.00448054%
    Bits/key stored: 10.1852
    Average FP rate %: 0.963387

And from db_bench (block cache) with jemalloc:

    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17063835
    $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17430747
    $ #^ 2.1% additional filter storage
    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8440400
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 21087528
    rocksdb.bloom.filter.useful COUNT : 4963889
    rocksdb.bloom.filter.full.positive COUNT : 1214081
    rocksdb.bloom.filter.full.true.positive COUNT : 1161999
    $ #^ 1.04 % observed FP rate
    $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8448592
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 18220328
    rocksdb.bloom.filter.useful COUNT : 5360933
    rocksdb.bloom.filter.full.positive COUNT : 1321315
    rocksdb.bloom.filter.full.true.positive COUNT : 1262999
    $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters

(Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427

Test Plan: unit test added, 'make check' with gcc, clang and valgrind

Reviewed By: siying

Differential Revision: D22124374

Pulled By: pdillinger

fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e
2020-06-22 13:32:07 -07:00
anand76
f286fb344b Include options.h in table.h (#6823)
Summary:
https://github.com/facebook/rocksdb/issues/6389 replaced the #include of options.h in table.h with forward declarations, which is causing some build failures in RocksDB users in 6.10. Remove the forward declarations and #include options.h as recommended by the style guide - https://google.github.io/styleguide/cppguide.html#Forward_Declarations
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6823

Reviewed By: riversand963

Differential Revision: D21464078

Pulled By: anand1976

fbshipit-source-id: 6033ee2544d279690f57bb0db91bc83816cee11d
2020-05-07 15:55:29 -07:00
mrambacher
4cbc19d2a1 Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389)
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings.  There is a mishmash of parameters in use here with more needed in the future.  This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389

Reviewed By: siying

Differential Revision: D21163707

Pulled By: zhichao-cao

fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
2020-04-21 17:38:17 -07:00
Mike Kolupaev
e45673dece Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:40:44 -07:00
Peter Dillinger
93b80ca7ba Update default BBTO::format_version from 2 to 4 (#6582)
Summary:
Version 4 has been around long enough, for compatibility and
extensive validation, that it should be default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6582

Test Plan:
CI (w.r.t. changing the default; format_version=4 is well
tested and massively in production at Facebook)

Reviewed By: siying

Differential Revision: D20625233

Pulled By: pdillinger

fbshipit-source-id: 2f83ed874cffa4a39bc7a66cdf3833b978fbb948
2020-03-24 21:22:21 -07:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Peter Dillinger
0695793283 Update format_version comment for 6.6.0
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6097

Differential Revision: D18729661

Pulled By: pdillinger

fbshipit-source-id: d2e4a9d6803aad8dd61ececd5c2b861e6f2da73b
2019-11-27 10:24:16 -08:00
Peter Dillinger
f059c7d9b9 New Bloom filter implementation for full and partitioned filters (#6007)
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.

Speed

The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.

Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):

$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
  Single filter net ns/op: 26.2825
  Random filter net ns/op: 150.459
    Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
  Single filter net ns/op: 63.2978
  Random filter net ns/op: 188.038
    Average FP rate %: 1.13823

Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.

The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.

Accuracy

The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.

Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)

Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120

Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.

Compatibility

Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007

Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).

Differential Revision: D18294749

Pulled By: pdillinger

fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
2019-11-13 16:44:01 -08:00
sdong
15fd3be07b LRU Cache to enable mid-point insertion by default (#5508)
Summary:
Mid-point insertion is a useful feature and is mature now. Make it default. Also changed cache_index_and_filter_blocks_with_high_priority=true as default accordingly, so that we won't evict index and filter blocks easier after the change, to avoid too many surprises to users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5508

Test Plan: Run all existing tests.

Differential Revision: D16021179

fbshipit-source-id: ce8456e8d43b3bfb48df6c304b5290a9d19817eb
2019-06-27 10:20:57 -07:00
Mike Kolupaev
b4d7209428 Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.

This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289

Differential Revision: D15256423

Pulled By: al13n321

fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:54:04 -07:00
Mike Kolupaev
df38c1ce66 Add BlockBasedTableOptions::index_shortening (#5174)
Summary:
Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174

Differential Revision: D14884185

Pulled By: al13n321

fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
2019-04-22 08:20:35 -07:00
Fosco Marotto
6c2bf9e916 Add copyright headers per FB open-source checkup tool. (#5199)
Summary:
internal task: T35568575
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199

Differential Revision: D14962794

Pulled By: gfosco

fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe
2019-04-18 10:55:01 -07:00
Siying Dong
1f7f5a5a79 Run automatic formatter against public header files (#5115)
Summary:
Automatically format public headers so it looks more consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5115

Differential Revision: D14632854

Pulled By: siying

fbshipit-source-id: ce9929ea62f9dcd65c69660b23eed1931cb0ae84
2019-03-27 13:24:25 -07:00
Siying Dong
2b4d5ceb47 Remove some "using std::..." from header files. (#5113)
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113

Differential Revision: D14633030

Pulled By: siying

fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
2019-03-27 10:28:21 -07:00
Siying Dong
0920bf4e68 Revert "Remove PlainTable's feature store_index_in_file (#4914)" (#5034)
Summary:
This reverts commit ee1818081f.

We are not ready to deprecate this feature. revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5034

Differential Revision: D14287246

Pulled By: siying

fbshipit-source-id: e4beafdeaee1c94364fdaa6ba198218d158339f7
2019-03-01 15:45:45 -08:00
Adam Retter
bb474e9a02 Add missing functionality to RocksJava (#4833)
Summary:
This is my latest round of changes to add missing items to RocksJava. More to come in future PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4833

Differential Revision: D14152266

Pulled By: sagar0

fbshipit-source-id: d6cff67e26da06c131491b5cf6911a8cd0db0775
2019-02-22 14:46:46 -08:00
Siying Dong
ee1818081f Remove PlainTable's feature store_index_in_file (#4914)
Summary:
Store_index_in_file is a less useful feature. To simplify the code to maintain, we are dropping the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914

Differential Revision: D13791883

Pulled By: siying

fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e
2019-01-28 12:50:22 -08:00
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
DorianZheng
2670fe8c73 Get CompactionJobInfo from CompactFiles
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716

Differential Revision: D13207677

Pulled By: ajkr

fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b
2018-12-13 14:21:24 -08:00
Huachao Huang
5e72bc113a Add SstFileReader to read sst files (#4717)
Summary:
A user friendly sst file reader is useful when we want to access sst
files outside of RocksDB. For example, we can generate an sst file
with SstFileWriter and send it to other places, then use SstFileReader
to read the file and process the entries in other ways.

Also rename the original SstFileReader to SstFileDumper because of
name conflict, and seems SstFileDumper is more appropriate for tools.

TODO: there is only a very simple test now, because I want to get some feedback first.
If the changes look good, I will add more tests soon.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717

Differential Revision: D13212686

Pulled By: ajkr

fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
2018-11-27 13:02:23 -08:00
Yi Wu
05d9d82181 Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697)
Summary:
…ons (#4676)"

This reverts commit b32d087dbb.

`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697

Differential Revision: D13133490

Pulled By: yiwu-arbug

fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
2018-11-21 11:29:57 -08:00
Siying Dong
b82e57d425 Remove two variables from BlockContents class and don't use class Block for compressed block (#4650)
Summary:
We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed.

The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there.

Some code is slightly refactored so that the cachable information can flow better.

Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650

Differential Revision: D12969070

Pulled By: siying

fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed
2018-11-13 17:02:55 -08:00
Yi Wu
b32d087dbb Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676)
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.

It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676

Differential Revision: D13047662

Pulled By: yiwu-arbug

fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
2018-11-13 13:48:38 -08:00
Sagar Vemuri
dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Bo Hou
cd9404bb77 xxhash 64 support
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607

Reviewed By: siying

Differential Revision: D12836696

Pulled By: jsjhoubo

fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870
2018-11-01 15:44:06 -07:00
Maysam Yabandeh
655ef7d77f Inline doc for format_version 4 (#4350)
Summary:
Fixes #4337
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4350

Differential Revision: D9700871

Pulled By: maysamyabandeh

fbshipit-source-id: fe1e07803783f34588dc14aba66d51117ca4a180
2018-09-07 07:57:30 -07:00
cngzhnp
64324e329e Support pragma once in all header files and cleanup some warnings (#4339)
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.

Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339

Differential Revision: D9654990

Pulled By: ajkr

fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
2018-09-05 18:13:31 -07:00
Fenggang Wu
9d646a6311 Add db_bench options of data block hash index (#4281)
Summary:
Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`.

`--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`;
`--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281

Differential Revision: D9361476

Pulled By: fgwu

fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10
2018-08-16 18:42:46 -07:00
Fenggang Wu
19ec44fd39 Improve point-lookup performance using a data block hash index (#4174)
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`

The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174

Differential Revision: D8965914

Pulled By: fgwu

fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
2018-08-15 14:30:03 -07:00
Fenggang Wu
ee7617167f DataBlockHashIndex: Specify that DataBlockHashIndex is not yet implemented in the comment
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4203

Differential Revision: D9090912

Pulled By: fgwu

fbshipit-source-id: 6a68be83693ddf2a5c060290382141f0d2fb400b
2018-07-31 11:43:08 -07:00
Fenggang Wu
a11df583ec Add DataBlockIndexType option in BlockBasedTableOptions (#4150)
Summary:
Added DataBlockIndexType option in BlockBasedTableOptions.
```
enum DataBlockIndexType : char {
    kDataBlockBinarySearch = 0, // traditional block type
    kDataBlockHashIndex = 1, // additional hash index appended to the end.
};
```
The default type is the traditional binary seek option: `kDataBlockBinarySearch`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4150

Differential Revision: D8895958

Pulled By: fgwu

fbshipit-source-id: 480adef48104cf11d30db3bad9a73f98b4a80c10
2018-07-27 15:42:27 -07:00
Maysam Yabandeh
80ade9ad83 Pin top-level index on partitioned index/filter blocks (#4037)
Summary:
Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037

Differential Revision: D8596218

Pulled By: maysamyabandeh

fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
2018-06-22 15:27:46 -07:00