Commit Graph

17 Commits

Author SHA1 Message Date
Peter Dillinger
ea89c77f27 Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453)
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.

This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)

Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453

Test Plan:
new unit test added. Added memtable Bloom to crash test.

Performance testing
--------------------

Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```

Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)

Before & after run simultaneously, average over 10 iterations per data point, ops/sec.

MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066

MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990

MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before:  12081715
After: 12117603

MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085

MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029

MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591

Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.

Reviewed By: ajkr

Differential Revision: D33805025

Pulled By: pdillinger

fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
2022-01-27 14:55:04 -08:00
Yanqin Jin
b512f4bc76 Batch blob read IO for MultiGet (#8699)
Summary:
In batched `MultiGet()`, RocksDB batches blob read IO and uses `RandomAccessFileReader::MultiRead()`
to read the blobs instead of issuing multiple `Read()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8699

Test Plan:
```
make check
```

Reviewed By: ltamasi

Differential Revision: D31030861

Pulled By: riversand963

fbshipit-source-id: a0df6060cbfd54cff9515a4eee08807b1dbcb0c8
2021-09-17 19:23:13 -07:00
Peter Dillinger
4750421ece Replace most typedef with using= (#8751)
Summary:
Old typedef syntax is confusing

Most but not all changes with

    perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
    make format

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751

Test Plan: existing

Reviewed By: zhichao-cao

Differential Revision: D30745277

Pulled By: pdillinger

fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
2021-09-07 11:31:59 -07:00
Levi Tamasi
1afbd1948c Add initial blob support to batched MultiGet (#7766)
Summary:
The patch adds initial support for reading blobs to the batched `MultiGet` API.
The current implementation simply retrieves the blob values as the blob indexes
are encountered; that is, reads from blob files are currently not batched. (This
will be optimized in a separate phase.) In addition, the patch removes some dead
code related to BlobDB from the batched `MultiGet` implementation, namely the
`is_blob` / `is_blob_index` flags that are passed around in `DBImpl` and `MemTable` /
`MemTableListVersion`. These were never hooked up to anything and wouldn't
work anyways, since a single flag is not sufficient to communicate the "blobness"
of multiple key-values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7766

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D25479290

Pulled By: ltamasi

fbshipit-source-id: 7aba2d290e31876ee592bcf1adfd1018713a8000
2020-12-14 13:48:22 -08:00
Jay Zhuang
881e0dcc09 Fix MultiGet unable to query timestamp data issue (#7589)
Summary:
The filter query key should not contain timestamp. The timestamp is
stripped for Get(), but not MultiGet().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589

Reviewed By: riversand963

Differential Revision: D24494661

Pulled By: jay-zhuang

fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317
2020-11-03 09:45:41 -08:00
Jay Zhuang
1bdaef7a06 Status check enforcement for timestamp_basic_test (#7454)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7454

Reviewed By: riversand963

Differential Revision: D23981719

Pulled By: jay-zhuang

fbshipit-source-id: 01073f73e54c17067b886c4a2f179b2804198399
2020-09-29 18:23:27 -07:00
Akanksha Mahajan
bcefc59e9f Allow MultiGet users to limit cumulative value size (#6826)
Summary:
1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key.
2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826

Test Plan:
1. make check -j64
	   2. Add a new unit test case

Reviewed By: anand1976

Differential Revision: D21471483

Pulled By: akankshamahajan15

fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
2020-05-27 13:07:14 -07:00
Peter Dillinger
bae6f58696 Basic MultiGet support for partitioned filters (#6757)
Summary:
In MultiGet, access each applicable filter partition only once
per batch, rather than for each applicable key. Also,

* Fix Bloom stats for MultiGet
* Fix/refactor MultiGetContext::Range::KeysLeft, including
* Add efficient BitsSetToOne implementation
* Assert that MultiGetContext::Range does not go beyond shift range

Performance test: Generate db:

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true
    ...

Before (middle performing run of three; note some missing Bloom stats):

    $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      26.403 micros/op 597517 ops/sec; (548427 of 671968 found)
    rocksdb.block.cache.filter.hit COUNT : 83443275
    rocksdb.bloom.filter.useful COUNT : 0
    rocksdb.bloom.filter.full.positive COUNT : 0
    rocksdb.bloom.filter.full.true.positive COUNT : 7931450
    rocksdb.number.multiget.get COUNT : 385984
    rocksdb.number.multiget.keys.read COUNT : 12351488
    rocksdb.number.multiget.bytes.read COUNT : 793145000
    rocksdb.number.multiget.keys.found COUNT : 7931450

After (middle performing run of three):

    $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      21.024 micros/op 752963 ops/sec; (705188 of 863968 found)
    rocksdb.block.cache.filter.hit COUNT : 49856682
    rocksdb.bloom.filter.useful COUNT : 45684579
    rocksdb.bloom.filter.full.positive COUNT : 10395458
    rocksdb.bloom.filter.full.true.positive COUNT : 9908456
    rocksdb.number.multiget.get COUNT : 481984
    rocksdb.number.multiget.keys.read COUNT : 15423488
    rocksdb.number.multiget.bytes.read COUNT : 990845600
    rocksdb.number.multiget.keys.found COUNT : 9908456

So that's about 25% higher throughput even for random keys
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757

Test Plan: unit test included

Reviewed By: anand1976

Differential Revision: D21243256

Pulled By: pdillinger

fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
2020-04-28 14:49:34 -07:00
Yanqin Jin
d4398e08fc Fix timestamp support for MultiGet (#6748)
Summary:
1. Avoid nullptr dereference when passing timestamp to KeyContext creation.
2. Construct LookupKey correctly with timestamp when creating MultiGetContext.
3. Compare without timestamp when sorting KeyContexts.

Fixes https://github.com/facebook/rocksdb/issues/6745

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6748

Reviewed By: pdillinger

Differential Revision: D21258691

Pulled By: riversand963

fbshipit-source-id: 44e65b759c18b9986947783edf03be4f890bb004
2020-04-27 22:49:56 -07:00
Huisheng Liu
a6ce5c823b multiget support for timestamps (#6483)
Summary:
Add timestamp support for MultiGet().
timestamp from readoptions is honored, and timestamps can be returned along with values.

MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 17bef7d3a):
  multireadrandom :     104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found)
This PR:
  multireadrandom :     104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found)

.\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483

Reviewed By: anand1976

Differential Revision: D20498373

Pulled By: riversand963

fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627
2020-03-24 11:24:09 -07:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Cheng Chang
152f8a8ffe Remove unnecessary computation of index (#6406)
Summary:
`index` can be replaced by  `iter`, saving the computation of `index++`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6406

Test Plan: make check

Differential Revision: D19905056

Pulled By: cheng-chang

fbshipit-source-id: add4638959c0d2e4e77a11f3fa04ffabaf0de790
2020-02-14 08:26:23 -08:00
anand76
6c7b1a0cc7 Batched MultiGet API for multiple column families (#5816)
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.

As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816

Test Plan:
make check
make asan_check
asan_crash_test

Differential Revision: D18408676

Pulled By: anand1976

fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
2019-11-12 13:52:55 -08:00
Vijay Nadimpalli
4c49e38f15 MultiGet batching in memtable (#5818)
Summary:
RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818

Test Plan:
Existing tests

Performance Test:
Ran the below command which fills up the memtable and makes sure there are no flushes and then call multiget. Ran it on master and on the new change and see atleast 1% performance improvement across all the test runs I did. Sometimes the improvement was upto 5%.

TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10

Differential Revision: D17578869

Pulled By: vjnadimpalli

fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192
2019-10-10 09:39:39 -07:00
anand76
e10570331d Support row cache with batched MultiGet (#5706)
Summary:
This PR adds support for row cache in ```rocksdb::TableCache::MultiGet```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5706

Test Plan:
1. Unit tests in db_basic_test
2. db_bench results with batch size of 2 (```Get``` is faster than ```MultiGet``` for single key) -
Get -
readrandom   :       3.935 micros/op 254116 ops/sec;   28.1 MB/s (22870998 of 22870999 found)
MultiGet -
multireadrandom :       3.743 micros/op 267190 ops/sec; (24047998 of 24047998 found)

Command used -
TEST_TMPDIR=/dev/shm/multiget numactl -C 10  ./db_bench -use_existing_db=true -use_existing_keys=false -benchmarks="readtorowcache,[read|multiread]random" -write_buffer_size=16777216 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -row_cache_size=4194304000 -batch_size=2 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=131072

Differential Revision: D17086297

Pulled By: anand1976

fbshipit-source-id: 85784378da913e05f1baf31ec1b4e7c9345e7f57
2019-08-28 16:11:56 -07:00
Zhongyi Xie
5d27d65bef multiget: fix memory issues due to vector auto resizing (#5279)
Summary:
This PR fixes three memory issues found by ASAN
* in db_stress, the key vector for MultiGet is created using `emplace_back` which could potentially invalidates references to the underlying storage (vector<string>) due to auto resizing. Fix by calling reserve in advance.
* Similar issue in construction of GetContext autovector in version_set.cc
* In multiget_context.h use T[] specialization for unique_ptr that holds a char array
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279

Differential Revision: D15202893

Pulled By: miasantreble

fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d
2019-05-03 15:58:43 -07:00
anand76
fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00