Commit Graph

150 Commits

Author SHA1 Message Date
Zhen Li
9c24a5cb4d Fix persistent cache on windows (#6932)
Summary:
Persistent cache feature caused rocks db crash on windows. I posted a issue for it, https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for persistent cache. Looking repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix is adding "NewId()" function in "persistent_cache" and using it to generate prefix for persistent cache. In this PR, i also re-enable related test cases defined in "db_test2" and "persistent_cache_test" for windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932

Test Plan:
1. run related test cases in "db_test2" and "persistent_cache_test" on windows and see it passed.
2. manually run db_bench.exe with "read_cache_path" and verified.

Reviewed By: riversand963

Differential Revision: D21911608

Pulled By: cheng-chang

fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2
2020-06-13 13:28:31 -07:00
Andrew Kryczka
e6be168aa5 save a key comparison in block seeks (#6646)
Summary:
This saves up to two key comparisons in block seeks. The first key
comparison saved is a redundant key comparison against the restart key
where the linear scan starts. This comparison is saved in all cases
except when the found key is in the first restart interval. The
second key comparison saved is a redundant key comparison against the
restart key where the linear scan ends. This is only saved in cases
where all keys in the restart interval are less than the target
(probability roughly `1/restart_interval`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6646

Test Plan:
ran a benchmark with mostly default settings and counted key comparisons

before: `user_key_comparison_count = 19399529`
after: `user_key_comparison_count = 18431498`

setup command:

```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```

benchmark command:

```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```

Reviewed By: pdillinger

Differential Revision: D20849707

Pulled By: ajkr

fbshipit-source-id: 1f01c5cd99ea771fd27974046e37b194f1cdcfac
2020-06-10 13:58:39 -07:00
Andrew Kryczka
02db03af8d make L0 index/filter pinned memory usage predictable (#6911)
Summary:
Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure
we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911

Test Plan: unit test

Reviewed By: siying

Differential Revision: D21835818

Pulled By: ajkr

fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4
2020-06-09 16:51:23 -07:00
Yanqin Jin
3020df9df5 Remove unnecessary inclusion of version_edit.h in env (#6952)
Summary:
In db_options.c, we should avoid including header files in the `db` directory to avoid introducing unnecessary dependency. The reason why `version_edit.h` has been included in `db_options.cc` is because we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`.

Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952

Reviewed By: zhichao-cao

Differential Revision: D21925341

Pulled By: riversand963

fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2
2020-06-07 21:56:55 -07:00
anand76
98b0cbea88 Check iterator status BlockBasedTableReader::VerifyChecksumInBlocks() (#6909)
Summary:
The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()``` which could be ```false``` either due to reaching the end of the index or, in case of partitioned index, it could be due to a checksum mismatch error when reading a 2nd level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller.

Tests:
Add a test in block_based_table_reader_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909

Reviewed By: pdillinger

Differential Revision: D21833922

Pulled By: anand1976

fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae
2020-06-05 11:08:25 -07:00
sdong
afa3518839 Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923)
Summary:
This reverts commit 8d87e9cea1.

Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923

Reviewed By: pdillinger

Differential Revision: D21864799

fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
2020-06-03 15:55:03 -07:00
Peter Dillinger
9360776cb9 Fix handling of too-small filter partition size (#6905)
Summary:
Because ARM and some other platforms have a larger cache line
size, they have a larger minimum filter size, which causes recently
added PartitionedMultiGet test in db_bloom_filter_test to fail on those
platforms. The code would actually end up using larger partitions,
because keys_per_partition_ would be 0 and never == number of keys
added.

The code now attempts to get as close as possible to the small target
size, while fully utilizing that filter size, if the target partition
size is smaller than the minimum filter size.

Also updated the test to break more uniformly across platforms
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6905

Test Plan: updated test, tested on ARM

Reviewed By: anand1976

Differential Revision: D21840639

Pulled By: pdillinger

fbshipit-source-id: 11684b6d35f43d2e98b85ddb2c8dcfd59d670817
2020-06-03 10:43:01 -07:00
Peter Dillinger
14eca6bf04 For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.

It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.

So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.

Also includes miscellaneous comment improvements / clarifications.

Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784

Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...

    [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
    db/db_test.cc:1562: Failure
    Expected: (size) <= (11 * 100), actual: 9478 vs 1100

Other tests updated to reflect consistent accounting of metadata.

Reviewed By: siying

Differential Revision: D21334706

Pulled By: pdillinger

fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
2020-06-02 12:30:23 -07:00
sdong
298b00a396 Reduce dependency on gtest dependency in release code (#6907)
Summary:
Release code now depends on gtest, indirectly through including "test_util/testharness.h". This creates multiple problems. One important reason is the definition of IGNORE_STATUS_IF_ERROR() in test_util/testharness.h. Move it to sync_point.h instead.
Note that utilities/cassandra/format.h still depends on "test_util/testharness.h". This will be resolved in a separate diff.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6907

Test Plan: Run all existing tests.

Reviewed By: ajkr

Differential Revision: D21829884

fbshipit-source-id: 9253c19ffde2936f3ae68998210f8e54f645a6e6
2020-06-02 12:11:24 -07:00
Adam Retter
8d87e9cea1 Update googletest from 1.8.1 to 1.10.0 (#6808)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808

Reviewed By: anand1976

Differential Revision: D21483984

Pulled By: pdillinger

fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598
2020-06-01 20:33:42 -07:00
anand76
66942e8158 Avoid unnecessary reads of uncompression dictionary in MultiGet (#6906)
Summary:
We may sometimes read the uncompression dictionary when its not
necessary, when we lookup a key in an SST file but the index indicates
the key is not present. This can happen with index_type 3.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6906

Test Plan: make check

Reviewed By: cheng-chang

Differential Revision: D21828944

Pulled By: anand1976

fbshipit-source-id: 7aef4f0a39548d0874eafefd2687006d2652f9bb
2020-06-01 19:43:37 -07:00
Cheng Chang
bcb9e41080 Explicitly free allocated buffer when status is not ok (#6903)
Summary:
Currently we rely on `BlockContents` to implicitly free the allocated scratch buffer, but when IO error happens, it doesn't make sense to construct the `BlockContents` which might be corrupted. In the stress test, we find that `assert(req.result.size() == block_size(handle));` fails because of potential IO errors.

In this PR, we explicitly free the scratch buffer on error without constructing `BlockContents`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6903

Test Plan: watch stress test

Reviewed By: anand1976

Differential Revision: D21823869

Pulled By: cheng-chang

fbshipit-source-id: 5603fc80e9bf3f44a9d7250ddebd871afe1eb89f
2020-06-01 15:19:40 -07:00
Andrew Kryczka
c5abf78bca avoid IterKey::UpdateInternalKey() in BlockIter (#6843)
Summary:
`IterKey::UpdateInternalKey()` is an error-prone API as it's
incompatible with `IterKey::TrimAppend()`, which is used for
decoding delta-encoded internal keys. This PR stops using it in
`BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s
buffer when needed. The logic for safely getting a Slice with global
seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`.
`BinarySeek()` is also migrated to use this API (previously it ignored
global seqno entirely).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843

Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.

```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```

benchmark run command:
```
TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000
```

results:

| DB | code | throughput |
|---|---|---|
| normal_db | master |  267.9 |
| normal_db   |    PR6843 | 254.2 (-5.1%) |
| ingestion_db |   master |  259.6 |
| ingestion_db |   PR6843 | 250.5 (-3.5%) |

Reviewed By: pdillinger

Differential Revision: D21562604

Pulled By: ajkr

fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264
2020-05-28 10:51:30 -07:00
Cheng Chang
82a82c76e7 Fix potential memory leak of scratch buffer (#6879)
Summary:
If `req.scratch` is an internally allocated buffer, but `raw_block_contents` is not constructed to own `req.scratch`, then `req.scratch` will be leaked.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6879

Test Plan: make asan_check

Reviewed By: anand1976

Differential Revision: D21728498

Pulled By: cheng-chang

fbshipit-source-id: 8fc6a4f2543918c565ddc16ecfad1807eb9a42cf
2020-05-26 15:29:04 -07:00
Andrew Kryczka
292bcf6227 skip direct I/O tests in rocksdb lite (#6867)
Summary:
Fix a couple places where direct I/O was used even though it is
unsupported in lite builds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6867

Test Plan: `LITE=1 make check -j48`

Reviewed By: pdillinger

Differential Revision: D21689185

Pulled By: ajkr

fbshipit-source-id: 3eaa3abf69cd7d0bcaabbcad3bb5a26fb8dd7301
2020-05-21 13:57:17 -07:00
mrambacher
38be686160 Add Struct Type to OptionsTypeInfo (#6425)
Summary:
Added code for generically handing structs to OptionTypeInfo.  A struct is a collection of variables handled by their own map of OptionTypeInfos.  Examples of structs include Compaction and Cache options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6425

Reviewed By: siying

Differential Revision: D21668789

Pulled By: zhichao-cao

fbshipit-source-id: 064b110de39dadf82361ed4663f7ac1a535b0b07
2020-05-21 10:58:39 -07:00
Cheng Chang
91b7553293 Enable IO Uring in MultiGet in direct IO mode (#6815)
Summary:
Currently, in direct IO mode, `MultiGet` retrieves the data blocks one by one instead of in parallel, see `BlockBasedTable::RetrieveMultipleBlocks`.

Since direct IO is supported in `RandomAccessFileReader::MultiRead` in https://github.com/facebook/rocksdb/pull/6446, this PR applies `MultiRead` to `MultiGet` so that the data blocks can be retrieved in parallel.

Also, in direct IO mode and when data blocks are compressed and need to uncompressed, this PR only allocates one continuous aligned buffer to hold the data blocks, and then directly uncompress the blocks to insert into block cache, there is no longer intermediate copies to scratch buffers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6815

Test Plan:
1. added a new unit test `BlockBasedTableReaderTest::MultiGet`.
2. existing unit tests and stress tests  contain tests against `MultiGet` in direct IO mode.

Reviewed By: anand1976

Differential Revision: D21426347

Pulled By: cheng-chang

fbshipit-source-id: b8446ae0e74152444ef9111e97f8e402ac31b24f
2020-05-14 23:26:26 -07:00
sdong
4a4b8a1344 sst_dump to reduce number of file reads (#6836)
Summary:
sst_dump can issue many file reads from the file system. This doesn't work well with file systems without a OS cache, especially remote file systems. In order to mitigate this problem, several improvements are done:
1. --readahead_size is added, so that users can specify readahead size when scanning the data.
2. Force a 512KB tail readahead, which prevents three I/Os for footer, meta index and property blocks and hopefully index and filter blocks too.
3. Consoldiate SSTDump's I/Os before opening the file for read. Use the same file prefetch buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836

Test Plan: Add a test that covers this new feature.

Reviewed By: pdillinger

Differential Revision: D21516607

fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
2020-05-12 18:23:33 -07:00
Ziyue Yang
c384c08a4f Add tests for compression failure in BlockBasedTableBuilder (#6709)
Summary:
Currently there is no check for whether BlockBasedTableBuilder will expose
compression error status if compression fails during the table building.
This commit adds fake faulting compressors and a unit test to test such
cases.

This check finds 5 bugs, and this commit also fixes them:

1. Not handling compression failure well in
   BlockBasedTableBuilder::BGWorkWriteRawBlock.
2. verify_compression failing in BlockBasedTableBuilder when used with ZSTD.
3. Wrongly passing the same reference of block contents to
   BlockBasedTableBuilder::CompressAndVerifyBlock in parallel compression.
4. Wrongly setting block_rep->first_key_in_next_block to nullptr in
   BlockBasedTableBuilder::EnterUnbuffered when there are still incoming data
   blocks.
5. Not maintaining variables for compression ratio estimation and first_block
   in BlockBasedTableBuilder::EnterUnbuffered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6709

Reviewed By: ajkr

Differential Revision: D21236254

fbshipit-source-id: 101f6e62b2bac2b7be72be198adf93cd32a1ff46
2020-05-12 09:27:35 -07:00
Peter Dillinger
b27a1448b6 Fix false NotFound from batched MultiGet with kHashSearch (#6821)
Summary:
The error is assigning KeyContext::s to NotFound status in a
table reader for a "not found in this table" case, which skips searching
in later tables, like only a delete should. (The hash search index iterator
is the only one that can return status NotFound even if Valid() == false.)

This was detected by intermittent failure in
MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821

Test Plan: modified existing unit test to reproduce problem

Reviewed By: anand1976

Differential Revision: D21450469

Pulled By: pdillinger

fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c
2020-05-07 15:41:37 -07:00
mrambacher
394f2bbd13 Add OptionTypeInfo::Enum and related methods (#6423)
Summary:
Add methods and constructors for handling enums to the OptionTypeInfo.  This change allows enums to be converted/compared without adding a special "type" to the OptionType.

This change addresses a couple of issues:
- It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods)
- It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes
- It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values
- It reduces the number of types and cases that need to be handled in the various OptionType methods
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423

Reviewed By: siying

Differential Revision: D21408713

Pulled By: zhichao-cao

fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626
2020-05-05 15:04:04 -07:00
sdong
079e50d2ba Disallow BlockBasedTableBuilder to set status from non-OK (#6776)
Summary:
There is no systematic mechanism to prevent BlockBasedTableBuilder's status to be set from non-OK to OK. Adding a mechanism to force this will help us prevent failures in the future.

The solution is to only make it possible to set the status code if the status code to set is not OK.

Since the status code passed to CompressAndVerifyBlock() is changed, a mini refactoring is done too so that the output arguments are changed from reference to pointers, based on Google C++ Style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6776

Test Plan: Run all existing test.

Reviewed By: pdillinger

Differential Revision: D21314382

fbshipit-source-id: 27000c10f1e4c121661e026548d6882066409375
2020-04-30 15:37:03 -07:00
anand76
ab13d43e1d Pass a timeout to FileSystem for random reads (#6751)
Summary:
Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.

For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR.

Tests:
Update existing unit tests to verify the correct timeout value is being passed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751

Reviewed By: riversand963

Differential Revision: D21285631

Pulled By: anand1976

fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
2020-04-30 14:50:39 -07:00
mrambacher
618bf638aa Add Functions to OptionTypeInfo (#6422)
Summary:
Added functions for parsing, serializing, and comparing elements to OptionTypeInfo.  These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map.  Using these functions, every type can be handled via the map rather than special cased.

By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422

Test Plan: pass make check

Reviewed By: siying

Differential Revision: D21269005

Pulled By: zhichao-cao

fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077
2020-04-28 18:04:26 -07:00
Peter Dillinger
bae6f58696 Basic MultiGet support for partitioned filters (#6757)
Summary:
In MultiGet, access each applicable filter partition only once
per batch, rather than for each applicable key. Also,

* Fix Bloom stats for MultiGet
* Fix/refactor MultiGetContext::Range::KeysLeft, including
* Add efficient BitsSetToOne implementation
* Assert that MultiGetContext::Range does not go beyond shift range

Performance test: Generate db:

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true
    ...

Before (middle performing run of three; note some missing Bloom stats):

    $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      26.403 micros/op 597517 ops/sec; (548427 of 671968 found)
    rocksdb.block.cache.filter.hit COUNT : 83443275
    rocksdb.bloom.filter.useful COUNT : 0
    rocksdb.bloom.filter.full.positive COUNT : 0
    rocksdb.bloom.filter.full.true.positive COUNT : 7931450
    rocksdb.number.multiget.get COUNT : 385984
    rocksdb.number.multiget.keys.read COUNT : 12351488
    rocksdb.number.multiget.bytes.read COUNT : 793145000
    rocksdb.number.multiget.keys.found COUNT : 7931450

After (middle performing run of three):

    $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      21.024 micros/op 752963 ops/sec; (705188 of 863968 found)
    rocksdb.block.cache.filter.hit COUNT : 49856682
    rocksdb.bloom.filter.useful COUNT : 45684579
    rocksdb.bloom.filter.full.positive COUNT : 10395458
    rocksdb.bloom.filter.full.true.positive COUNT : 9908456
    rocksdb.number.multiget.get COUNT : 481984
    rocksdb.number.multiget.keys.read COUNT : 15423488
    rocksdb.number.multiget.bytes.read COUNT : 990845600
    rocksdb.number.multiget.keys.found COUNT : 9908456

So that's about 25% higher throughput even for random keys
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757

Test Plan: unit test included

Reviewed By: anand1976

Differential Revision: D21243256

Pulled By: pdillinger

fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
2020-04-28 14:49:34 -07:00
Peter Dillinger
249eff0f30 Stats for redundant insertions into block cache (#6681)
Summary:
Since read threads do not coordinate on loading data into block
cache, two threads between Lookup and Insert can end up loading and
inserting the same data. This is particularly concerning with
cache_index_and_filter_blocks since those are hot and more likely to
be race targets if ejected from (or not pre-populated in) the cache.

Particularly with moves toward disaggregated / network storage, the cost
of redundant retrieval might be high, and we should at least have some
hard statistics from which we can estimate impact.

Example with full filter thrashing "cliff":

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
    ...
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 14181
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 476
    rocksdb.block.cache.data.add COUNT : 12749
    rocksdb.block.cache.data.add.redundant COUNT : 18
    rocksdb.block.cache.filter.add COUNT : 1003
    rocksdb.block.cache.filter.add.redundant COUNT : 217
    rocksdb.block.cache.index.add COUNT : 429
    rocksdb.block.cache.index.add.redundant COUNT : 241
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 1182223
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 302728
    rocksdb.block.cache.data.add COUNT : 31425
    rocksdb.block.cache.data.add.redundant COUNT : 12
    rocksdb.block.cache.filter.add COUNT : 795455
    rocksdb.block.cache.filter.add.redundant COUNT : 130238
    rocksdb.block.cache.index.add COUNT : 355343
    rocksdb.block.cache.index.add.redundant COUNT : 172478
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681

Test Plan: Some manual testing (above) and unit test covering key metrics is included

Reviewed By: ltamasi

Differential Revision: D21134113

Pulled By: pdillinger

fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
2020-04-27 13:20:27 -07:00
anand76
9e7b7e2c08 Silence false alarms in db_stress fault injection (#6741)
Summary:
False alarms are caused by codepaths that intentionally swallow IO
errors.

Tests:
make crash_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6741

Reviewed By: ltamasi

Differential Revision: D21181138

Pulled By: anand1976

fbshipit-source-id: 5ccfbc68eb192033488de6269e59c00f2c65ce00
2020-04-24 13:06:12 -07:00
Ibrahim Jarif
ae77880223 Fix some typos in code comments (#6733)
Summary:
This PR fixes some typos in code comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6733

Reviewed By: siying

Differential Revision: D21209037

Pulled By: zhichao-cao

fbshipit-source-id: d9274611fab1f5e992998c8c4117b8078c4cbc69
2020-04-23 12:28:49 -07:00
mrambacher
4cbc19d2a1 Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389)
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings.  There is a mishmash of parameters in use here with more needed in the future.  This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389

Reviewed By: siying

Differential Revision: D21163707

Pulled By: zhichao-cao

fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
2020-04-21 17:38:17 -07:00
Andrew Kryczka
f9155a3404 Prevent uninitialized load in IndexBlockIter (#6736)
Summary:
When index block is empty or an error happens while reading it,
`Invalidate()` is called rather than `Initialize()`. So `Seek()` must
not refer to member variables that are only initialized in
`Initialize()` until it is sure `Initialize()` has been called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6736

Reviewed By: siying

Differential Revision: D21139641

Pulled By: ajkr

fbshipit-source-id: 71c58cc1adbd795dc3729dd5023bf7df1515ff32
2020-04-20 16:32:43 -07:00
Peter Dillinger
31da5e34c1 C++20 compatibility (#6697)
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:

* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697

Test Plan: make check and CI

Reviewed By: cheng-chang

Differential Revision: D21020318

Pulled By: pdillinger

fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
2020-04-20 13:24:25 -07:00
Mike Kolupaev
e45673dece Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:40:44 -07:00
Ziyue Yang
41563b61db Fix data racing of BlockBasedTableBuilder::ParallelCompressionRep::first_block (#6640)
Summary:
BlockBasedTableBuilder::ParallelCompressionRep::first_block can be read in
Flush() and written in BGWorkWriteRawBlock() concurrently. This commit fixes
the issue by reading first_block out before pushing the block to compression
and write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6640

Test Plan: Run all tests concurrently with TSAN.

Reviewed By: cheng-chang

Differential Revision: D20851370

fbshipit-source-id: 6f039222e8319d31e15f1b45e05c106527253f72
2020-04-13 16:24:57 -07:00
Andrew Kryczka
9eca6d651d fix comparison count for format_version=3 indexes (#6650)
Summary:
In index blocks since `format_version=3`, user keys are written
rather than internal keys. When reading such blocks, the comparator is
obtained via `InternalKeyComparator::user_comparator()`. That function
must not return an unwrapped result as the wrapper class provides
accounting logic to populate `PerfContext::user_key_comparison_count`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6650

Test Plan:
ran db_bench and verified
`PerfContext::user_key_comparison_count` became larger.

Reviewed By: cheng-chang

Differential Revision: D20866325

Pulled By: ajkr

fbshipit-source-id: ad755d46bda31157dacc5b66e532279f19ad538c
2020-04-13 11:18:37 -07:00
anand76
79c838eb0f Fix a few bugs in db_stress fault injection (#6693)
Summary:
Fix the following issues -
1. Output parsing error in db_crashtest.py
2. Memory leak on exit
3. False alarm on filter block read error
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6693

Test Plan: asan_crash

Reviewed By: cheng-chang

Differential Revision: D20990399

Pulled By: anand1976

fbshipit-source-id: 178ee0dd7c69a4bc5db698379db0dedb29281699
2020-04-13 11:01:03 -07:00
anand76
5c19a441c4 Fault injection in db_stress (#6538)
Summary:
This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterator as well once its proven stable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538

Test Plan:
crash_test
make check

Reviewed By: riversand963

Differential Revision: D20714347

Pulled By: anand1976

fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7
2020-04-10 17:21:26 -07:00
anand76
d600e5b0eb Fix a Centos build failure reported in #6651 (#6656)
Summary:
Fixes issue https://github.com/facebook/rocksdb/issues/6651

Tests:
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6656

Reviewed By: cheng-chang

Differential Revision: D20879084

Pulled By: anand1976

fbshipit-source-id: c2cc508ca2716fcf80dcf9d2ba31c32d211f941e
2020-04-10 11:47:46 -07:00
Yi Wu
eb287c72d7 Fix wrong key being read on ingested file with global seqno and delta encoding (#6669)
Summary:
On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key.

The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669

Test Plan:
New unit test

Signed-off-by: Yi Wu <yiwu@pingcap.com>

Reviewed By: cheng-chang

Differential Revision: D20931808

Pulled By: ajkr

fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
2020-04-08 21:22:15 -07:00
sdong
00f8016b36 Fix clang anaylze warning caused by #6262 (#6641)
Summary:
https://github.com/facebook/rocksdb/pull/6262 causes CLANG analyze to complain. Add assertion to suppress the warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6641

Test Plan: Run "clang analyze" and make sure it passes.

Reviewed By: anand1976

Differential Revision: D20841722

fbshipit-source-id: 5fa6e0c5cfe7a822214c9b898a408df59d4fd2cd
2020-04-03 15:47:51 -07:00
mrambacher
259b6ec8da Move the OptionTypeMap code closer to home (#6198)
Summary:
This is a predecessor to the Configurable PR.  This change moves the OptionTypeInfo maps closer to where they will be used.

When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198

Reviewed By: siying

Differential Revision: D20778108

Pulled By: zhichao-cao

fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa
2020-04-03 10:52:38 -07:00
sdong
d0f3894cf1 In block based table builder, make variables for estimating file size atomic (#6636)
Summary:
With https://github.com/facebook/rocksdb/issues/6262, TSAN complains about data race of some variables. Those variables are used to estimate file size and are accessed in writer and background threads. Since file size estimation doesn't have to be 100% accurate, we make some variables atomic and use relaxed memory order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6636

Test Plan: Run all tests with TSAN.

Reviewed By: anand1976

Differential Revision: D20820635

fbshipit-source-id: 1ea45ff38be15e33674ffe06b7d42fc9fe161ea5
2020-04-02 16:16:24 -07:00
Ziyue Yang
8088482dd6 Fix a division by zero after #6262 (#6633)
Summary:
With https://github.com/facebook/rocksdb/issues/6262, UBSAN fails with "division by zero":

[ RUN      ] Timestamp/DBBasicTestWithTimestampCompressionSettings.PutAndGetWithCompaction/3
internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:1066:39: runtime error: division by zero
    #0 0x7ffb3117b071 in rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:1066
    https://github.com/facebook/rocksdb/issues/1 0x7ffb311775e1 in rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:848
    https://github.com/facebook/rocksdb/issues/2 0x7ffb311771a2 in rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:832

This is caused by not returning immediately after CompressAndVerifyBlock call
in WriteBlock when rep_->status == kBuffered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6633

Test Plan: Run all existing test.

Reviewed By: anand1976

Differential Revision: D20808366

fbshipit-source-id: 09f24b7c0fbaf4c7a8fc48cac61fa6fcb9b85811
2020-04-02 11:57:05 -07:00
Ziyue Yang
03a781a90c Add pipelined & parallel compression optimization (#6262)
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262

Reviewed By: ajkr

Differential Revision: D20651306

fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
2020-04-01 16:40:18 -07:00
Levi Tamasi
e6f86cfb36 Revert the recent cache deleter change (#6620)
Summary:
Revert "Use function objects as deleters in the block cache (https://github.com/facebook/rocksdb/issues/6545)"

    This reverts commit 6301dbe7a7.

    Revert "Call out the cache deleter related interface change in HISTORY.md (https://github.com/facebook/rocksdb/issues/6606)"

    This reverts commit 3a35542f86.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6620

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D20773311

Pulled By: ltamasi

fbshipit-source-id: 7637a761f718f323ef0e7da959462e8fb06e7a2b
2020-03-31 16:11:06 -07:00
Zhichao Cao
e8d332d97e Use FileChecksumGenFactory for SST file checksum (#6600)
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
2020-03-29 15:58:46 -07:00
Zhichao Cao
4246888101 Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 16:04:43 -07:00
Levi Tamasi
6301dbe7a7 Use function objects as deleters in the block cache (#6545)
Summary:
As the first step of reintroducing eviction statistics for the block
cache, the patch switches from using simple function pointers as deleters
to function objects implementing an interface. This will enable using
deleters that have state, like a smart pointer to the statistics object
that is to be updated when an entry is removed from the cache. For now,
the patch adds a deleter template class `SimpleDeleter`, which simply
casts the `value` pointer to its original type and calls `delete` or
`delete[]` on it as appropriate. Note: to prevent object lifecycle
issues, deleters must outlive the cache entries referring to them;
`SimpleDeleter` ensures this by using the ("leaky") Meyers singleton
pattern.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545

Test Plan: `make asan_check`

Reviewed By: siying

Differential Revision: D20475823

Pulled By: ltamasi

fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a
2020-03-26 16:19:58 -07:00
Mike Kolupaev
963af52f15 Fix iterator reading filter block despite read_tier == kBlockCacheTier (#6562)
Summary:
We're seeing iterators with `ReadOptions::read_tier == kBlockCacheTier` sometimes doing file reads. Stack trace:

```
rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*, bool) const
rocksdb::BlockFetcher::ReadBlockContents()
rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const
rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool) const
rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)
rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const
rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const
rocksdb::FullFilterBlockReader::RangeMayExist(rocksdb::Slice const*, rocksdb::Slice const&, rocksdb::SliceTransform const*, rocksdb::Comparator const*, rocksdb::Slice const*, bool*, bool, rocksdb::BlockCacheLookupContext*)
rocksdb::BlockBasedTable::PrefixMayMatch(rocksdb::Slice const&, rocksdb::ReadOptions const&, rocksdb::SliceTransform const*, bool, rocksdb::BlockCacheLookupContext*) const
rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::SeekImpl(rocksdb::Slice const*)
rocksdb::ForwardIterator::SeekInternal(rocksdb::Slice const&, bool)
rocksdb::DBIter::Seek(rocksdb::Slice const&)
```

`BlockBasedTableIterator::CheckPrefixMayMatch` was missing a check for `kBlockCacheTier`. This PR adds it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6562

Test Plan: deployed it to a logdevice test cluster and looked at logdevice's IO tracing.

Reviewed By: siying

Differential Revision: D20529368

Pulled By: al13n321

fbshipit-source-id: 65bf33964b1951464415c900336635fb20919611
2020-03-26 15:21:26 -07:00
Cheng Chang
4fc216649d Support direct IO in RandomAccessFileReader::MultiRead (#6446)
Summary:
By supporting direct IO in RandomAccessFileReader::MultiRead, the benefits of parallel IO (IO uring) and direct IO can be combined.

In direct IO mode, read requests are aligned and merged together before being issued to RandomAccessFile::MultiRead, so blocks in the original requests might share the same underlying buffer, the shared buffers are returned in `aligned_bufs`, which is a new parameter of the `MultiRead` API.

For example, suppose alignment requirement for direct IO is 4KB, one request is (offset: 1KB, len: 1KB), another request is (offset: 3KB, len: 1KB), then since they all belong to page (offset: 0, len: 4KB), `MultiRead` only reads the page with direct IO into a buffer on heap, and returns 2 Slices referencing regions in that same buffer. See `random_access_file_reader_test.cc` for more examples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6446

Test Plan: Added a new test `random_access_file_reader_test.cc`.

Reviewed By: anand1976

Differential Revision: D20097518

Pulled By: cheng-chang

fbshipit-source-id: ca48a8faf9c3af146465c102ef6b266a363e78d1
2020-03-20 16:33:26 -07:00
sdong
712bc4b6a2 Fix regression bug in partitioned index reseek caused by #6531 (#6551)
Summary:
https://github.com/facebook/rocksdb/pull/6531 removed some code in partitioned index seek logic. By mistake the logic of storing previous index offset is removed, while the logic of using it is preserved, so that the code might use wrong value to determine reseeking condition.
This will trigger a bug, if following a Seek() not going to the last block, SeekToLast() is called, and then Seek() is called which should position the cursor to the block before SeekToLast().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6551

Test Plan: Add a unit test that reproduces the bug. In the same unit test, also some reseek cases are covered to avoid regression.

Reviewed By: pdillinger

Differential Revision: D20493990

fbshipit-source-id: 3919aa4861c0481ec96844e053048da1a934b91d
2020-03-17 12:33:10 -07:00