Commit Graph

742 Commits

Author SHA1 Message Date
Zitan Chen
119b26fac0 Implement a new subcommand "identify" for sst_dump (#6943)
Summary:
Implemented a subcommand of sst_dump called identify, which determines whether a file is an SST file or identifies and lists all the SST files in a directory;

This update also fixes the problem that sst_dump exits with a success state even if target file/directory does not exist/is not an SST file/is empty/is corrupted.

One test is added to sst_dump_test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6943

Test Plan: Passed make check and a few manual tests

Reviewed By: pdillinger

Differential Revision: D21928985

Pulled By: gg814

fbshipit-source-id: 9a8b48e0cf1a0e96b13f42b690aba8ad981afad3
2020-06-08 13:58:28 -07:00
anand76
98b0cbea88 Check iterator status BlockBasedTableReader::VerifyChecksumInBlocks() (#6909)
Summary:
The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()``` which could be ```false``` either due to reaching the end of the index or, in case of partitioned index, it could be due to a checksum mismatch error when reading a 2nd level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller.

Tests:
Add a test in block_based_table_reader_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909

Reviewed By: pdillinger

Differential Revision: D21833922

Pulled By: anand1976

fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae
2020-06-05 11:08:25 -07:00
Yanqin Jin
a8170d774c Close file to avoid file-descriptor leakage (#6936)
Summary:
When operation on an open file descriptor fails, we should close the file descriptor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6936

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D21885458

Pulled By: riversand963

fbshipit-source-id: ba077a76b256a8537f21e22e4ec198f45390bf50
2020-06-04 14:21:15 -07:00
Zitan Chen
02df00d97b API change: DB::OpenForReadOnly will not write to the file system unless create_if_missing is true (#6900)
Summary:
DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true.

This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly.

Two tests for this updated behavior of DB::OpenForReadOnly are also added.

Other minor changes:
1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly;
2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing];
3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900

Test Plan: passed make check; also manually tested a few ldb subcommands

Reviewed By: pdillinger

Differential Revision: D21822188

Pulled By: gg814

fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e
2020-06-03 18:57:49 -07:00
Levi Tamasi
0b8c549b3f Mention the consistency check improvement in HISTORY.md (#6924)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6924

Reviewed By: cheng-chang

Differential Revision: D21865662

Pulled By: ltamasi

fbshipit-source-id: 83a01bcbb779cfba941154a36a9e735293a93211
2020-06-03 13:40:41 -07:00
Peter Dillinger
14eca6bf04 For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.

It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.

So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.

Also includes miscellaneous comment improvements / clarifications.

Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784

Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...

    [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
    db/db_test.cc:1562: Failure
    Expected: (size) <= (11 * 100), actual: 9478 vs 1100

Other tests updated to reflect consistent accounting of metadata.

Reviewed By: siying

Differential Revision: D21334706

Pulled By: pdillinger

fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
2020-06-02 12:30:23 -07:00
Andrew Kryczka
c5abf78bca avoid IterKey::UpdateInternalKey() in BlockIter (#6843)
Summary:
`IterKey::UpdateInternalKey()` is an error-prone API as it's
incompatible with `IterKey::TrimAppend()`, which is used for
decoding delta-encoded internal keys. This PR stops using it in
`BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s
buffer when needed. The logic for safely getting a Slice with global
seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`.
`BinarySeek()` is also migrated to use this API (previously it ignored
global seqno entirely).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843

Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.

```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```

benchmark run command:
```
TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000
```

results:

| DB | code | throughput |
|---|---|---|
| normal_db | master |  267.9 |
| normal_db   |    PR6843 | 254.2 (-5.1%) |
| ingestion_db |   master |  259.6 |
| ingestion_db |   PR6843 | 250.5 (-3.5%) |

Reviewed By: pdillinger

Differential Revision: D21562604

Pulled By: ajkr

fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264
2020-05-28 10:51:30 -07:00
Akanksha Mahajan
bcefc59e9f Allow MultiGet users to limit cumulative value size (#6826)
Summary:
1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key.
2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826

Test Plan:
1. make check -j64
	   2. Add a new unit test case

Reviewed By: anand1976

Differential Revision: D21471483

Pulled By: akankshamahajan15

fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
2020-05-27 13:07:14 -07:00
Zhichao Cao
545e14b53b Generate file checksum in SstFileWriter (#6859)
Summary:
If Option.file_checksum_gen_factory is set, rocksdb generates the file checksum during flush and compaction based on the checksum generator created by the factory and store the checksum and function name in vstorage and Manifest.

This PR enable file checksum generation in SstFileWrite and store the checksum and checksum function name in the  ExternalSstFileInfo, such that application can use them for other purpose, for example, ingest the file checksum with files in IngestExternalFile().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6859

Test Plan: add unit test and pass make asan_check.

Reviewed By: ajkr

Differential Revision: D21656247

Pulled By: zhichao-cao

fbshipit-source-id: 78a3570c76031d8832e3d2de3d6c79cdf2b675d0
2020-05-20 11:55:31 -07:00
sdong
4a4b8a1344 sst_dump to reduce number of file reads (#6836)
Summary:
sst_dump can issue many file reads from the file system. This doesn't work well with file systems without a OS cache, especially remote file systems. In order to mitigate this problem, several improvements are done:
1. --readahead_size is added, so that users can specify readahead size when scanning the data.
2. Force a 512KB tail readahead, which prevents three I/Os for footer, meta index and property blocks and hopefully index and filter blocks too.
3. Consoldiate SSTDump's I/Os before opening the file for read. Use the same file prefetch buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836

Test Plan: Add a test that covers this new feature.

Reviewed By: pdillinger

Differential Revision: D21516607

fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
2020-05-12 18:23:33 -07:00
sdong
a50ea71c00 Improve ldb consistency checks (#6802)
Summary:
When using ldb, users cannot turn on force consistency check in most commands, while they cannot use checksonsistnecy with --try_load_options. The change fixes both by:
1. checkconsistency now calls OpenDB() so that it gets all the options loading and sanitized options logic
2. use options.check_consistency_checks = true by default, and add a --disable_consistency_checks to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802

Test Plan: Add a new unit test. Some manual tests with corrupted DBs.

Reviewed By: pdillinger

Differential Revision: D21388051

fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
2020-05-08 14:17:47 -07:00
Yanqin Jin
e72e2167fd Fix a few bugs in best-efforts recovery (#6824)
Summary:
1. Update column_family_memtables_ to point to latest column_family_set in
   version_set after recovery.
2. Normalize file paths passed by application so that directories end with '/'
   or '\\'.
3. In addition to missing files, corrupted files are also ignored in
   best-efforts recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6824

Test Plan: COMPILE_WITH_ASAN=1 make check

Reviewed By: anand1976

Differential Revision: D21463905

Pulled By: riversand963

fbshipit-source-id: c48db8843cc93c8c1c7139c474b64e6f775307d2
2020-05-08 13:01:42 -07:00
anand76
94265234de Fix race due to delete triggered compaction in Universal compaction mode (#6799)
Summary:
Delete triggered compaction in universal compaction mode was causing a corruption when scheduled in parallel with other compactions.
1. When num_levels = 1, a file marked for compaction may be picked along with all older files in L0, without checking if any of them are already being compaction. This can cause unpredictable results like resurrection of older versions of keys or deleted keys.
2. When num_levels > 1, a delete triggered compaction would not get scheduled if it overlaps with a running regular compaction. However, the reverse is not true. This is due to the fact that in ```UniversalCompactionBuilder::CalculateSortedRuns```, it assumes that entire sorted runs are picked for compaction and only checks the first file in a sorted run to determine conflicts. This is violated by a delete triggered compaction as it works on a subset of a sorted run.

Fix the bug for num_levels > 1, and disable the feature for now when num_levels = 1. After disabling this feature, files would still get marked for compaction, but no compaction would get scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6799

Reviewed By: pdillinger

Differential Revision: D21431286

Pulled By: anand1976

fbshipit-source-id: ae9f0bdb1d6ae2f10284847db731c23f43af164a
2020-05-07 17:32:17 -07:00
Andrew Kryczka
3730b05dc9 Fixup HISTORY.md for e9ba4ba "validate range tombstone covers positiv… (#6825)
Summary:
…e range"

Moved it from the wrong section (6.10) to the right section (Unreleased).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6825

Reviewed By: zhichao-cao

Differential Revision: D21464577

Pulled By: ajkr

fbshipit-source-id: a836b4ab10be2464182826f9411c9c424c933b70
2020-05-07 16:40:17 -07:00
Peter Dillinger
b27a1448b6 Fix false NotFound from batched MultiGet with kHashSearch (#6821)
Summary:
The error is assigning KeyContext::s to NotFound status in a
table reader for a "not found in this table" case, which skips searching
in later tables, like only a delete should. (The hash search index iterator
is the only one that can return status NotFound even if Valid() == false.)

This was detected by intermittent failure in
MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821

Test Plan: modified existing unit test to reproduce problem

Reviewed By: anand1976

Differential Revision: D21450469

Pulled By: pdillinger

fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c
2020-05-07 15:41:37 -07:00
Andrew Kryczka
e9ba4ba348 validate range tombstone covers positive range (#6788)
Summary:
We found some files containing nothing but negative range tombstones,
and unsurprisingly their metadata specified a negative range, which made
things crash. Time to add a bit of user input validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6788

Reviewed By: zhichao-cao

Differential Revision: D21343719

Pulled By: ajkr

fbshipit-source-id: f1c16e4c3e9fa150958c8c866176632a3206fb74
2020-05-07 11:55:30 -07:00
anand76
c1e1185b7a Update release version to 6.10 (#6797)
Summary:
Update HISTORY.md and version.h to 6.10.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6797

Reviewed By: zhichao-cao

Differential Revision: D21371390

Pulled By: anand1976

fbshipit-source-id: 6017bca24fc5d12076d1ddaec7783c9b85712d42
2020-05-06 16:42:37 -07:00
Levi Tamasi
06c3b85b9a Disallow using the base DB's storage directory as blob_dir in BlobDB (#6810)
Summary:
https://github.com/facebook/rocksdb/pull/6807 extends the logic that
identifies and purges obsolete files to blob files handled by RocksDB
itself. In order to prevent that from interfering with the current BlobDB code,
we need to make sure that `BlobDBOptions::blob_dir` is different from
the storage directories used by the base DB. (Note: this is true by default.)
The patch adds a check that explicitly disallows this configuration and
returns `Status::NotSupported` from `BlobDB::Open` in such cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6810

Test Plan: Tested using the BlobDB mode of `db_bench`.

Reviewed By: riversand963

Differential Revision: D21412676

Pulled By: ltamasi

fbshipit-source-id: 6630cc7481e48c8bf55d59423b25f14d52ffe681
2020-05-06 14:00:46 -07:00
Yanqin Jin
5a61e7864d Fix db_stress when GetLiveFiles() flushes dropped CF (#6805)
Summary:
Current impl. of db_stress will abort verification and report failure if
GetLiveFiles() causes a dropped column family to be flushed. This is not
desired.
To fix, this PR makes the following change:
In GetLiveFiles, if flush is triggered and returns
Status::IsColumnFamilyDropped(), then set status to Status::OK().
This is OK because dropped column families will be skipped during the rest of
this function, and valid column families will have their live files returned to
caller.

Test plan (dev server):
make check
./db_stress -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
./db_stress -disable_wal=1 -reopen=0 -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6805

Reviewed By: ltamasi

Differential Revision: D21390044

Pulled By: riversand963

fbshipit-source-id: de67846b95a4f1b88aa0a30c3d70c43cc68625b9
2020-05-04 17:45:49 -07:00
sdong
680c416348 Avoid Swallowing Some File Consistency Checking Bugs (#6793)
Summary:
We are swallowing some file consistency checking failures. This is not expected. We are fixing two cases: DB reopen and manifest dump.
More places are not fixed and need follow-up.

Error from CheckConsistencyForDeletes() is also swallowed, which is not fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6793

Test Plan: Add a unit test to cover the reopen case.

Reviewed By: riversand963

Differential Revision: D21366525

fbshipit-source-id: eb438a322237814e8d5125f916a3c6de97f39ded
2020-05-04 14:18:11 -07:00
sdong
6277e28039 Flag CompressionOptions::parallel_threads to be experimental (#6781)
Summary:
The feature of CompressionOptions::parallel_threads is still not yet mature. Mention it to be experimental in the comments for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6781

Reviewed By: pdillinger

Differential Revision: D21330678

fbshipit-source-id: d7dd7d099fb002a5c6a5d8da689ce5ee08a9eb13
2020-04-30 15:22:06 -07:00
Peter Dillinger
bae6f58696 Basic MultiGet support for partitioned filters (#6757)
Summary:
In MultiGet, access each applicable filter partition only once
per batch, rather than for each applicable key. Also,

* Fix Bloom stats for MultiGet
* Fix/refactor MultiGetContext::Range::KeysLeft, including
* Add efficient BitsSetToOne implementation
* Assert that MultiGetContext::Range does not go beyond shift range

Performance test: Generate db:

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true
    ...

Before (middle performing run of three; note some missing Bloom stats):

    $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      26.403 micros/op 597517 ops/sec; (548427 of 671968 found)
    rocksdb.block.cache.filter.hit COUNT : 83443275
    rocksdb.bloom.filter.useful COUNT : 0
    rocksdb.bloom.filter.full.positive COUNT : 0
    rocksdb.bloom.filter.full.true.positive COUNT : 7931450
    rocksdb.number.multiget.get COUNT : 385984
    rocksdb.number.multiget.keys.read COUNT : 12351488
    rocksdb.number.multiget.bytes.read COUNT : 793145000
    rocksdb.number.multiget.keys.found COUNT : 7931450

After (middle performing run of three):

    $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      21.024 micros/op 752963 ops/sec; (705188 of 863968 found)
    rocksdb.block.cache.filter.hit COUNT : 49856682
    rocksdb.bloom.filter.useful COUNT : 45684579
    rocksdb.bloom.filter.full.positive COUNT : 10395458
    rocksdb.bloom.filter.full.true.positive COUNT : 9908456
    rocksdb.number.multiget.get COUNT : 481984
    rocksdb.number.multiget.keys.read COUNT : 15423488
    rocksdb.number.multiget.bytes.read COUNT : 990845600
    rocksdb.number.multiget.keys.found COUNT : 9908456

So that's about 25% higher throughput even for random keys
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757

Test Plan: unit test included

Reviewed By: anand1976

Differential Revision: D21243256

Pulled By: pdillinger

fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
2020-04-28 14:49:34 -07:00
Peter Dillinger
a7f0b27b39 HISTORY.md update for bzip upgrade (#6767)
Summary:
See https://github.com/facebook/rocksdb/issues/6714 and https://github.com/facebook/rocksdb/issues/6703
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6767

Reviewed By: riversand963

Differential Revision: D21283307

Pulled By: pdillinger

fbshipit-source-id: 8463bec725669d13846c728ad4b5bde43f9a84f8
2020-04-28 12:29:31 -07:00
Peter Dillinger
4574d7513d Update HISTORY.md for block cache redundant adds (#6764)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6764

Reviewed By: ltamasi

Differential Revision: D21267108

Pulled By: pdillinger

fbshipit-source-id: a3dfe2dbe4e8f6309a53eb72903ef58d52308f97
2020-04-28 08:26:43 -07:00
Yanqin Jin
d4398e08fc Fix timestamp support for MultiGet (#6748)
Summary:
1. Avoid nullptr dereference when passing timestamp to KeyContext creation.
2. Construct LookupKey correctly with timestamp when creating MultiGetContext.
3. Compare without timestamp when sorting KeyContexts.

Fixes https://github.com/facebook/rocksdb/issues/6745

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6748

Reviewed By: pdillinger

Differential Revision: D21258691

Pulled By: riversand963

fbshipit-source-id: 44e65b759c18b9986947783edf03be4f890bb004
2020-04-27 22:49:56 -07:00
Levi Tamasi
bea91d5d61 Destroy any ColumnFamilyHandles in BlobDB::Open upon error (#6763)
Summary:
If an error happens during BlobDBImpl::Open after the base DB has been
opened, we need to destroy the `ColumnFamilyHandle`s returned by `DB::Open`
to prevent an assertion in `ColumnFamilySet`'s destructor from being hit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6763

Test Plan: Ran `make check` and tested using the BlobDB mode of `db_bench`.

Reviewed By: riversand963

Differential Revision: D21262643

Pulled By: ltamasi

fbshipit-source-id: 60ebc7ab19be66cf37fbe5f6d8957d58470f3d3b
2020-04-27 16:45:13 -07:00
Akanksha Mahajan
75b13ea94a Allow sst_dump to check size of different compression levels and report time (#6634)
Summary:
Summary : 1. Add two arguments --compression_level_from and --compression_level_to to check
	  the compression size with different compression level in the given range. Users must
          specify one compression type else it will error out. Both from and to levels must
	  also be specified together.
	  2. Display the time taken to compress each file with different compressions by default.

Test Plan : make -j64 check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6634

Test Plan: make -j64 check

Reviewed By: anand1976

Differential Revision: D20810282

Pulled By: akankshamahajan15

fbshipit-source-id: ac9098d3c079a1fad098f6678dbedb4d888a791b
2020-04-27 12:36:16 -07:00
Cheng Chang
40497a875a Reduce memory copies when fetching and uncompressing blocks from SST files (#6689)
Summary:
In https://github.com/facebook/rocksdb/pull/6455, we modified the interface of `RandomAccessFileReader::Read` to be able to get rid of memcpy in direct IO mode.
This PR applies the new interface to `BlockFetcher` when reading blocks from SST files in direct IO mode.

Without this PR, in direct IO mode, when fetching and uncompressing compressed blocks, `BlockFetcher` will first copy the raw compressed block into `BlockFetcher::compressed_buf_` or `BlockFetcher::stack_buf_` inside `RandomAccessFileReader::Read` depending on the block size. then during uncompressing, it will copy the uncompressed block into `BlockFetcher::heap_buf_`.

In this PR, we get rid of the first memcpy and directly uncompress the block from `direct_io_buf_` to `heap_buf_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6689

Test Plan: A new unit test `block_fetcher_test` is added.

Reviewed By: anand1976

Differential Revision: D21006729

Pulled By: cheng-chang

fbshipit-source-id: 2370b92c24075692423b81277415feb2aed5d980
2020-04-24 15:32:56 -07:00
Yanqin Jin
e04f3bce4f Update CURRENT file after best-efforts recovery (#6746)
Summary:
After a successful recovery, the CURRENT file should be updated to point to the valid MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6746

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D21189876

Pulled By: riversand963

fbshipit-source-id: 7537b49988c5c425ebe9505a5cc260de351ad79b
2020-04-23 16:21:09 -07:00
mrambacher
4cbc19d2a1 Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389)
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings.  There is a mishmash of parameters in use here with more needed in the future.  This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389

Reviewed By: siying

Differential Revision: D21163707

Pulled By: zhichao-cao

fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
2020-04-21 17:38:17 -07:00
Akanksha Mahajan
03a1d95db0 Set max_background_flushes dynamically (#6701)
Summary:
1. Add changes so that max_background_flushes can be set dynamically.
                   2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the
                        max_background_flushes dynamically using SetDBOptions.

TestPlan:  1. make -j64 check
                  2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701

Reviewed By: ajkr

Differential Revision: D21028010

Pulled By: akankshamahajan15

fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1
2020-04-20 16:19:02 -07:00
Yanqin Jin
243852ec15 Add IsDirectory() to Env and FS (#6711)
Summary:
IsDirectory() is a common API to check whether a path is a regular file or
directory.
POSIX: call stat() and use S_ISDIR(st_mode)
Windows: PathIsDirectoryA() and PathIsDirectoryW()
HDFS: FileSystem.IsDirectory()
Java: File.IsDirectory()
...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6711

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D21053520

Pulled By: riversand963

fbshipit-source-id: 680aadfd8ce982b63689190cf31b3145d5a89e27
2020-04-17 14:39:18 -07:00
Mike Kolupaev
e45673dece Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:40:44 -07:00
Zhichao Cao
38dfa406ff Add NewFileChecksumGenCrc32cFactory to file checksum (#6688)
Summary:
Add NewFileChecksumGenCrc32cFactory to file checksum public interface such that applications can use the build in crc32 checksum factory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6688

Test Plan: pass make asan_check

Reviewed By: riversand963

Differential Revision: D21006859

Pulled By: zhichao-cao

fbshipit-source-id: ea8a45196a8b77c310728ab05f6cc0f49f3baef0
2020-04-13 19:13:41 -07:00
Andrew Kryczka
9eca6d651d fix comparison count for format_version=3 indexes (#6650)
Summary:
In index blocks since `format_version=3`, user keys are written
rather than internal keys. When reading such blocks, the comparator is
obtained via `InternalKeyComparator::user_comparator()`. That function
must not return an unwrapped result as the wrapper class provides
accounting logic to populate `PerfContext::user_key_comparison_count`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6650

Test Plan:
ran db_bench and verified
`PerfContext::user_key_comparison_count` became larger.

Reviewed By: cheng-chang

Differential Revision: D20866325

Pulled By: ajkr

fbshipit-source-id: ad755d46bda31157dacc5b66e532279f19ad538c
2020-04-13 11:18:37 -07:00
sdong
1be3be5522 Auto-Format two recent diffs and add HISTORY.md (#6685)
Summary:
Two recent diffs can be autoformatted.
Also add HISTORY.md entry for https://github.com/facebook/rocksdb/pull/6214
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6685

Test Plan: Run all existing tests

Reviewed By: cheng-chang

Differential Revision: D20965780

fbshipit-source-id: 195b08d7849513d42fe14073112cd19fdda6af95
2020-04-10 11:32:44 -07:00
Cheng Chang
6e6f807917 Add two more optimization improvements to HISTORY (#6679)
Summary:
Although these optimizations are not user facing, still feel it's valuable to call out in HISTORY.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6679

Test Plan: no need

Reviewed By: zhichao-cao

Differential Revision: D20945916

Pulled By: cheng-chang

fbshipit-source-id: f3e790c07f3bcc4a8a74246c4fa232800ddd4438
2020-04-09 11:19:51 -07:00
Yi Wu
eb287c72d7 Fix wrong key being read on ingested file with global seqno and delta encoding (#6669)
Summary:
On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key.

The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669

Test Plan:
New unit test

Signed-off-by: Yi Wu <yiwu@pingcap.com>

Reviewed By: cheng-chang

Differential Revision: D20931808

Pulled By: ajkr

fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
2020-04-08 21:22:15 -07:00
sdong
94f90ac6bc compression related options are not copied back from MutableCFOptions… (#6668)
Summary:
… to CFOptions
https://github.com/facebook/rocksdb/pull/6615 made several compression related options dynamically changeable. They are moved to MutableCFOptions. However, they are not copied back to ColumnFamilyOptions, so the changed values are not written to option files and for some other uses. Fix it by copying them back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6668

Test Plan: Add a unit test to make sure that when a MutableCFOptions is converted to CFOptions and back to MutableCFOptions, they stay the same. This test would fail without the fix.

Reviewed By: ajkr

Differential Revision: D20923999

fbshipit-source-id: c3bccd6923b00d677764e2269bed6a95ad7ed780
2020-04-08 14:40:46 -07:00
Zhichao Cao
278911a2d9 Remove redundant in HISTORY (#6627)
Summary:
Remove redundant description in HISTORY

no code change
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6627

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D20797269

Pulled By: zhichao-cao

fbshipit-source-id: dee4c9a22f6d241c985f250c0f11bfaa9198f4c1
2020-04-02 12:12:05 -07:00
Ziyue Yang
03a781a90c Add pipelined & parallel compression optimization (#6262)
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262

Reviewed By: ajkr

Differential Revision: D20651306

fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
2020-04-01 16:40:18 -07:00
sdong
57096ab13e Fix a bug that crashes the service when write buffer manager fails to insert to block cache (#6619)
Summary:
https://github.com/facebook/rocksdb/issues/6247 reports that when write buffer manager fails to insert the dummy entry to block cache, null pointer is still stored and used to release the handle and cause corruption. Fix the bug by not releasing it with null handle.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6619

Test Plan: Add a unit test that fails without the fix.

Reviewed By: ajkr

Differential Revision: D20776769

fbshipit-source-id: 4127fbd9f295a0a3e45774746ffcd91f939f6287
2020-04-01 11:27:40 -07:00
Levi Tamasi
e6f86cfb36 Revert the recent cache deleter change (#6620)
Summary:
Revert "Use function objects as deleters in the block cache (https://github.com/facebook/rocksdb/issues/6545)"

    This reverts commit 6301dbe7a7.

    Revert "Call out the cache deleter related interface change in HISTORY.md (https://github.com/facebook/rocksdb/issues/6606)"

    This reverts commit 3a35542f86.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6620

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D20773311

Pulled By: ltamasi

fbshipit-source-id: 7637a761f718f323ef0e7da959462e8fb06e7a2b
2020-03-31 16:11:06 -07:00
sdong
80979f81c7 Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615)
Summary:
These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615

Test Plan: Add a unit test to make sure that SetOptions() can change the options.

Reviewed By: riversand963

Differential Revision: D20755951

fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff
2020-03-31 12:11:42 -07:00
Yanqin Jin
18cf0de640 Use flush time for the props.creation_time for FIFO compaction (#6612)
Summary:
For FIFO compaction, we use flush time instead of oldest key time as the
creation time. This is to prevent FIFO compaction dropping files whose oldest
key time is older than TTL but which has newer keys than TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612

Test Plan: make check

Reviewed By: siying

Differential Revision: D20748217

Pulled By: riversand963

fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663
2020-03-30 18:59:17 -07:00
Zhichao Cao
eaf95c7d1a Update release version to 6.9.0 (#6610)
Summary:
Update release version to 6.9.0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6610

Test Plan: no code change

Reviewed By: riversand963

Differential Revision: D20741094

Pulled By: zhichao-cao

fbshipit-source-id: 80a9e9ea8d164b6923112352d36fcbc1be85c034
2020-03-30 16:31:02 -07:00
Zhichao Cao
e8d332d97e Use FileChecksumGenFactory for SST file checksum (#6600)
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
2020-03-29 15:58:46 -07:00
Cheng Chang
ee50b8d499 Be able to decrease background thread's CPU priority when creating database backup (#6602)
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.

This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602

Test Plan: make check

Reviewed By: siying, zhichao-cao

Differential Revision: D20683216

Pulled By: cheng-chang

fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
2020-03-28 19:07:25 -07:00
Levi Tamasi
3a35542f86 Call out the cache deleter related interface change in HISTORY.md (#6606)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6606

Reviewed By: riversand963

Differential Revision: D20708411

Pulled By: ltamasi

fbshipit-source-id: c15b4ded19a4b5c84e3e4240bdcec15460806c88
2020-03-27 16:18:23 -07:00
Peter Dillinger
93b80ca7ba Update default BBTO::format_version from 2 to 4 (#6582)
Summary:
Version 4 has been around long enough, for compatibility and
extensive validation, that it should be default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6582

Test Plan:
CI (w.r.t. changing the default; format_version=4 is well
tested and massively in production at Facebook)

Reviewed By: siying

Differential Revision: D20625233

Pulled By: pdillinger

fbshipit-source-id: 2f83ed874cffa4a39bc7a66cdf3833b978fbb948
2020-03-24 21:22:21 -07:00