Summary:
When operation on an open file descriptor fails, we should close the file descriptor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6936
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D21885458
Pulled By: riversand963
fbshipit-source-id: ba077a76b256a8537f21e22e4ec198f45390bf50
Summary:
DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true.
This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly.
Two tests for this updated behavior of DB::OpenForReadOnly are also added.
Other minor changes:
1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly;
2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing];
3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900
Test Plan: passed make check; also manually tested a few ldb subcommands
Reviewed By: pdillinger
Differential Revision: D21822188
Pulled By: gg814
fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.
It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.
So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.
Also includes miscellaneous comment improvements / clarifications.
Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784
Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...
[ RUN ] DBTest.ApproximateSizesFilesWithErrorMargin
db/db_test.cc:1562: Failure
Expected: (size) <= (11 * 100), actual: 9478 vs 1100
Other tests updated to reflect consistent accounting of metadata.
Reviewed By: siying
Differential Revision: D21334706
Pulled By: pdillinger
fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
Summary:
`IterKey::UpdateInternalKey()` is an error-prone API as it's
incompatible with `IterKey::TrimAppend()`, which is used for
decoding delta-encoded internal keys. This PR stops using it in
`BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s
buffer when needed. The logic for safely getting a Slice with global
seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`.
`BinarySeek()` is also migrated to use this API (previously it ignored
global seqno entirely).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843
Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.
```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```
benchmark run command:
```
TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000
```
results:
| DB | code | throughput |
|---|---|---|
| normal_db | master | 267.9 |
| normal_db | PR6843 | 254.2 (-5.1%) |
| ingestion_db | master | 259.6 |
| ingestion_db | PR6843 | 250.5 (-3.5%) |
Reviewed By: pdillinger
Differential Revision: D21562604
Pulled By: ajkr
fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264
Summary:
1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key.
2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826
Test Plan:
1. make check -j64
2. Add a new unit test case
Reviewed By: anand1976
Differential Revision: D21471483
Pulled By: akankshamahajan15
fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
Summary:
If Option.file_checksum_gen_factory is set, rocksdb generates the file checksum during flush and compaction based on the checksum generator created by the factory and store the checksum and function name in vstorage and Manifest.
This PR enable file checksum generation in SstFileWrite and store the checksum and checksum function name in the ExternalSstFileInfo, such that application can use them for other purpose, for example, ingest the file checksum with files in IngestExternalFile().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6859
Test Plan: add unit test and pass make asan_check.
Reviewed By: ajkr
Differential Revision: D21656247
Pulled By: zhichao-cao
fbshipit-source-id: 78a3570c76031d8832e3d2de3d6c79cdf2b675d0
Summary:
sst_dump can issue many file reads from the file system. This doesn't work well with file systems without a OS cache, especially remote file systems. In order to mitigate this problem, several improvements are done:
1. --readahead_size is added, so that users can specify readahead size when scanning the data.
2. Force a 512KB tail readahead, which prevents three I/Os for footer, meta index and property blocks and hopefully index and filter blocks too.
3. Consoldiate SSTDump's I/Os before opening the file for read. Use the same file prefetch buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836
Test Plan: Add a test that covers this new feature.
Reviewed By: pdillinger
Differential Revision: D21516607
fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
Summary:
When using ldb, users cannot turn on force consistency check in most commands, while they cannot use checksonsistnecy with --try_load_options. The change fixes both by:
1. checkconsistency now calls OpenDB() so that it gets all the options loading and sanitized options logic
2. use options.check_consistency_checks = true by default, and add a --disable_consistency_checks to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802
Test Plan: Add a new unit test. Some manual tests with corrupted DBs.
Reviewed By: pdillinger
Differential Revision: D21388051
fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
Summary:
1. Update column_family_memtables_ to point to latest column_family_set in
version_set after recovery.
2. Normalize file paths passed by application so that directories end with '/'
or '\\'.
3. In addition to missing files, corrupted files are also ignored in
best-efforts recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6824
Test Plan: COMPILE_WITH_ASAN=1 make check
Reviewed By: anand1976
Differential Revision: D21463905
Pulled By: riversand963
fbshipit-source-id: c48db8843cc93c8c1c7139c474b64e6f775307d2
Summary:
Delete triggered compaction in universal compaction mode was causing a corruption when scheduled in parallel with other compactions.
1. When num_levels = 1, a file marked for compaction may be picked along with all older files in L0, without checking if any of them are already being compaction. This can cause unpredictable results like resurrection of older versions of keys or deleted keys.
2. When num_levels > 1, a delete triggered compaction would not get scheduled if it overlaps with a running regular compaction. However, the reverse is not true. This is due to the fact that in ```UniversalCompactionBuilder::CalculateSortedRuns```, it assumes that entire sorted runs are picked for compaction and only checks the first file in a sorted run to determine conflicts. This is violated by a delete triggered compaction as it works on a subset of a sorted run.
Fix the bug for num_levels > 1, and disable the feature for now when num_levels = 1. After disabling this feature, files would still get marked for compaction, but no compaction would get scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6799
Reviewed By: pdillinger
Differential Revision: D21431286
Pulled By: anand1976
fbshipit-source-id: ae9f0bdb1d6ae2f10284847db731c23f43af164a
Summary:
…e range"
Moved it from the wrong section (6.10) to the right section (Unreleased).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6825
Reviewed By: zhichao-cao
Differential Revision: D21464577
Pulled By: ajkr
fbshipit-source-id: a836b4ab10be2464182826f9411c9c424c933b70
Summary:
The error is assigning KeyContext::s to NotFound status in a
table reader for a "not found in this table" case, which skips searching
in later tables, like only a delete should. (The hash search index iterator
is the only one that can return status NotFound even if Valid() == false.)
This was detected by intermittent failure in
MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821
Test Plan: modified existing unit test to reproduce problem
Reviewed By: anand1976
Differential Revision: D21450469
Pulled By: pdillinger
fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c
Summary:
We found some files containing nothing but negative range tombstones,
and unsurprisingly their metadata specified a negative range, which made
things crash. Time to add a bit of user input validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6788
Reviewed By: zhichao-cao
Differential Revision: D21343719
Pulled By: ajkr
fbshipit-source-id: f1c16e4c3e9fa150958c8c866176632a3206fb74
Summary:
https://github.com/facebook/rocksdb/pull/6807 extends the logic that
identifies and purges obsolete files to blob files handled by RocksDB
itself. In order to prevent that from interfering with the current BlobDB code,
we need to make sure that `BlobDBOptions::blob_dir` is different from
the storage directories used by the base DB. (Note: this is true by default.)
The patch adds a check that explicitly disallows this configuration and
returns `Status::NotSupported` from `BlobDB::Open` in such cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6810
Test Plan: Tested using the BlobDB mode of `db_bench`.
Reviewed By: riversand963
Differential Revision: D21412676
Pulled By: ltamasi
fbshipit-source-id: 6630cc7481e48c8bf55d59423b25f14d52ffe681
Summary:
Current impl. of db_stress will abort verification and report failure if
GetLiveFiles() causes a dropped column family to be flushed. This is not
desired.
To fix, this PR makes the following change:
In GetLiveFiles, if flush is triggered and returns
Status::IsColumnFamilyDropped(), then set status to Status::OK().
This is OK because dropped column families will be skipped during the rest of
this function, and valid column families will have their live files returned to
caller.
Test plan (dev server):
make check
./db_stress -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
./db_stress -disable_wal=1 -reopen=0 -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6805
Reviewed By: ltamasi
Differential Revision: D21390044
Pulled By: riversand963
fbshipit-source-id: de67846b95a4f1b88aa0a30c3d70c43cc68625b9
Summary:
We are swallowing some file consistency checking failures. This is not expected. We are fixing two cases: DB reopen and manifest dump.
More places are not fixed and need follow-up.
Error from CheckConsistencyForDeletes() is also swallowed, which is not fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6793
Test Plan: Add a unit test to cover the reopen case.
Reviewed By: riversand963
Differential Revision: D21366525
fbshipit-source-id: eb438a322237814e8d5125f916a3c6de97f39ded
Summary:
The feature of CompressionOptions::parallel_threads is still not yet mature. Mention it to be experimental in the comments for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6781
Reviewed By: pdillinger
Differential Revision: D21330678
fbshipit-source-id: d7dd7d099fb002a5c6a5d8da689ce5ee08a9eb13
Summary:
If an error happens during BlobDBImpl::Open after the base DB has been
opened, we need to destroy the `ColumnFamilyHandle`s returned by `DB::Open`
to prevent an assertion in `ColumnFamilySet`'s destructor from being hit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6763
Test Plan: Ran `make check` and tested using the BlobDB mode of `db_bench`.
Reviewed By: riversand963
Differential Revision: D21262643
Pulled By: ltamasi
fbshipit-source-id: 60ebc7ab19be66cf37fbe5f6d8957d58470f3d3b
Summary:
Summary : 1. Add two arguments --compression_level_from and --compression_level_to to check
the compression size with different compression level in the given range. Users must
specify one compression type else it will error out. Both from and to levels must
also be specified together.
2. Display the time taken to compress each file with different compressions by default.
Test Plan : make -j64 check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6634
Test Plan: make -j64 check
Reviewed By: anand1976
Differential Revision: D20810282
Pulled By: akankshamahajan15
fbshipit-source-id: ac9098d3c079a1fad098f6678dbedb4d888a791b
Summary:
In https://github.com/facebook/rocksdb/pull/6455, we modified the interface of `RandomAccessFileReader::Read` to be able to get rid of memcpy in direct IO mode.
This PR applies the new interface to `BlockFetcher` when reading blocks from SST files in direct IO mode.
Without this PR, in direct IO mode, when fetching and uncompressing compressed blocks, `BlockFetcher` will first copy the raw compressed block into `BlockFetcher::compressed_buf_` or `BlockFetcher::stack_buf_` inside `RandomAccessFileReader::Read` depending on the block size. then during uncompressing, it will copy the uncompressed block into `BlockFetcher::heap_buf_`.
In this PR, we get rid of the first memcpy and directly uncompress the block from `direct_io_buf_` to `heap_buf_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6689
Test Plan: A new unit test `block_fetcher_test` is added.
Reviewed By: anand1976
Differential Revision: D21006729
Pulled By: cheng-chang
fbshipit-source-id: 2370b92c24075692423b81277415feb2aed5d980
Summary:
After a successful recovery, the CURRENT file should be updated to point to the valid MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6746
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D21189876
Pulled By: riversand963
fbshipit-source-id: 7537b49988c5c425ebe9505a5cc260de351ad79b
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389
Reviewed By: siying
Differential Revision: D21163707
Pulled By: zhichao-cao
fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
Summary:
1. Add changes so that max_background_flushes can be set dynamically.
2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the
max_background_flushes dynamically using SetDBOptions.
TestPlan: 1. make -j64 check
2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701
Reviewed By: ajkr
Differential Revision: D21028010
Pulled By: akankshamahajan15
fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1
Summary:
IsDirectory() is a common API to check whether a path is a regular file or
directory.
POSIX: call stat() and use S_ISDIR(st_mode)
Windows: PathIsDirectoryA() and PathIsDirectoryW()
HDFS: FileSystem.IsDirectory()
Java: File.IsDirectory()
...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6711
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D21053520
Pulled By: riversand963
fbshipit-source-id: 680aadfd8ce982b63689190cf31b3145d5a89e27
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
Summary:
Add NewFileChecksumGenCrc32cFactory to file checksum public interface such that applications can use the build in crc32 checksum factory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6688
Test Plan: pass make asan_check
Reviewed By: riversand963
Differential Revision: D21006859
Pulled By: zhichao-cao
fbshipit-source-id: ea8a45196a8b77c310728ab05f6cc0f49f3baef0
Summary:
In index blocks since `format_version=3`, user keys are written
rather than internal keys. When reading such blocks, the comparator is
obtained via `InternalKeyComparator::user_comparator()`. That function
must not return an unwrapped result as the wrapper class provides
accounting logic to populate `PerfContext::user_key_comparison_count`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6650
Test Plan:
ran db_bench and verified
`PerfContext::user_key_comparison_count` became larger.
Reviewed By: cheng-chang
Differential Revision: D20866325
Pulled By: ajkr
fbshipit-source-id: ad755d46bda31157dacc5b66e532279f19ad538c
Summary:
Although these optimizations are not user facing, still feel it's valuable to call out in HISTORY.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6679
Test Plan: no need
Reviewed By: zhichao-cao
Differential Revision: D20945916
Pulled By: cheng-chang
fbshipit-source-id: f3e790c07f3bcc4a8a74246c4fa232800ddd4438
Summary:
On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key.
The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669
Test Plan:
New unit test
Signed-off-by: Yi Wu <yiwu@pingcap.com>
Reviewed By: cheng-chang
Differential Revision: D20931808
Pulled By: ajkr
fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
Summary:
… to CFOptions
https://github.com/facebook/rocksdb/pull/6615 made several compression related options dynamically changeable. They are moved to MutableCFOptions. However, they are not copied back to ColumnFamilyOptions, so the changed values are not written to option files and for some other uses. Fix it by copying them back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6668
Test Plan: Add a unit test to make sure that when a MutableCFOptions is converted to CFOptions and back to MutableCFOptions, they stay the same. This test would fail without the fix.
Reviewed By: ajkr
Differential Revision: D20923999
fbshipit-source-id: c3bccd6923b00d677764e2269bed6a95ad7ed780
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262
Reviewed By: ajkr
Differential Revision: D20651306
fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
Summary:
https://github.com/facebook/rocksdb/issues/6247 reports that when write buffer manager fails to insert the dummy entry to block cache, null pointer is still stored and used to release the handle and cause corruption. Fix the bug by not releasing it with null handle.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6619
Test Plan: Add a unit test that fails without the fix.
Reviewed By: ajkr
Differential Revision: D20776769
fbshipit-source-id: 4127fbd9f295a0a3e45774746ffcd91f939f6287
Summary:
These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615
Test Plan: Add a unit test to make sure that SetOptions() can change the options.
Reviewed By: riversand963
Differential Revision: D20755951
fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff
Summary:
For FIFO compaction, we use flush time instead of oldest key time as the
creation time. This is to prevent FIFO compaction dropping files whose oldest
key time is older than TTL but which has newer keys than TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612
Test Plan: make check
Reviewed By: siying
Differential Revision: D20748217
Pulled By: riversand963
fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600
Test Plan: tested with make asan_check
Reviewed By: riversand963
Differential Revision: D20717670
Pulled By: zhichao-cao
fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.
This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602
Test Plan: make check
Reviewed By: siying, zhichao-cao
Differential Revision: D20683216
Pulled By: cheng-chang
fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
Summary:
Version 4 has been around long enough, for compatibility and
extensive validation, that it should be default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6582
Test Plan:
CI (w.r.t. changing the default; format_version=4 is well
tested and massively in production at Facebook)
Reviewed By: siying
Differential Revision: D20625233
Pulled By: pdillinger
fbshipit-source-id: 2f83ed874cffa4a39bc7a66cdf3833b978fbb948
Summary:
When applying a new version in non DB open case, optimize_filters_for_hits is used for max_threads, which is clearly a bug. It is not clear what the indented value in the first place, but it value 1 makes sense here, which would create no extra threads. This bug is not expected to cause user visible problems, assuming C++ implicitly cast bool to 0 or 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6576
Test Plan: Run all exsiting test.
Reviewed By: ajkr
Differential Revision: D20602467
fbshipit-source-id: 40b2cd8619aba09ae9242b36c415464db3c9b737
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.
Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334
Reviewed By: anand1976
Differential Revision: D19778960
Pulled By: riversand963
fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8