Summary:
If `options.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery`, RocksDB stops replaying WAL once hitting an error and discards the rest of the WAL. This can lead to data loss if the error occurs at an offset smaller than the last sync'ed offset.
Ideally, RocksDB point-in-time recovery should permit recovery if the error occurs after last synced offset while fail recovery if error occurs before the last synced offset. However, RocksDB does not track the synced offset of WALs. Consequently, RocksDB does not know whether an error occurs before or after the last synced offset. An error can be one of the following.
- WAL record checksum mismatch. This can result from both corruption of synced data and dropping of unsynced data during shutdown. We cannot be sure which one. In order not to defeat the original motivation to permit the latter case, we keep the original behavior of point-in-time WAL recovery.
- IOError. This means the WAL can be bad, an indicator of whole file becoming unavailable, not to mention synced part of the WAL. Therefore, we choose to modify the behavior of point-in-time recovery and fail the database recovery.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6963
Reviewed By: ajkr
Differential Revision: D22011083
Pulled By: riversand963
fbshipit-source-id: f9cbf29a37dc5cc40d3fa62f89eed1ad67ca1536
Summary:
https://github.com/facebook/rocksdb/pull/6901 subtly changed the handling of the corner case
when a table file is deleted from a level, then re-added to the same level. (Note: this
should be extremely rare; one scenario that comes to mind is a trivial move followed by
a call to `ReFitLevel` that moves the file back to the original level.) Before that change,
a new `FileMetaData` object was created as a result of this sequence; after the change,
the original `FileMetaData` was essentially resurrected (since the deletion and the addition
simply cancel each other out with the change). This patch restores the original behavior,
which is more intuitive considering the interface, and in sync with how trivial moves are handled.
(Also note that `FileMetaData` contains some mutable data members, the values of which
might be different in the resurrected object and the freshly created one.)
The PR also fixes a bug in this area: with the original pre-6901 code, `VersionBuilder`
would add the same file twice to the same level in the scenario described above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6939
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D21905580
Pulled By: ltamasi
fbshipit-source-id: da07ae45384ecf3c6c53506d106432d88a7ec9df
Summary:
`DBTest2.CompressionFailures` currently tests many configurations
sequentially using nested loops, which often leads to timeouts
in our test system. The patch turns it into a parameterized test
instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6968
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D22006954
Pulled By: ltamasi
fbshipit-source-id: f71f2f7108086b7651ecfce3d79a7fab24620b2c
Summary:
Application can ingest SST files with file checksum information, such that during ingestion, DB is able to check data integrity and identify of the SST file. The PR introduces generate_and_verify_file_checksum to IngestExternalFileOption to control if the ingested checksum information should be verified with the generated checksum.
1. If generate_and_verify_file_checksum options is *FALSE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enables the SST file checksum and the checksum function name matches the checksum function name in DB, we trust the ingested checksum, store it in Manifest. If the checksum function name does not match, we treat that as an error and fail the IngestExternalFile() call.
2. If generate_and_verify_file_checksum options is *TRUE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enable the SST file checksum, we will use the checksum generator from DB to calculate the checksum for each ingested SST files after they are copied or moved. Then, compare the checksum results with the ingested checksum information: _A)_ if the checksum function name does not match, _verification always report true_ and we store the DB generated checksum information in Manifest. _B)_ if the checksum function name mach, and checksum match, ingestion continues and stores the checksum information in the Manifest. Otherwise, terminate file ingestion and report file corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6891
Test Plan: added unit test, pass make asan_check
Reviewed By: pdillinger
Differential Revision: D21935988
Pulled By: zhichao-cao
fbshipit-source-id: 7b55f486632db467e76d72602218d0658aa7f6ed
Summary:
This is required so that the test cases can safely be run in parallel.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6962
Test Plan: `make check`
Reviewed By: zhichao-cao
Differential Revision: D21980060
Pulled By: ltamasi
fbshipit-source-id: 616b7a0b686155d3874848b9098c67ad3f47efcc
Summary:
This saves up to two key comparisons in block seeks. The first key
comparison saved is a redundant key comparison against the restart key
where the linear scan starts. This comparison is saved in all cases
except when the found key is in the first restart interval. The
second key comparison saved is a redundant key comparison against the
restart key where the linear scan ends. This is only saved in cases
where all keys in the restart interval are less than the target
(probability roughly `1/restart_interval`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6646
Test Plan:
ran a benchmark with mostly default settings and counted key comparisons
before: `user_key_comparison_count = 19399529`
after: `user_key_comparison_count = 18431498`
setup command:
```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```
benchmark command:
```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```
Reviewed By: pdillinger
Differential Revision: D20849707
Pulled By: ajkr
fbshipit-source-id: 1f01c5cd99ea771fd27974046e37b194f1cdcfac
Summary:
Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure
we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911
Test Plan: unit test
Reviewed By: siying
Differential Revision: D21835818
Pulled By: ajkr
fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4
Summary:
Moving towards the long term goal of moving most CI build to CircleCI when possible, add some Linux tests in CircleCI. This is not all what we can include to CircleCI. For example, Java builds are not includ
ed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6937
Test Plan: Watch CI build results.
Reviewed By: pdillinger
Differential Revision: D21941605
fbshipit-source-id: db6aead3c45f523386d4fb30d224cfde573cccad
Summary:
When MultiGet is called with duplicate keys, and the key matches the
largest key in an SST file and the value type is merge, only the first
instance of the duplicate key is returned with correct results. This is
due to the incorrect assumption that if a key in a batch is equal to the
largest key in the file, the next key cannot be present in that file.
Tests:
Add a new unit test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6953
Reviewed By: cheng-chang
Differential Revision: D21935898
Pulled By: anand1976
fbshipit-source-id: a2cc327a15150e23fd997546ca64d1c33021cb4c
Summary:
The patch adds a convenience method `GetFileMetaDataByNumber` that
builds on the `FileLocation` functionality introduced recently (see
https://github.com/facebook/rocksdb/pull/6862). This method makes it possible to
retrieve the `FileMetaData` directly as opposed to having to go through
`LevelFiles` and friends.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6940
Test Plan: `make check`
Reviewed By: cheng-chang
Differential Revision: D21905946
Pulled By: ltamasi
fbshipit-source-id: af99e19de21242b2b4a87594a535c6028d16ee72
Summary:
Implemented a subcommand of sst_dump called identify, which determines whether a file is an SST file or identifies and lists all the SST files in a directory;
This update also fixes the problem that sst_dump exits with a success state even if target file/directory does not exist/is not an SST file/is empty/is corrupted.
One test is added to sst_dump_test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6943
Test Plan: Passed make check and a few manual tests
Reviewed By: pdillinger
Differential Revision: D21928985
Pulled By: gg814
fbshipit-source-id: 9a8b48e0cf1a0e96b13f42b690aba8ad981afad3
Summary:
Since gflags use the global variable to store the flags passed in. In the unit test, if we git one flag per unit test, the result is that all the flags are combined together in the following tests. Therefore, it has the dependency. In this PR, we pass the full arguments each time to ensure that the old arguments will be overwritten by the new one such that the dependency is removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6941
Test Plan: make asan_check. run each unit test in trace_analyzer_test independently and in arbitrary orders.
Reviewed By: pdillinger
Differential Revision: D21909176
Pulled By: zhichao-cao
fbshipit-source-id: dca550a0a4a205c30faa620e258a020a3b5b4e13
Summary:
In db_options.c, we should avoid including header files in the `db` directory to avoid introducing unnecessary dependency. The reason why `version_edit.h` has been included in `db_options.cc` is because we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952
Reviewed By: zhichao-cao
Differential Revision: D21925341
Pulled By: riversand963
fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2
Summary:
RocksDB is an embedded library; we should not write to the application's
console. Note: in each case, the same information is returned in the form of a
`Status::Corruption` object.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6948
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D21914965
Pulled By: ltamasi
fbshipit-source-id: ae4b66789aa6b659eb8cc2ed4a048187962c86cc
Summary:
We currently do not have any validation that would ensure that the `FileMetaData`
objects are equivalent when a file gets deleted from the LSM tree and then re-added
(think trivial moves); however, if we did, this test case would be in violation. The patch
changes the values used in the test case so they are consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6942
Test Plan: `make check`
Reviewed By: zhichao-cao
Differential Revision: D21911366
Pulled By: ltamasi
fbshipit-source-id: 2f0486f8337373a6a111b6f28433d70507857104
Summary:
Make RocksDB run a predefined unit test so that it can be integrated with better tools.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6926
Test Plan: Watch tests
Reviewed By: pdillinger
Differential Revision: D21866216
fbshipit-source-id: cafca82efdf0b72671be8d30b665e88a75ae6000
Summary:
The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()``` which could be ```false``` either due to reaching the end of the index or, in case of partitioned index, it could be due to a checksum mismatch error when reading a 2nd level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller.
Tests:
Add a test in block_based_table_reader_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909
Reviewed By: pdillinger
Differential Revision: D21833922
Pulled By: anand1976
fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae
Summary:
Currently, `DeleteDir` only deletes the directory if there are no other directories under the target dir. This PR makes it delete directories recursively.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6934
Test Plan:
Added a new unit test in testutil_test.cc.
`make testutil_test`
Reviewed By: zhichao-cao
Differential Revision: D21884211
Pulled By: cheng-chang
fbshipit-source-id: 0b9a48a200f494ee007aef5d1763b4aa331f8b5a
Summary:
Confusing checks for null that are never null
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6933
Test Plan: make check
Reviewed By: cheng-chang
Differential Revision: D21885466
Pulled By: pdillinger
fbshipit-source-id: 4b48e03c2a33727f2702b0d12292f9fda5a3c475
Summary:
Mostly uninitialized values: some probably written before use, but some seem like bugs. Also, destructor needs to be virtual, and possible use-after-free in test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6935
Test Plan: make check
Reviewed By: siying
Differential Revision: D21885484
Pulled By: pdillinger
fbshipit-source-id: e2e7cb0a0cf196f2b55edd16f0634e81f6cc8e08
Summary:
When operation on an open file descriptor fails, we should close the file descriptor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6936
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D21885458
Pulled By: riversand963
fbshipit-source-id: ba077a76b256a8537f21e22e4ec198f45390bf50
Summary:
StringAppendOperatorTest right now runs in a mode where RUN_ALL_TESTS() is executed twice for the same test but different settings. This creates a problem with a tool that expects every test to run once. Fix it by using a parameterized test instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6930
Test Plan: Run the test and see it passed.
Reviewed By: ltamasi
Differential Revision: D21874145
fbshipit-source-id: 55520b2d7f1ba9f3cba1e2d087fe86f43fb06145
Summary:
When running ThreadLocalTest.SequentialReadWriteTest individually, the test fails with:
] ./thread_local_test --gtest_filter="*SequentialReadWriteTest*"
Note: Google Test filter = *SequentialReadWriteTest*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ThreadLocalTest
[ RUN ] ThreadLocalTest.SequentialReadWriteTest
internal_repo_rocksdb/repo/util/thread_local_test.cc:144: Failure
Expected: IDChecker::PeekId()
Which is: 3
To be equal to: base_id + 1u
Which is: 2
[ FAILED ] ThreadLocalTest.SequentialReadWriteTest (1 ms)
[----------] 1 test from ThreadLocalTest (1 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] ThreadLocalTest.SequentialReadWriteTest
1 FAILED TEST
It appears that when running as the first test, PeakId() was updated twice. I didn't dig into it why but it doesn't seem to break the contract. Relax the assertion to make it pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6929
Test Plan: Run the test individually and as the whole thread_local_test
Reviewed By: riversand963
Differential Revision: D21873999
fbshipit-source-id: 1dcb6a2e9c38b6afd848027308bfe633342b7548
Summary:
The LDB create and drop column family commands failed to check if theere was a valid database prior to dereferencing it, leading to a core dump.
The SstFileDumper prefetch code would dereference a file when the file did not exist as part of the Prefetch code. This dereference was moved inside an st.ok() check.
Tests were added for both failure conditions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6922
Reviewed By: gg814
Differential Revision: D21884024
Pulled By: pdillinger
fbshipit-source-id: bddd45c299aa9dc7e928c17a37a96521f8c9149e
Summary:
When run */RunMany/* tests individually, e.g. ChrootEnvWithDirectIO/EnvPosixTestWithParam.RunMany/0, they hang. It's because they insert to background thread pool without initializing them. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6931
Test Plan: Run ChrootEnvWithDirectIO/EnvPosixTestWithParam.RunMany/0 by itself and see it passes.
Reviewed By: riversand963
Differential Revision: D21875603
fbshipit-source-id: 7f848174c1a660254a2b1f7e11cca5370793ba30
Summary:
As title. The prior change to the line is a typo. Fixing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6928
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D21873587
Pulled By: riversand963
fbshipit-source-id: f4837fc8792d7106bc230b7b499dfbb7a2847430
Summary:
DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true.
This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly.
Two tests for this updated behavior of DB::OpenForReadOnly are also added.
Other minor changes:
1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly;
2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing];
3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900
Test Plan: passed make check; also manually tested a few ldb subcommands
Reviewed By: pdillinger
Differential Revision: D21822188
Pulled By: gg814
fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e
Summary:
We recently removed the dependencies of core components on gtest. Add a Travis test to make sure it doesn't regress. Change cmake setting so that the gtest related components are only included when tests, benchmarks or stress tools are included in the build. Add this build setting in Travis to confirm it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6921
Test Plan: See Travis passes
Reviewed By: ajkr
Differential Revision: D21863564
fbshipit-source-id: df26f50a8305a04ff19ffa8069a1857ecee10289
Summary:
This reverts commit 8d87e9cea1.
Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923
Reviewed By: pdillinger
Differential Revision: D21864799
fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
Summary:
GetTestDirectory implies a file system operation (it creates the
default test directory if missing), so it should be routed to
the FileSystem rather than the Env.
Also remove the GetTestDirectory implementation in the PosixEnv,
since it overrides GetTestDirectory in CompositeEnv making it
impossible to override with a custom FileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6896
Reviewed By: cheng-chang
Differential Revision: D21868984
Pulled By: ajkr
fbshipit-source-id: e79bfef758d06dacef727c54b96abe62e78726fd
Summary:
Added setting of zstd_max_train_bytes compression option parameter to c interop.
rocksdb_options_set_bottommost_compression_options was using bool parameter and thus not exported, updated it to unsigned char and added to c.h as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6796
Reviewed By: cheng-chang
Differential Revision: D21611471
Pulled By: ajkr
fbshipit-source-id: caaaf153de934837ad9af283c7f8c025ff0b0cf5
Summary:
The OptionTypeInfo::Vector method allows a vector<T> to be converted to/from strings via the options.
The kVectorInt and kVectorCompressionType vectors were replaced with this methodology.
As part of this change, the NextToken method was added to the OptionTypeInfo. This method was refactored from code within the StringToMap function.
Future types that could use this functionality include the EventListener vectors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6424
Reviewed By: cheng-chang
Differential Revision: D21832368
Pulled By: zhichao-cao
fbshipit-source-id: e1ca766faff139d54e6e8407a9ec09ece6517439
Summary:
Rocksdb is using the c++11 std::threads feature. The issue is that
MINGW only supports it when using Posix threads.
This change will allow rocksdb::port::WindowsThread to be replaced
with std::thread, which in turn will allow Rocksdb to be cross
compiled using MINGW.
At the same time, we'll have to use GetCurrentProcessId instead of _getpid.
Signed-off-by: Lucian Petrut <lpetrut@cloudbasesolutions.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6865
Reviewed By: cheng-chang
Differential Revision: D21864285
Pulled By: ajkr
fbshipit-source-id: 0982eed313e7d34d351b1364c1ccc722da473205
Summary:
People keep breaking the gcc 4.8 compilation due to different
warnings for shadowing member functions with locals. Adding to Travis
to keep compatibility. (gcc 4.8 is default on CentOS 7.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6915
Test Plan: local and Travis
Reviewed By: siying
Differential Revision: D21842894
Pulled By: pdillinger
fbshipit-source-id: bdcd4385127ee5d1cc222d87e53fb3695c32a9d4
Summary:
The patch cleans up the code and improves the consistency checks around
adding/deleting table files in `VersionBuilder`. Namely, it makes the checks
stricter and improves them in the following ways:
1) A table file can now only be deleted from the LSM tree using the level it
resides on. Earlier, there was some unnecessary wiggle room for
trivially moved files (they could be deleted using a lower level number than
the actual one).
2) A table file cannot be added to the tree if it is already present in the tree
on any level (not just the target level). The earlier code only had an assertion
(which is a no-op in release builds) that the newly added file is not already
present on the target level.
3) The above consistency checks around state transitions are now mandatory,
as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op
in release mode unless `force_consistency_checks` was set to `true`. The rationale
here is that assuming that the initial state is consistent, a valid transition leads to a
next state that is also consistent; however, an *invalid* transition offers no such
guarantee. Hence it makes sense to validate the transitions unconditionally,
and save `force_consistency_checks` for the paranoid checks that re-validate
the entire state.
4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862,
which enables us to efficiently look up the location (level and position within level)
of files in a `Version` by file number. This makes the consistency checks much more
efficient than the earlier `CheckConsistencyForDeletes`, which essentially
performed a linear search.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901
Test Plan:
Extended the unit tests and ran:
`make check`
`make whitebox_crash_test`
Reviewed By: ajkr
Differential Revision: D21822714
Pulled By: ltamasi
fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
Summary:
Because ARM and some other platforms have a larger cache line
size, they have a larger minimum filter size, which causes recently
added PartitionedMultiGet test in db_bloom_filter_test to fail on those
platforms. The code would actually end up using larger partitions,
because keys_per_partition_ would be 0 and never == number of keys
added.
The code now attempts to get as close as possible to the small target
size, while fully utilizing that filter size, if the target partition
size is smaller than the minimum filter size.
Also updated the test to break more uniformly across platforms
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6905
Test Plan: updated test, tested on ARM
Reviewed By: anand1976
Differential Revision: D21840639
Pulled By: pdillinger
fbshipit-source-id: 11684b6d35f43d2e98b85ddb2c8dcfd59d670817
Summary:
x.size() -1 or y - 1 can overflow to an extremely large value when x.size() pr y is 0 when they are unsigned type. The end condition of i in the for loop will be extremely large, potentially causes segment fault. Fix them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902
Test Plan: pass make asan_check
Reviewed By: ajkr
Differential Revision: D21843767
Pulled By: zhichao-cao
fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1
Summary:
Replace Status with IOStatus in CopyFile and CreateFile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6916
Test Plan: pass make asan_check
Reviewed By: cheng-chang
Differential Revision: D21843775
Pulled By: zhichao-cao
fbshipit-source-id: 524d4a0fcf47f0941b923da0346e0de71607f5f6
Summary:
production code under utilities/cassandra depends on gtest.h. Remove them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6908
Test Plan: Run all existing tests.
Reviewed By: ajkr
Differential Revision: D21842606
fbshipit-source-id: a098e0b49c9aeac51cc90a79562ad9897a36122c
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.
It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.
So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.
Also includes miscellaneous comment improvements / clarifications.
Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784
Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...
[ RUN ] DBTest.ApproximateSizesFilesWithErrorMargin
db/db_test.cc:1562: Failure
Expected: (size) <= (11 * 100), actual: 9478 vs 1100
Other tests updated to reflect consistent accounting of metadata.
Reviewed By: siying
Differential Revision: D21334706
Pulled By: pdillinger
fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185