Summary:
There is an undocumented behavior about a certain combination of options and operations.
- inplace_update_support = true, and
- call `SeekForPrev()`, `SeekToLast()`, and/or `Prev()` on unflushed data.
We should stop the backward iteration and report an error of `Status::NotSupported`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7618
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D24769619
Pulled By: riversand963
fbshipit-source-id: 81d199fa55ed4739ab10e719cc345a992238ccbb
Summary:
https://github.com/facebook/rocksdb/issues/7556 introduced support for compaction iterator to perform timestamp-aware garbage collection.
However, there was a bug. The comparison between `ikey_.user_key` and `current_user_key_` should happen
before `key_ = current_key_.SetInternalKey(key_, &ikey_);` (line 336 of compaction_iterator.cc).
Otherwise, after this line, `current_key_` is always the same as `ikey_.user_key`.
This PR also re-arranged the order of some data members because some of them are state variables of `CompactionIterator` while others are inputs from callers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7645
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D24845028
Pulled By: riversand963
fbshipit-source-id: c7e79914832701462b86867e8463cd463b6c0c25
Summary:
After replaying the WALs, the memtables are flushed synchronously to L0 instead of being flushed in background. Currently, we only track WAL obsoletion events in the code path of background flush jobs. This PR tracks these events in RecoverLogFiles.
After this change, we can enable `track_and_verify_wal_in_manifest` in `db_stress`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7649
Test Plan: `python tools/db_crashtest.py whitebox`
Reviewed By: riversand963
Differential Revision: D24824501
Pulled By: cheng-chang
fbshipit-source-id: 207129f7b845c50b333680ce6818a68a2fad54b9
Summary:
Consider the following sequence of events:
1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.
It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.
After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621
Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.
Reviewed By: riversand963
Differential Revision: D24668144
Pulled By: cheng-chang
fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
Summary:
When a WAL is synced, an edit is written to MANIFEST.
After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601
Test Plan:
`track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test.
Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated.
Watch all tests to pass.
Reviewed By: ltamasi
Differential Revision: D24553957
Pulled By: cheng-chang
fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040
Summary:
The test often times out in internal test infra.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7640
Test Plan: watch test to pass internally
Reviewed By: anand1976
Differential Revision: D24764928
Pulled By: cheng-chang
fbshipit-source-id: 587f2afc97f52909837943fd938a86ca94544b2c
Summary:
In this test, after flushing memtable, it will read directly from the sst files, so `env_do_fsync` was `true` to ensure that the flushed sst files can be read afterwards. Considering that the test does not last long, the data should be available in os buffer even without fsync, so this PR tries to disable fsync to reduce test time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7639
Test Plan: watch the test to pass in internal infra
Reviewed By: anand1976
Differential Revision: D24764689
Pulled By: cheng-chang
fbshipit-source-id: ef827611a3eaca04201e4280ae801d6c8e60c138
Summary:
This test often times out in internal test infra.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7637
Test Plan: watch test to pass
Reviewed By: ajkr
Differential Revision: D24763939
Pulled By: cheng-chang
fbshipit-source-id: 6564ee2ef637e9faf6688d4b6a5d74a72a51c5e8
Summary:
In `BuildTable()`, we call `builder->Finish()` before evaluating `builder->NeedCompact()`.
However, we call `builder->NeedCompact()` before `builder->Finish()` in compaction job. This can be wrong because the table properties collectors may rely on the success of `Finish()` to provide correct result for `NeedCompact()`.
Test plan (on devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7627
Reviewed By: ajkr
Differential Revision: D24728741
Pulled By: riversand963
fbshipit-source-id: 5a0dce244e14eb1106c4f87021e6bebca82b486e
Summary:
Existing API `VerifyChecksum()` allows application to verify sst files' block checksums.
Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new
API to verify sst files' file checksums.
```
// Compute table file checksums if applicable and compare with MANIFEST.
// Returns OK if no file has mismatching whole-file checksum.
Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/);
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D24436783
Pulled By: riversand963
fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3
Summary:
Recent test report shows that some tests have been skipped.
For DBWALTest that inherits from DBTestBase, the following will always be
true, since `env_` is an instance of `SpecialEnv`, not `Env::Default()`. Thus the test
will always be skipped.
```
if (options.env != Env::Default()) {
ROCKSDB_GTEST_SKIP("Test requires default environment");
return;
}
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7628
Test Plan:
./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
MEM_ENV=1 ./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
make check
Reviewed By: jay-zhuang
Differential Revision: D24693006
Pulled By: riversand963
fbshipit-source-id: 7f2a772492a0f11bff17bbf5e9f493e9e9a1c125
Summary:
The filter query key should not contain timestamp. The timestamp is
stripped for Get(), but not MultiGet().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589
Reviewed By: riversand963
Differential Revision: D24494661
Pulled By: jay-zhuang
fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317
Summary:
Test report shows that this test has been skipped recently due to
a condition that will never meet. `env_` is not equal to
`Env::Default()` for DBTest2 that inherits from DBTestBase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7629
Test Plan:
make check
./db_test2 --gtest_filter=DBTest2.PinnableSliceAndMmapReads
Reviewed By: jay-zhuang
Differential Revision: D24693317
Pulled By: riversand963
fbshipit-source-id: b1bbd5c1e05a6fa57c1de0d74462b69e3c2d5215
Summary:
In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it
for bottom level under the assumption that the vast majority of data is
stored there. At that time, there was no
such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just
hardcoded disabling dictionary compression in flush and compactions to
non-bottommost level. Now, we have users who generate all their files
through flush and are considering using dictionary compression.
To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to
additionally include flushed files and files generated by compaction to
a non-bottommost level. Users can still get the old behavior by moving
their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts`
and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619
Reviewed By: ltamasi
Differential Revision: D24665610
Pulled By: ajkr
fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D24579392
Pulled By: riversand963
fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
Summary:
CreateFileChecksumGenerator may uses requested_checksum_func_name in generator context to decide which generator will be used. GenerateOneFileChecksum has not being updated to use it, which will always get the generator when the name is empty. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7586
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D24491989
Pulled By: zhichao-cao
fbshipit-source-id: d9fdfdd431240f0a9a2e781ddbd48a7d6c609aad
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497
When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`
Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.
Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test
Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515
Reviewed By: ajkr
Differential Revision: D24240264
Pulled By: ramvadiv
fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
Summary:
Currently, the following interleaving of events can lead to SuperVersion containing both immutable memtables as well as the resulting L0. This can cause Get to return incorrect result if there are merge operands. This may also affect other operations such as single deletes.
```
time main_thr bg_flush_thr bg_compact_thr compact_thr set_opts_thr
0 | WriteManifest:0
1 | issue compact
2 | wait
3 | Merge(counter)
4 | issue flush
5 | wait
6 | WriteManifest:1
7 | wake up
8 | write manifest
9 | wake up
10 | Get(counter)
11 | remove imm
V
```
The reason behind is that: one bg flush thread's installing new `Version` can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from current super version only after `LogAndApply` returns. After the leader MANIFEST writer signals (releasing mutex) this bg flush thread, it is possible that another thread sees this cf with both memtables (whose data have been flushed to the newest L0) and the L0 before this bg flush thread removes the memtables.
To address this issue, each bg flush thread can pass a callback function to `LogAndApply`. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex.
Test plan (devserver)
```
$make merge_test
$./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6069
Reviewed By: cheng-chang
Differential Revision: D18790894
Pulled By: riversand963
fbshipit-source-id: e41bd600c0448b4f4b2deb3f7677f95e3076b4ed
Summary:
Similarly to how https://github.com/facebook/rocksdb/issues/7345
integrated blob file writing into the flush process,
the patch adds support for writing blob files to the compaction logic.
Namely, if `enable_blob_files` is set, large values encountered during
compaction are extracted to blob files and replaced with blob indexes.
The resulting blob files are then logged to the MANIFEST as part of the
compaction job's `VersionEdit` and added to the `Version` alongside any
table files written by the compaction. Any errors during blob file building fail
the compaction job.
There will be a separate follow-up patch to perform blob garbage collection
during compactions.
In addition, the patch continues to chip away at the mess around computing
various compaction related statistics by eliminating some code duplication
and by making the `num_output_files` and `bytes_written` stats more consistent
for flushes, compactions, and recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7573
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24404696
Pulled By: ltamasi
fbshipit-source-id: 21216af3a172ad3ce8f85d11cd30923784ae426c
Summary:
Add a threshold timestamp, full_history_ts_low_ of type `std::string*` to
`CompactionIterator`, so that RocksDB can also perform garbage collection during
compaction.
* If full_history_ts_low_ is nullptr, then compaction iterator does not perform
GC, preserving all timestamp history for all keys. Compaction iterator will
treat user key with different timestamps as different user keys.
* If full_history_ts_low_ is not nullptr, then compaction iterator performs
GC. GC will look at keys older than `*full_history_ts_low_` and determine their
eligibility based on factors including snapshots.
Current rules of GC:
* If an internal key is in the same snapshot as a previous counterpart
with the same user key, and this key is eligible for GC, and the key is
not single-delete or merge operand, then this key can be dropped. Note
that the previous internal key cannot be a merge operand either.
* If a tombstone is the most recent one in the earliest snapshot and it
is eligible for GC, and keyNotExistsBeyondLevel() is true, then this
tombstone can be dropped.
* If a tombstone is the most recent one in a snapshot and it is eligible
for GC, and the compaction is at bottommost level, then all other older
internal keys of the same user key must also be eligible for GC, thus
can be dropped
* Single-delete, delete-range and merge are not currently supported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7556
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D24507728
Pulled By: riversand963
fbshipit-source-id: 3c09c7301f41eed76dfcf4d1527e68cf6e0a8bb3
Summary:
This PR makes it able to `LogAndApply` `VersionEdit`s related to WALs, and also be able to `Recover` from MANIFEST with WAL related `VersionEdit`s.
The `VersionEdit`s related to WAL are treated similarly as those related to column family operations, they are not applied to versions, but can be in a commit group. Mixing WAL related `VersionEdit`s with other types of edits will make logic in `ProcessManifestWrite` more complicated, so `VersionEdit`s related to WAL can either be WAL additions or deletions, like column family add and drop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7256
Test Plan: a set of unit tests are added in `version_set_test.cc`
Reviewed By: riversand963
Differential Revision: D23123238
Pulled By: cheng-chang
fbshipit-source-id: 246be2ed4744fd03fa2738aba408aaa611d0379c
Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D24485420
Pulled By: zhichao-cao
fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
Summary:
Remove function calling in assert statement as assert is a no
op in opt build and that function might not be called. This causes hang
in closing RocksDB when refit level is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7581
Test Plan: make check -j64
Reviewed By: riversand963
Differential Revision: D24466420
Pulled By: akankshamahajan15
fbshipit-source-id: 97db4ec5a95ae693c3290e176a3c12a9b1ad2f6d
Summary:
This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479
Test Plan: Add new unit tests
Reviewed By: pdillinger
Differential Revision: D24340671
Pulled By: anand1976
fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.
TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24260219
Pulled By: ltamasi
fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist. Added tests for the LoadOptions related methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554
Reviewed By: akankshamahajan15
Differential Revision: D24298985
Pulled By: zhichao-cao
fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
Summary:
Added unit tests that have paranoid_check = true that perform range deletions. At the moment, the deleted ranges do not appear to be checked as part of the paranoid checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7521
Reviewed By: zhichao-cao
Differential Revision: D24262175
Pulled By: ajkr
fbshipit-source-id: 1035e968f7ab8ccaa7af086b835a4e72c7e56743
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520
Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`
Reviewed By: pdillinger
Differential Revision: D24200034
Pulled By: ajkr
fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
Summary:
This option determines whether WALs will be tracked in MANIFEST and verified on recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7275
Test Plan:
db_options_test
options_test
Reviewed By: pdillinger
Differential Revision: D23181418
Pulled By: cheng-chang
fbshipit-source-id: 5dd1cdc166f3dfc1c93c094df4a2f7734e3b4547
Summary:
The `std::pair(const T1& x, const T2& y);` constructor is `constexpr`
only starting from C++14; relying on this breaks compilation on certain
compilers/platforms we need to support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7519
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D24195747
Pulled By: ltamasi
fbshipit-source-id: 665e8fbc9747675bb49c5d895aad3dcf2714750f
Summary:
The patch does some cleanup in and around the legacy `BlobLogReader` class:
* It renames the class to `BlobLogSequentialReader` to emphasize that it is for
sequentially iterating through blobs in a blob file, as opposed to doing random
point reads using `BlobIndex`es (which is `BlobFileReader`'s jurisdiction).
* It removes some dead code from the old BlobDB implementation that references
`BlobLogReader` (namely the method `BlobFile::OpenRandomAccessReader`).
* It cleans up some `#include`s and forward declarations.
* It fixes some incorrect/outdated comments related to the reader class.
* It adds a few assertions to the `Read` methods of the class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7517
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24172611
Pulled By: ltamasi
fbshipit-source-id: 43e2ae1eba5c3dd30c1070cb00f217edc45bd64f
Summary:
The patch adds a class called `BlobFileReader` that can be used to retrieve blobs
using the information available in blob references (e.g. blob file number, offset, and
size). This will come in handy when implementing blob support for `Get`, `MultiGet`,
and iterators, and also for compaction/garbage collection.
When a `BlobFileReader` object is created (using the factory method `Create`),
it first checks whether the specified file is potentially valid by comparing the file
size against the combined size of the blob file header and footer (files smaller than
the threshold are considered malformed). Then, it opens the file, and reads and verifies
the header and footer. The verification involves magic number/CRC checks
as well as checking for unexpected header/footer fields, e.g. incorrect column family ID
or TTL blob files.
Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression
type passed by the caller (because of the presence of the header and footer, the
specified offset cannot be too close to the start/end of the file; also, the compression type
has to match the one in the blob file header), and retrieves and potentially verifies and
uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set,
`BlobFileReader` reads the blob record header as well (as opposed to just the blob itself)
and verifies the key/value size, the key itself, as well as the CRC of the blob record header
and the key/value pair.
In addition, the patch exposes the compression type from `BlobIndex` (both using an
accessor and via `DebugString`), and adds a blob file read latency histogram to
`InternalStats` that can be used with `BlobFileReader`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23999219
Pulled By: ltamasi
fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
1. Number of index and filter blocks read from file as part of MultiGet
request per level.
2. Number of data blocks read from file per level.
3. Number of SST files loaded from file system per level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366
Reviewed By: anand1976
Differential Revision: D24127040
Pulled By: akankshamahajan15
fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
Summary:
If `BottommostLevelCompaction.kForce*` is set, compaction should avoid
trivial move and always compact the sst to the target size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7368
Reviewed By: ajkr
Differential Revision: D23629525
Pulled By: jay-zhuang
fbshipit-source-id: 79f23c79ecb31587e0593b28cce43131107bbcd0
Summary:
In opt mode, assertions are just no-ops. Therefore, we need to report errors instead of just doing an `assert(false)`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7483
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D24142725
Pulled By: riversand963
fbshipit-source-id: 5629556dbe29f00dd09e30a7d5df5e6cf09ee435
Summary:
`BeginWriteStall()` removes no_slowdown write from the write
list and updates `link_newer`, which makes `CreateMissingNewerLinks()`
thought all write list has valid `link_newer` and failed to create link
for all writers.
It caused flaky test and SegFault for release build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7508
Test Plan: Add unittest to reproduce the issue.
Reviewed By: anand1976
Differential Revision: D24126601
Pulled By: jay-zhuang
fbshipit-source-id: f8ac5dba653f7ee1b0950296427d4f5f8ee34a06
Summary:
Fix prefix_test so that it passes when ASSERT_STATUS_CHECKED=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7495
Test Plan: Run the test with the option
Reviewed By: anand1976
Differential Revision: D24069715
fbshipit-source-id: 54f74b58575a1b49dbdee9ea2d24751fa956b620
Summary:
This PR schedules a background thread (shared across all DB instances)
to flush info log every ten seconds. This improves debuggability in case
of RocksDB hanging since it ensures the log messages leading up to the hang
will eventually become visible in the log.
The bulk of this PR is moving monitoring/stats_dump_scheduler* to db/periodic_work_scheduler*
and making the corresponding name changes since now the scheduler handles info
log flushing, not just stats dumping.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7488
Reviewed By: riversand963
Differential Revision: D24065165
Pulled By: ajkr
fbshipit-source-id: 339c47a0ff43b79fdbd055fbd9fefbb6f9d8d3b5
Summary:
Add all status handling in db_properties_test so that it can pass ASSERT_STATUS_CHECKED.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7490
Test Plan: Run the test with ASSERT_STATUS_CHECKED
Reviewed By: jay-zhuang
Differential Revision: D24065382
fbshipit-source-id: e008916155196891478c964df0226545308ca71d
Summary:
This exposes to the listener interface whether a compaction was
full or not. Also cleaned up API comment for CompactionJobInfo::stats,
which is not of a nullable type. And since CompactionJob is always
created with non-null CompactionJobStats, removed conditionals on it
being nullptr and instead assert non-null.
TODO later: update C and Java interfaces
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451
Test Plan: updated existing unit tests to check new field, make check
Reviewed By: ltamasi
Differential Revision: D23977796
Pulled By: pdillinger
fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d
Summary:
`DBTest.FileCreationRandomFailure` frequently times out during our
continuous test runs. (It's a case of "stress test posing as unit test.")
The patch reduces the number of iterations to avoid this. Note that
the lower numbers are still sufficient to trigger both flushes and
compactions, so test coverage is still the same.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7481
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24034712
Pulled By: ltamasi
fbshipit-source-id: 8731a9446e5a121a1041b00f0df473b9f714935a
Summary:
Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated.
Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467
Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled.
Reviewed By: riversand963
Differential Revision: D24010683
fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c
Summary:
This PR addresses some build and functional issues on MSVC targets, as a step towards an eventual goal of having RocksDB build successfully for Windows on ARM64.
Addressed issues include:
- BitsSetToOne and CountTrailingZeroBits do not compile on non-x64 MSVC targets. A fallback implementation of BitsSetToOne when Intel intrinsics are not available is added, based on the C++20 `<bit>` popcount implementation in Microsoft's STL.
- The implementation of FloorLog2 for MSVC targets (including x64) gives incorrect results. The unit test easily detects this, but CircleCI is currently configured to only run a specific set of tests for Windows CMake builds, so this seems to have been unnoticed.
- AsmVolatilePause does not use YieldProcessor on Windows ARM64 targets, even though it is available.
- When CondVar::TimedWait calls Microsoft STL's condition_variable::wait_for, it can potentially trigger a bug (just recently fixed in the upcoming VS 16.8's STL) that deadlocks various tests that wait for a timer to execute, since `Timer::Run` doesn't get a chance to execute before being blocked by the test function acquiring the mutex.
- In c_test, `GetTempDir` assumes a POSIX-style temp path.
- `NormalizePath` did not eliminate consecutive POSIX-style path separators on Windows, resulting in test failures in e.g., wal_manager_test.
- Various other test failures.
In a followup PR I hope to modify CircleCI's config.yml to invoke all RocksDB unit tests in Windows CMake builds with CTest, instead of the current use of `run_ci_db_test.ps1` which requires individual tests to be specified and is missing many of the existing tests.
Notes from peterd: FloorLog2 is not yet used in production code (it's for something in progress). I also added a few more inexpensive platform-dependent tests to Windows CircleCI runs. And included facebook/folly#1461 as requested
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7439
Reviewed By: jay-zhuang
Differential Revision: D24021563
Pulled By: pdillinger
fbshipit-source-id: 0ec2027c0d6a494d8a0fe38d9667fc2f7e29f7e7
Summary:
Do not assert the number of files after intra-L0 compaction is eligible to run since it could complete (and reduce the number of files) before the assertion executes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7477
Reviewed By: pdillinger
Differential Revision: D24032049
Pulled By: ajkr
fbshipit-source-id: e838ac7a24651ebd643b9e5a9d39d2e789c46929
Summary:
This has been running in production on some key workloads, so
we believe it to be safe and extremely low cost. Nevertheless, I've
added code to ensure that "force_consistency_checks" is mentioned in
any corruption reports so that people know how to disable in case of
false positive corruption reports.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7446
Test Plan:
make check, CI, temporary debug print new message with
./version_builder_test
Reviewed By: ajkr
Differential Revision: D23972101
Pulled By: pdillinger
fbshipit-source-id: 9623e400f3752577c0ecf977e6d0915562cf9968
Summary:
Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data.
By default value is set false to prevent users data to be exposed in the messages.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420
Test Plan:
1. make check -j64
2. Add a new test case
Reviewed By: ajkr
Differential Revision: D23835028
Pulled By: akankshamahajan15
fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1
Summary:
The assertion checks that there is no overlap in sequence numbers across levels in universal compaction. However, this assumption doesn't hold when there is a delete triggered compaction or a trivial move, as they operate on a subset of a level.
Tests -
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7421
Reviewed By: ajkr
Differential Revision: D23872672
Pulled By: anand1976
fbshipit-source-id: c386deab8e01a5746ca996ff1f4ebcae3b15b7d2
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452
Test Plan: See CI tests pass.
Reviewed By: zhichao-cao
Differential Revision: D23979764
fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
Summary:
A generic algorithm in progress depends on a templatized
version of fastrange, so this change generalizes it and renames
it to fit our style guidelines, FastRange32, FastRange64, and now
FastRangeGeneric.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436
Test Plan: added a few more test cases
Reviewed By: jay-zhuang
Differential Revision: D23958153
Pulled By: pdillinger
fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8
Summary:
Possible fix for a TSAN issue reported in EnableFileDeletions.
disable_delete_obsolete_files_ should only be accessed holding the db
mutex, but for logging it was being accessed outside holding the mutex,
now fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7435
Test Plan: existing tests, watch for recurrence
Reviewed By: ltamasi
Differential Revision: D23917578
Pulled By: pdillinger
fbshipit-source-id: 8573025bca3f6fe169b24b87bbfc4ce9667b0482
Summary:
Add a method `CheckWals` in `WalSet` to check the logs on disk. See `CheckWals`'s comments.
This method will be used to check consistency of WALs during DB recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7236
Test Plan: a set of tests are added to wal_edit_test.cc.
Reviewed By: riversand963
Differential Revision: D23036505
Pulled By: cheng-chang
fbshipit-source-id: 5b1d6857ac173429b00f950c32c4a5b8d063a732
Summary:
Fix few test cases and add them in ASSERT_STATUS_CHECKED build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7427
Test Plan:
1. ASSERT_STATUS_CHECKED=1 make -j48 check,
2. travis build for ASSERT_STATUS_CHECKED,
3. Without ASSERT_STATUS_CHECKED: make check -j64, CircleCI build and travis build
Reviewed By: pdillinger
Differential Revision: D23909983
Pulled By: akankshamahajan15
fbshipit-source-id: 42d7e4aea972acb9fcddb7ca73fcb82f93272434
Summary:
Implement a parsing tool io_tracer_parser that takes IO trace file (binary file) with command line argument --io_trace_file and output file with --output_file and dumps the IO trace records in outputfile in human readable form.
Also added unit test cases that generates IO trace records and calls io_tracer_parse to parse those records.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7333
Test Plan:
make check -j64,
Add unit test cases.
Reviewed By: anand1976
Differential Revision: D23772360
Pulled By: akankshamahajan15
fbshipit-source-id: 9c20519c189362e6663352d08863326f3e496271
Summary:
This option is apparently used by some teams within Facebook
(internal ref T75998621)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7431
Test Plan: USE_CLANG=1 make check before (fails) and after
Reviewed By: jay-zhuang
Differential Revision: D23876584
Pulled By: pdillinger
fbshipit-source-id: abb8b67a1f1aac75327944d266e284b2b6727191
Summary:
There are some tricky behaviors related to WAL sync:
- When creating a WAL, the WAL might not be synced, if the WAL directory is not synced, the WAL file's metadata may not even be synced to disk, so during recovery, when listing the WAL directory, the WAL may not even show up.
- During each DB::Write, the WriteOption can control whether the WAL should be synced, so a WAL previously not synced on creation can be synced during Write.
For each `SyncWAL`, we'll track the synced status and the current WAL size. Previously, we only track the WAL size on closing.
During recovery, we check that the on-disk WAL size is >= the last synced size.
So this PR introduces `synced_size` and `closed` to `WalMetadata` for the above design update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7414
Test Plan:
- updated wal_edit_test
- updated version_edit_test
Reviewed By: riversand963
Differential Revision: D23796127
Pulled By: cheng-chang
fbshipit-source-id: 5498ab80f537c48a10157e71a4745716aef5cf30
Summary:
Current implementation holds db mutex while calling
`GetAggregatedIntProperty()`. For property kEstimateTableReadersMem,
this can be expensive, especially if the number of table readers is
high.
We can release and re-acquire db mutex if
property_info.need_out_of_mutex is true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7412
Test Plan:
make check
COMPILE_WITH_ASAN=1 make check
COMPILE_WITH_TSAN=1 make check
Also test internally on a shadow host. Used bpf to verify the
excessively long db mutex holding no longer exists when applications
call GetApproximateMemoryUsageByType().
Reviewed By: jay-zhuang
Differential Revision: D23794824
Pulled By: riversand963
fbshipit-source-id: 6bc02a59fd25613d343a62cf817467c7122c9721
Summary:
Make the test robust to spurious wakeups on condition variable,
and clear sync points to ensure no use-after-free.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7418
Test Plan: repeated runs on updated test, watch CircleCI for recurrence
Reviewed By: jay-zhuang
Differential Revision: D23828823
Pulled By: pdillinger
fbshipit-source-id: af85117d9c02602541a90252840e0e5a6996de5b
Summary:
Fix the flaky test failure in error_handler_fs_test. Add the sync point, solve the dependency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7415
Test Plan: make asan_check, ~/gtest-parallel/gtest-parallel -r 100 ./error_handler_fs_test
Reviewed By: siying
Differential Revision: D23804330
Pulled By: zhichao-cao
fbshipit-source-id: 5175108651f7652e47e15978f2a9c1669ef59d80
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310
Test Plan: adding new unit test, pass make check.
Reviewed By: anand1976
Differential Revision: D23710892
Pulled By: zhichao-cao
fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.
This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.
We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).
We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.
Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`
Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.
Actual file name patterns and priorities, as regexes:
kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
[0-9]+_[0-9]+_[0-9]+[.]sst
kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
[0-9]+_[0-9a-fA-F-]+[.]sst
kUseDbSessionId AND NOT kFlagIncludeFileSize ->
[0-9]+_s[0-9A-Z]{20}[.]sst
kUseDbSessionId AND kFlagIncludeFileSize (default) ->
[0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst
We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.
This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400
Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.
Reviewed By: ajkr
Differential Revision: D23759587
Pulled By: pdillinger
fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
Summary:
`IntraL0CompactionAfterFlushCheckConsistencyFail` was flaky by sometimes failing due to no intra-L0 compactions happening. I was able to repro it by putting a `sleep(1)` in the compaction thread before it grabs the lock and picks a compaction. This also showed other intra-L0 tests are affected too, although some of them exhibit hanging forever rather than failing.
The problem was that all the flushes/ingestions could finish before any compaction got picked, so it would end up simply picking all the files that the test generates for L0->L1. But, these tests intend only the first few files to be picked for L0->L1, and the subsequent files to be picked for intra-L0. This PR adjusts the sync points of all the intra-L0 tests to enforce this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7382
Test Plan: run all the `db_compaction_test`s with and without the artificial `sleep()`
Reviewed By: jay-zhuang
Differential Revision: D23684985
Pulled By: ajkr
fbshipit-source-id: 6508399030dddec7738e9853a7b3dc53ef77a584
Summary:
The patch adds support for extracting large values into blob files when
performing a flush during recovery (when `avoid_flush_during_recovery` is
`false`). Blob files are built and added to the `Version` similarly to flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7388
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23709912
Pulled By: ltamasi
fbshipit-source-id: ce48b4227849cf25429ae98574e72b0e1cb9c67d
Summary:
Cleaned up the public API to use the EncryptedEnv. This change will allow providers to be developed and added to the system easier in the future. It will also allow better integration in the future with the OPTIONS file.
- The internal classes were moved out of the public API into an internal "env_encryption_ctr.h" header. Short-cut constructors were added to provide the original API functionality.
- The APIs to the constructors were changed to take shared_ptr, rather than raw pointers or references to allow better memory management and alternative implementations.
- CreateFromString methods were added to allow future expansion to other provider and cipher implementations through a standard API.
Additionally, there was a code duplication in the NewXXXFile methods. This common code was moved under a templatized function.
A first-pass at structuring the code was made to potentially allow multiple EncryptionProviders in a single EncryptedEnv. The idea was that different providers may use different cipher keys or different versions/algorithms. The EncryptedEnv should have some means of picking different providers based on information. The groundwork was started for this (the use of the provider_ member variable was localized) but the work has not been completed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7279
Reviewed By: jay-zhuang
Differential Revision: D23709440
Pulled By: zhichao-cao
fbshipit-source-id: 0e845fff0e03a52603eb9672b4ade32d063ff2f2
Summary:
The patch adds support for writing blob files during flush by integrating
`BlobFileBuilder` with the flush logic, most importantly, `BuildTable` and
`CompactionIterator`. If `enable_blob_files` is set, large values are extracted
to blob files and replaced with references. The resulting blob files are then
logged to the MANIFEST as part of the flush job's `VersionEdit` and
added to the `Version`, similarly to table files. Errors related to writing
blob files fail the flush, and any blob files written by such jobs are immediately
deleted (again, similarly to how SST files are handled). In addition, the patch
extends the logging and statistics around flushes to account for the presence
of blob files (e.g. `InternalStats::CompactionStats::bytes_written`, which is
used for calculating write amplification, now considers the blob files as well).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7345
Test Plan: Tested using `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D23506369
Pulled By: ltamasi
fbshipit-source-id: 646885f22dfbe063f650d38a1fedc132f499a159