Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).
The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.
When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748
Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.
Reviewed By: pdillinger
Differential Revision: D25754492
Pulled By: ajkr
fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
Summary:
Removed the uses of the Legacy FileWrapper classes from the source code. The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem.
Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use the the CustomEnv class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851
Reviewed By: anand1976
Differential Revision: D26114816
Pulled By: mrambacher
fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035
Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
- If the GIT branch is "clean" (no changes), the date of the last git commit
- If the branch is not clean, the current date
- Added APIs to access the "build information", rather than accessing the strings directly.
The build_version.cc file is now regenerated whenever the library objects are rebuilt.
Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866
Reviewed By: pdillinger
Differential Revision: D26086565
Pulled By: mrambacher
fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
Summary:
During recovery, RocksDB performs a kind of dummy flush; namely, entries
from the WAL are added to memtables, which then get written to SSTs and
blob files (if enabled) just like during a regular flush. Note that
multiple memtables might be flushed during recovery for the same column
family, for example, if the DB is reopened with a lower write buffer size,
and therefore, we need to make sure to collect all SST and blob file
additions. The patch fixes a bug in the earlier logic which resulted in
later blob file additions overwriting earlier ones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7903
Test Plan: Added a unit test and ran `db_stress`.
Reviewed By: jay-zhuang
Differential Revision: D26110847
Pulled By: ltamasi
fbshipit-source-id: eddb50a608a88f54f3cec3a423de8235aba951fd
Summary:
When retryable IO error occurs during compaction, it is mapped to soft error and set the BG error. However, auto resume is not called to clean the soft error since compaction will reschedule by itself. In this change, When retryable IO error occurs during compaction, BG error is not set. User will be informed the error via EventHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899
Test Plan: tested with error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D26094097
Pulled By: zhichao-cao
fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f
Summary:
The recovery thread could hold the db.mutex, which is needed from sync
write in main thread.
Make sure the write is done before recovery thread starts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7897
Test Plan: `gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.WALWriteRetryableErrorAutoRecover1 -r 10000 --workers=200`
Reviewed By: zhichao-cao
Differential Revision: D26082933
Pulled By: jay-zhuang
fbshipit-source-id: 226fc49228c0e5903f86ff45cc3fed3080abdb1f
Summary:
The error recovery thread may out-live DBImpl object, which causing
access released DBImpl.mutex. Close SstFileManager before closing DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7896
Test Plan:
the issue can be reproduced by adding sleep in recovery code.
Pass the tests with sleep.
Reviewed By: zhichao-cao
Differential Revision: D26076655
Pulled By: jay-zhuang
fbshipit-source-id: 0d9cc5639c12fcfc001427015e75a9736f33cd96
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
1. In IOTracing, add filename with each IOTrace record. Filename is stored in file object (Tracing Wrappers).
2. Change the logic of figuring out which additional information (file_size,
length, offset etc) needs to be store with each operation
which is different for different operations.
When new information will be added in future (depends on operation),
this change would make the future additions simple.
Logic: In IOTraceRecord, io_op_data is added and its
bitwise positions represent which additional information need
to added in the record from enum IOTraceOp. Values in IOTraceOp represent bitwise positions.
So if length and offset needs to be stored (IOTraceOp::kIOLen
is 1 and IOTraceOp::kIOOffset is 2), position 1 and 2 (from rightmost bit) will be set
and io_op_data will contain 110.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7885
Test Plan: Updated io_tracer_test and verified the trace file manually.
Reviewed By: anand1976
Differential Revision: D25982353
Pulled By: akankshamahajan15
fbshipit-source-id: ebfc5539cc0e231d7794a6b42b73f5403e360b22
Summary:
In the original stacked BlobDB implementation, which writes blobs to blob files
immediately and treats blob files as logs, it makes sense to flush the file after
writing each blob to protect against process crashes; however, in the integrated
implementation, which builds blob files in the background jobs, this unnecessarily
reduces performance. This patch fixes this by simply adding a `do_flush` flag to
`BlobLogWriter`, which is set to `true` by the stacked implementation and to `false`
by the new code. Note: the change itself is trivial but the tests needed some work;
since in the new implementation, blobs are now buffered, adding a blob to
`BlobFileBuilder` is no longer guaranteed to result in an actual I/O. Therefore, we can
no longer rely on `FaultInjectionTestEnv` when testing failure cases; instead, we
manipulate the return values of I/O methods directly using `SyncPoint`s.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7892
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D26022814
Pulled By: ltamasi
fbshipit-source-id: b3dce419f312137fa70d84cdd9b908fd5d60d8cd
Summary:
…when unused. Causes many calls to clock_gettime, impacting performance.
Was looking for something else via Linux "perf" command when I spotted heavy usage of clock_gettime during a compaction. Our product heavily uses the rocksdb::Options::merge_operator. MergeHelper::FilterMerge() properly tests if timing is enabled/disabled upon entry, but not on exit. This patch fixes the exit.
Note: the entry test also verifies if "nullptr!=stats_". This test is redundant to code within ShouldReportDetailedTime(). Therefore I omitted it in my change.
merge_test.cc updated with test that shows failure before merge_helper.cc change ... and fix after change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7867
Reviewed By: jay-zhuang
Differential Revision: D25960175
Pulled By: ajkr
fbshipit-source-id: 56e66d7eb6ae5eae89c8e0d5a262bd2905a226b6
Summary:
This provides a workaround for two race conditions that will be fixed in
a more sophisticated way later. This PR:
(1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe.
(2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888
Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881
Reviewed By: jay-zhuang
Differential Revision: D25990891
Pulled By: ajkr
fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38
Summary:
BlobFileAddition and BlobFileGarbage should not be in the ignorable tag
range, since if they are present in the MANIFEST, users cannot downgrade
to a RocksDB version that does not understand them without losing access
to the data in the blob files. The patch moves these two tags to the
unignorable range; this should still be safe at this point, since the
integrated BlobDB project is still work in progress and thus there
shouldn't be any ignorable BlobFileAddition/BlobFileGarbage tags out
there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7886
Test Plan: `make check`
Reviewed By: cheng-chang
Differential Revision: D25980956
Pulled By: ltamasi
fbshipit-source-id: 13cf5bd61d77f049b513ecd5ad0be8c637e40a9d
Summary:
Although the tags for `WalAddition`, `WalDeletion` are after `kTagSafeIgnoreMask`, to actually be able to skip these entries in older versions of RocksDB, we require that they are encoded with their encoded size as the prefix. This requirement is not met in the current codebase, so a downgraded DB may fail to open if these entries exist in the MANIFEST.
If a DB wants to downgrade, and its MANIFEST contains `WalAddition` or `WalDeletion`, it can set `track_and_verify_wals_in_manifest` to `false`, then restart twice, then downgrade. On the first restart, a new MANIFEST will be created with a `WalDeletion` indicating that all previously tracked WALs are removed from MANIFEST. On the second restart, since there is no tracked WALs in MANIFEST now, a new MANIFEST will be created with neither `WalAddition` nor `WalDeletion`. Then the DB can downgrade.
Tags for `BlobFileAddition`, `BlobFileGarbage` also have the same problem, but this PR focuses on solving the problem for WAL edits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7873
Test Plan: Added a `VersionEditTest::IgnorableTags` unit test to verify all entries with tags larger than `kTagSafeIgnoreMask` can actually be skipped and won't affect parsing of other entries.
Reviewed By: ajkr
Differential Revision: D25935930
Pulled By: cheng-chang
fbshipit-source-id: 7a02fdba4311d6084328c14aed110a26d08c3efb
Summary:
The WAL's file size is stored as an unsigned 64 bit integer.
In db_info_dumper.cc, this integer gets converted to a string. Since 2^64 is approximately 10^19, we need 20 digits to represent the integer correctly. To store the decimal representation, we need 21 bytes (+1 due to the '\0' terminator at the end). The code previously used 16 bytes, which would overflow if the log is really big (>1 petabyte).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7870
Reviewed By: ajkr
Differential Revision: D25938776
Pulled By: jay-zhuang
fbshipit-source-id: 6ee9e21ebd65d297ea90fa1e7e74f3e1c533299d
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.
Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819
Reviewed By: ajkr
Differential Revision: D25837394
Pulled By: jay-zhuang
fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
Summary:
In RocksDB, when IO error happens, the flags of IOStatus can be set. If the IOStatus is set as "File Scope IO Error", it indicate that the error is constrained in the file level. Since RocksDB does not continues write data to a file when any IO Error happens, File Scope IO Error can be treated the same as Retryable IO Error. Adding the logic to ErrorHandler::SetBGError to include the file scope IO Error in its error handling logic, which is the same as retryable IO Error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7840
Test Plan: added new unit tests in error_handler_fs_test. make check
Reviewed By: anand1976
Differential Revision: D25820481
Pulled By: zhichao-cao
fbshipit-source-id: 69cabd3d010073e064d6142ce1cabf341b8a6806
Summary:
The IOStatus of TableBuilder is returned by copy the io status from builder->io_status(). pr https://github.com/facebook/rocksdb/issues/7718 swallowed the io status and it will cause the write IO error become non-retryable and no auto resume logic will handle it. Roll back to previous implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7838
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D25795387
Pulled By: zhichao-cao
fbshipit-source-id: bc35e69e0b71aa4148a6ed76f073357041b8e372
Summary:
This PR does the following:
-> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.
With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system
- Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself.
With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction.
Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703
Reviewed By: zhichao-cao
Differential Revision: D25762190
Pulled By: anand1976
fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
Summary:
Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones. The StringSink and StringSource classes were changed to be based on the corresponding FS file classes.
Part of a cleanup to use the newer interfaces. This change also eliminates some of the casts/wrappers to LegacyFile classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786
Reviewed By: jay-zhuang
Differential Revision: D25761460
Pulled By: anand1976
fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f
Summary:
Previously we only had a debug assertion to check the right generator was being used for verification. However a user hit a problem in production where their factory was creating the wrong generator for some files, leading to checksum mismatches. It would have been easier to debug if we verified in optimized builds that the generator with the proper name is used. This PR adds such verification.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7824
Reviewed By: zhichao-cao
Differential Revision: D25740254
Pulled By: ajkr
fbshipit-source-id: a6231521747605021bad3231484b5d4f99f4044f
Summary:
The returned Status is ignored here as some stress tests are failing, presumably when attempting to add an empty file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7826
Reviewed By: jay-zhuang
Differential Revision: D25742931
fbshipit-source-id: a1fcd620d9472993a009929306dfc421f93eb43b
Summary:
In GenerateOneFileChecksum(), RocksDB reads the file and computes its checksum. A rate limiter can be passed to the constructor of RandomAccessFileReader so that read I/O can be rate limited.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7811
Test Plan: make check
Reviewed By: cheng-chang
Differential Revision: D25699896
Pulled By: zhichao-cao
fbshipit-source-id: e2688bc1126c543979a3bcf91dda784bd7b74164
Summary:
Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds. This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it. In this case, without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798
Reviewed By: jay-zhuang
Differential Revision: D25680451
Pulled By: pdillinger
fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917
Summary:
RetrieveMultipleBlocks which is used by MultiGet to read data blocks is not updating num_data_read stat in
GetContextStats.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7770
Test Plan: make check -j64
Reviewed By: anand1976
Differential Revision: D25538982
Pulled By: akankshamahajan15
fbshipit-source-id: e3daedb035b1be8ab6af6f115cb3793ccc7b1ec6
Summary:
This test would occasionally fail like this:
WARNING: c:\users\circleci\project\db\db_test.cc(1343): error: Expected:
(dbfull()->TEST_MaxNextLevelOverlappingBytes(handles_[1])) <= (20 * 1048576), actual: 33501540 vs 20971520
And being a super old test, it's not structured in a sound way. And it appears that DBTest2.MaxCompactionBytesTest is a better test of what SparseMerge was intended to test. In fact, SparseMerge fails if I set
options.max_compaction_bytes = options.target_file_size_base * 1000;
Thus, we are removing this negative-value test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7800
Test Plan: Q.E.D.
Reviewed By: ajkr
Differential Revision: D25693366
Pulled By: pdillinger
fbshipit-source-id: 9da07d4dce0559547fc938b2163a2015e956c548
Summary:
`CompactedDB` is a kind of read-only DB, so it shouldn't support `SyncWAL`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7788
Test Plan: watch existing tests to pass.
Reviewed By: akankshamahajan15
Differential Revision: D25661209
Pulled By: cheng-chang
fbshipit-source-id: 9eb2cc3f73736dcc205c8410e5944aa203f002d3
Summary:
We saw DBWALTestWithParam/DBWALTestWithParam.WALTrashCleanupOnOpen sometimes fail with:
db/db_sst_test.cc:575: Failure
Expected: (trash_log_count) >= (1), actual: 0 vs 1
The suspicious is that delete scheduling actually deleted all trash files based on rate, but it is not expected. This can be reproduced if we manually add sleep after DB is closed for serveral seconds. Minimize its chance by setting the delete rate to be lowest possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7796
Test Plan: The test doesn't fail with the manual sleeping anymore
Reviewed By: anand1976
Differential Revision: D25675000
fbshipit-source-id: a39fd05e1a83719c41014e48843792e752368e22
Summary:
So that we can more easily get aggregate live table data such
as total filter, index, and data sizes.
Also adds ldb support for getting properties
Also fixed some missing/inaccurate related comments in db.h
For example:
$ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
rocksdb.aggregated-table-properties.data_size: 102871
rocksdb.aggregated-table-properties.filter_size: 0
rocksdb.aggregated-table-properties.index_partitions: 0
rocksdb.aggregated-table-properties.index_size: 2232
rocksdb.aggregated-table-properties.num_data_blocks: 100
rocksdb.aggregated-table-properties.num_deletions: 0
rocksdb.aggregated-table-properties.num_entries: 15000
rocksdb.aggregated-table-properties.num_merge_operands: 0
rocksdb.aggregated-table-properties.num_range_deletions: 0
rocksdb.aggregated-table-properties.raw_key_size: 288890
rocksdb.aggregated-table-properties.raw_value_size: 198890
rocksdb.aggregated-table-properties.top_level_index_size: 0
$ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
rocksdb.aggregated-table-properties-at-level1.data_size: 80909
rocksdb.aggregated-table-properties-at-level1.filter_size: 0
rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
rocksdb.aggregated-table-properties-at-level1.index_size: 1787
rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
$
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779
Test Plan: Added a test to ldb_test.py
Reviewed By: jay-zhuang
Differential Revision: D25653103
Pulled By: pdillinger
fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
Summary:
In the write path, there is an optimization: when a new WAL is created during SwitchMemtable, we update the internal log number of the empty column families to the new WAL. `FindObsoleteFiles` marks a WAL as obsolete if the WAL's log number is less than `VersionSet::MinLogNumberWithUnflushedData`. After updating the empty column families' internal log number, `VersionSet::MinLogNumberWithUnflushedData` might change, so some WALs might become obsolete to be purged from disk.
For example, consider there are 3 column families: 0, 1, 2:
1. initially, all the column families' log number is 1;
2. write some data to cf0, and flush cf0, but the flush is pending;
3. now a new WAL 2 is created;
4. write data to cf1 and WAL 2, now cf0's log number is 1, cf1's log number is 2, cf2's log number is 2 (because cf1 and cf2 are empty, so their log numbers will be set to the highest log number);
5. now cf0's flush hasn't finished, flush cf1, a new WAL 3 is created, and cf1's flush finishes, now cf0's log number is 1, cf1's log number is 3, cf2's log number is 3, since WAL 1 still contains data for the unflushed cf0, no WAL can be deleted from disk;
6. now cf0's flush finishes, cf0's log number is 2 (because when cf0 was switching memtable, WAL 3 does not exist yet), cf1's log number is 3, cf2's log number is 3, so WAL 1 can be purged from disk now, but WAL 2 still cannot because `MinLogNumberToKeep()` is 2;
7. write data to cf2 and WAL 3, because cf0 is empty, its log number is updated to 3, so now cf0's log number is 3, cf1's log number is 3, cf2's log number is 3;
8. now if the background threads want to purge obsolete files from disk, WAL 2 can be purged because `MinLogNumberToKeep()` is 3. But there are only two flush results written to MANIFEST: the first is for flushing cf1, and the `MinLogNumberToKeep` is 1, the second is for flushing cf0, and the `MinLogNumberToKeep` is 2. So without this PR, if the DB crashes at this point and try to recover, `WalSet` will still expect WAL 2 to exist.
When WAL tracking is enabled, we assume WALs will only become obsolete after a flush result is written to MANIFEST in `MemtableList::TryInstallMemtableFlushResults` (or its atomic flush counterpart). The above situation breaks this assumption.
This PR tracks WAL obsoletion if necessary before updating the empty column families' log numbers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7781
Test Plan:
watch existing tests and stress tests to pass.
`make -j48 blackbox_crash_test` on devserver
Reviewed By: ltamasi
Differential Revision: D25631695
Pulled By: cheng-chang
fbshipit-source-id: ca7fff967bdb42204b84226063d909893bc0a4ec
Summary:
When ConcurrentTaskLimiter is enabled and there are too many outstanding compactions, BackgroundCompaction returns Status::Busy(), which shouldn't be treat as compaction failure.
This caused performance issue when outstanding compactions reached the limit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7739
Reviewed By: cheng-chang
Differential Revision: D25508319
Pulled By: ltamasi
fbshipit-source-id: 3b181b16ada0ca3393cfa3a7412985764e79c719
Summary:
sst file number in corruption error would be very useful for debugging
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7767
Reviewed By: zhichao-cao
Differential Revision: D25485872
Pulled By: jay-zhuang
fbshipit-source-id: 67315b582cedeefbce6676015303ebe5bf6526a3
Summary:
The patch adds initial support for reading blobs to the batched `MultiGet` API.
The current implementation simply retrieves the blob values as the blob indexes
are encountered; that is, reads from blob files are currently not batched. (This
will be optimized in a separate phase.) In addition, the patch removes some dead
code related to BlobDB from the batched `MultiGet` implementation, namely the
`is_blob` / `is_blob_index` flags that are passed around in `DBImpl` and `MemTable` /
`MemTableListVersion`. These were never hooked up to anything and wouldn't
work anyways, since a single flag is not sufficient to communicate the "blobness"
of multiple key-values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7766
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D25479290
Pulled By: ltamasi
fbshipit-source-id: 7aba2d290e31876ee592bcf1adfd1018713a8000
Summary:
Uncommon bug seen by ASAN with
ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily, if the last two
references to a ColumnFamilyData are both SuperVersions (during
InstallSuperVersion). The fix is to use UnrefAndTryDelete even in
SuperVersion::Cleanup but with a parameter to avoid re-entering Cleanup
on the same SuperVersion being cleaned up.
ColumnFamilyData::Unref is considered unsafe so removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7749
Test Plan: ./column_family_test --gtest_filter=*LiveIter* --gtest_repeat=100
Reviewed By: jay-zhuang
Differential Revision: D25354304
Pulled By: pdillinger
fbshipit-source-id: e78f3a3f67c40013b8432f31d0da8bec55c5321c
Summary:
min_wal_number_to_keep should not be decreasing, if it does not increase, then there is no need to log the WAL obsoletions in MANIFEST since a previous one has been logged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7765
Test Plan: watch existing tests and stress tests to pass
Reviewed By: pdillinger
Differential Revision: D25462542
Pulled By: cheng-chang
fbshipit-source-id: 0085fcb6edf5cf2b0fc32f9932a7566f508768ff
Summary:
When two phase commit is enabled, `VersionSet::min_log_number_to_keep_2pc` is set during flush.
But when a new MANIFEST is created, the `min_log_number_to_keep_2pc` is not carried over to the new MANIFEST. So if a new MANIFEST is created and then DB is reopened, the `min_log_number_to_keep_2pc` will be lost. This may cause DB recovery errors.
The bug is reproduced in a new unit test in `version_set_test.cc`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7747
Test Plan: The new unit test in `version_set_test.cc` should pass.
Reviewed By: jay-zhuang
Differential Revision: D25350661
Pulled By: cheng-chang
fbshipit-source-id: eee890d5b19f15769069670692e270ae31044ece
Summary:
Ensure that when direct IO is enabled and a compressed block cache is
configured, MultiGet inserts compressed data blocks into the compressed
block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7756
Test Plan: Add unit test to db_basic_test
Reviewed By: cheng-chang
Differential Revision: D25416240
Pulled By: anand1976
fbshipit-source-id: 75d57526370c9c0a45ff72651f3278dbd8a9086f
Summary:
If WAL tracking was enabled, then disabled during reopen, the previously tracked WALs should be removed from MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7757
Test Plan: a new unit test `DBBasicTest.DisableTrackWal` is added.
Reviewed By: jay-zhuang
Differential Revision: D25410508
Pulled By: cheng-chang
fbshipit-source-id: 9d8d9e665066135930a7c1035bb8c2f68bded6a0
Summary:
Currently, when a WAL becomes obsolete after flushing, if VersionSet::WalSet does not contain the WAL, we do not track the WAL obsoletion event in MANIFEST.
But consider this case:
* WAL 10 is synced, a VersionEdit is LogAndApplied to MANIFEST to log this WAL addition event, but the VersionEdit is not applied to WalSet yet since its corresponding ManifestWriter is still pending in the write queue;
* Since the above ManifestWriter is blocking, the LogAndApply will block on a conditional variable and release the db mutex, so another LogAndApply can proceed to enqueue other VersionEdits concurrently;
* Now flush happens, and WAL 10 becomes obsolete, although WalSet does not contain WAL 10 yet, we should call LogAndApply to enqueue a VersionEdit to indicate the obsoletion of WAL 10;
* otherwise, when the queued edit indicating WAL 10 addition is logged to MANIFEST, and DB crashes and reopens, the WAL 10 might have been removed from disk, but it still exists in MANIFEST.
This PR changes the behavior to: always `LogAndApply` any WAL addition or obsoletion event, without considering the order issues caused by concurrency, but when applying the edits to `WalSet`, do not add the WALs if they are already obsolete. In this approach, the logical events of WAL addition and obsoletion are always tracked in MANIFEST, so we can inspect the MANIFEST and know all the previous WAL events, but we choose to ignore certain events due to the concurrency issues such as the case above, or the case in https://github.com/facebook/rocksdb/pull/7725.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7759
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D25423089
Pulled By: cheng-chang
fbshipit-source-id: 9cb9a7fbc1875bf954f2a42f9b6cfd6d49a7b21c
Summary:
Consider the case:
1. All column families are flushed, so all WALs become obsolete, but no WAL is removed from disk yet because the removal is asynchronous, a VersionEdit is written to MANIFEST indicating that WALs before a certain WAL number are obsolete, let's say this number is 3;
2. `SyncWAL` is called, so all the on-disk WALs are synced, and if track_and_verify_wal_in_manifest=true, the WALs will be tracked in MANIFEST, let's say the WAL numbers are 1 and 2;
3. DB crashes;
4. During DB recovery, when replaying MANIFEST, we first see that WAL with number < 3 are obsolete, then we see that WAL 1 and 2 are synced, so according to current implementation of `WalSet`, the `WalSet` will be recovered to include WAL 1 and 2;
5. WAL 1 and 2 are asynchronously deleted from disk, then the WAL verification algorithm fails with `Corruption: missing WAL`.
The above case is reproduced in a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal`.
The fix is to maintain the upper bound of the obsolete WAL numbers, any WAL with number less than the maintained number is considered to be obsolete, so shouldn't be tracked even if they are later synced. The number is maintained in `WalSet`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7725
Test Plan:
1. a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal` is added.
2. run `make crash_test` on devserver.
Reviewed By: riversand963
Differential Revision: D25238914
Pulled By: cheng-chang
fbshipit-source-id: f5dccd57c3d89f19565ec5731f2d42f06d272b72
Summary:
This PR removes a nested loop inside ProcessManifestWrites. The new
implementation has the same behavior as the old code with simpler logic
and lower complexity.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7751
Test Plan:
make check
Run make crash_test on devserver and succeeds 3 times.
Reviewed By: ltamasi
Differential Revision: D25363526
Pulled By: riversand963
fbshipit-source-id: 27e681949dacd7501a752e5e517b9e85b54ccb2e
Summary:
This change eliminates the need for a lot of the PermitUncheckedError calls on return from ErrorHandler methods. The calls are no longer needed as the status is returned as a reference rather than a copy. Additionally, this means that the originating status (recovery_error_, bg_error_) is not cleared implicitly as a result of calling one of these methods.
For this class, I do not know if the proper behavior should be to call PermitUncheckedError in the destructor or if the checked state should be cleared when the status is cleared. I did tests both ways. Without the code in the destructor, the status will need to be cleared in at least some of the places where it is set to OK. When running tests, I found no instances where this class was destructed with a non-OK, non-checked Status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7539
Reviewed By: anand1976
Differential Revision: D25340565
Pulled By: pdillinger
fbshipit-source-id: 1730c035c81a475875ea745226112030ec25136c
Summary:
`googletest` uses exceptions to communicate assertion failures when
`GTEST_THROW_ON_FAILURE` is set, which does not go well with
`std::thread`s, since an exception escaping the top-level function of an
`std::thread` object or an `std::thread` getting destroyed without
having been `join`ed or `detach`ed first results in a call to
`std::terminate`. The patch fixes this by moving the `Status` assertions
of background operations in `ExternalSstFileTest.PickedLevelBug` to the
main thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7754
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D25383808
Pulled By: ltamasi
fbshipit-source-id: 32fb2721e5169ec898d218900bc0d83eead45d03
Summary:
Handle misuse of snprintf return value to avoid Out of bound
read/write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7686
Test Plan: make check -j64
Reviewed By: riversand963
Differential Revision: D25030831
Pulled By: akankshamahajan15
fbshipit-source-id: 1a1d181c067c78b94d720323ae00b79566b57cfa
Summary:
Added a fix for the failure of
DBTest2.PartitionedIndexUserToInternalKey on ppc64le in travis
Closes https://github.com/facebook/rocksdb/issues/7746
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7752
Test Plan:
Ran travis job multiple times and it passed. Will keep
watching the travis job after this patch.
Reviewed By: pdillinger
Differential Revision: D25373130
Pulled By: akankshamahajan15
fbshipit-source-id: fa0e3f85f75b687415044a506e42cc38ead87975
Summary:
Following https://github.com/facebook/rocksdb/issues/7655 and https://github.com/facebook/rocksdb/issues/7657, this PR adds `full_history_ts_low_` to `ColumnFamilyData`.
`ColumnFamilyData::full_history_ts_low_` will be used to create `FlushJob` and `CompactionJob`.
`ColumnFamilyData::full_history_ts_low` is persisted to the MANIFEST file. An application can only
increase its value. Consider the following case:
>
> The database has a key at ts=950. `full_history_ts_low` is first set to 1000, and then a GC is triggered
> and cleans up all data older than 1000. If the application sets `full_history_ts_low` to 900 afterwards,
> and tries to read at ts=960, the key at 950 is not seen. From the perspective of the read, the result
> is hard to reason. For simplicity, we just do now allow decreasing full_history_ts_low for now.
>
During recovery, the value of `full_history_ts_low` is restored for each column family if applicable. Note that
version edits in the MANIFEST file for the same column family may have `full_history_ts_low` unsorted due
to the potential interleaving of `LogAndApply` calls. Only the max will be used to restore the state of the
column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7740
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D25296217
Pulled By: riversand963
fbshipit-source-id: 24acda1df8262cd7cfdc6ce7b0ec56438abe242a
Summary:
The patch adds iterator support to the integrated BlobDB implementation.
Whenever a blob reference is encountered during iteration, the corresponding
blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
(formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
BlobDB implementation, which has its own blob file handling/blob retrieval logic.)
In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
message `"BlobDB does not support merge operator."` when encountering a
blob reference while performing a merge (instead of potentially returning a
message that implies the database should be opened using the stacked BlobDB's
`Open`.)
TODO: We can implement support for lazily retrieving the blob value (or in other
words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
supports lazy values).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D25256293
Pulled By: ltamasi
fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
Summary:
In current code base, in FlushMemtable, when `(Flush_reason == FlushReason::kErrorRecoveryRetryFlush && (!cfd->mem()->IsEmpty() || !cached_recoverable_state_empty_.load()))`, we assert that cfd->imm()->NumNotFlushed() > 0. However, there are some corner cases that can fail this assert: 1) if there are multiple CFs, some CF has immutable memtable, some CFs don't. In ResumeImpl, all CFs will call FlushMemtable, which will hit the assert. 2) Regular flush is scheduled and running, the resume thread is waiting. New KVs are inserted and SchedulePendingFlush is called. Regular flush will continue call MaybeScheduleFlushAndCompaction until all the immutable memtables are flushed. When regular flush ends and auto resume thread starts to schedule new flushes, cfd->imm()->NumNotFlushed() can be 0.
Remove the assert and added the comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7744
Test Plan: make check and pass the stress test
Reviewed By: riversand963
Differential Revision: D25340573
Pulled By: zhichao-cao
fbshipit-source-id: eac357bdace660247c197f01a9ff6857e3c97672
Summary:
In error_handler auto recovery case, if recovery_in_prog_ is false, the recover is finished or failed. In this case, the auto recovery thread should finish its execution so recovery_thread_ should be null. However, in some cases, it is not null, the caller should not directly returned. Instead, it should wait for a while and create a new thread to execute the new recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7700
Test Plan: make check, error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D25098233
Pulled By: zhichao-cao
fbshipit-source-id: 5a1cba234ca18f6dd5d1be88e02d66e1d5ce931b
Summary:
When 2 phase commit is enabled, if there are prepared data in a WAL, the WAL should be kept, the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information is not written to MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570
Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds.
Reviewed By: riversand963
Differential Revision: D24394222
Pulled By: cheng-chang
fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116
Summary:
In the current code base, all the manifest writes with IO error will be set with reason: BackgroundErrorReason::kManifestWrite, which will be mapped to the kHardError if the IO Error is retryable. However, if the system does not use the WAL, all the retryable IO error should be mapped to kSoftError. Create this PR to handle is special case by adding kManifestWriteNoWAL to BackgroundErrorReason.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7693
Test Plan: make check, add new testing cases to error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D25066204
Pulled By: zhichao-cao
fbshipit-source-id: d59553896c2eac3fb37c05238544d2b265379462
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D25015421
Pulled By: jay-zhuang
fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
Summary:
https://github.com/facebook/rocksdb/issues/7340 reports and reproduces an assertion failure caused by a combination of the following:
- atomic flush is disabled.
- a column family can appear multiple times in the flush queue at the same time. This behavior was introduced in release 5.17.
Consequently, it is possible that two flushes race with each other. One bg flush thread flushes all memtables. The other thread calls `FlushMemTableToOutputFile()` afterwards, and hits the assertion error below.
```
assert(cfd->imm()->NumNotFlushed() != 0);
assert(cfd->imm()->IsFlushPending());
```
Fix this by reverting the behavior. In non-atomic-flush case, a column family can appear in the flush queue at most once at the same time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7362
Test Plan:
make check
Also run stress test successfully for 10 times.
```
make crash_test
```
Reviewed By: ajkr
Differential Revision: D25172996
Pulled By: riversand963
fbshipit-source-id: f1559b6366cc609e961e3fc83fae548f1fad08ce
Summary:
WAL may be truncated to an incomplete record due to crash while writing
the last record or corruption. In the former case, no hole will be
produced since no ACK'd data was lost. In the latter case, a hole could
be produced without this PR since we proceeded to recover the next WAL
as if nothing happened. This PR changes the record reading code to
always report a corruption for incomplete records in
`kPointInTimeRecovery` mode, and the upper layer will only ignore them
if the next WAL has consecutive seqnum (i.e., we are guaranteed no
hole).
While this solves the hole problem for the case of incomplete
records, the possibility is still there if the WAL is corrupted by
truncation to an exact record boundary. This PR also regresses how much data
can be recovered when writes are mixed with/without
`WriteOptions::disableWAL`, as then we can not distinguish between a
seqnum gap caused by corruption and a seqnum gap caused by a `disableWAL` write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7701
Test Plan:
Interestingly there already was a test for this case
(`DBWALTestWithParams.kPointInTimeRecovery`); it just had a typo bug in
the verification that prevented it from noticing holes in recovery.
Reviewed By: anand1976
Differential Revision: D25111765
Pulled By: ajkr
fbshipit-source-id: 5e330b13b1ee2b5be096cea9d0ff6075843e57b6
Summary:
The patch adds basic garbage collection support to the integrated BlobDB
implementation. Valid blobs residing in the oldest blob files are relocated
as they are encountered during compaction. The threshold that determines
which blob files qualify is computed based on the configuration option
`blob_garbage_collection_age_cutoff`, which was introduced in https://github.com/facebook/rocksdb/issues/7661 .
Once a blob is retrieved for the purposes of relocation, it passes through the
same logic that extracts large values to blob files in general. This means that
if, for instance, the size threshold for key-value separation (`min_blob_size`)
got changed or writing blob files got disabled altogether, it is possible for the
value to be moved back into the LSM tree. In particular, one way to re-inline
all blob values if needed would be to perform a full manual compaction with
`enable_blob_files` set to `false`, `enable_blob_garbage_collection` set to
`true`, and `blob_file_garbage_collection_age_cutoff` set to `1.0`.
Some TODOs that I plan to address in separate PRs:
1) We'll have to measure the amount of new garbage in each blob file and log
`BlobFileGarbage` entries as part of the compaction job's `VersionEdit`.
(For the time being, blob files are cleaned up solely based on the
`oldest_blob_file_number` relationships.)
2) When compression is used for blobs, the compression type hasn't changed,
and the blob still qualifies for being written to a blob file, we can simply copy
the compressed blob to the new file instead of going through decompression
and compression.
3) We need to update the formula for computing write amplification to account
for the amount of data read from blob files as part of GC.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7694
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D25069663
Pulled By: ltamasi
fbshipit-source-id: bdfa8feb09afcf5bca3b4eba2ba72ce2f15cd06a
Summary:
This PR updates `MemTable::Add()`, `MemTable::Update()`, and
`MemTable::UpdateCallback()` to return `Status` objects, and adapts the
client code in `MemTableInserter`. The goal is to prepare these
functions for key-value checksum, where we want to verify key-value
integrity while adding to memtable. After this PR, the memtable mutation
functions can report a failed integrity check by returning `Status::Corruption`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7656
Reviewed By: riversand963
Differential Revision: D24900497
Pulled By: ajkr
fbshipit-source-id: 1a7e80581e3774676f2bbba2f0a0b04890f40009
Summary:
Allow corruption_test to run on custom env loaded via
`Env::LoadEnv()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7699
Test Plan:
```
make corruption_test
./corruption_test
```
Also run on in-house custom env.
Reviewed By: zhichao-cao
Differential Revision: D25135525
Pulled By: riversand963
fbshipit-source-id: 7941e7ce342dc88ec2cd63e90f7674a2f57de6b7
Summary:
An empty WAL won't be backed up by the BackupEngine. So if we track the empty WALs in MANIFEST, then when restoring from a backup, it may report corruption that the empty WAL is missing, which is correct because the WAL is actually in the main DB but not in the backup DB, but missing an empty WAL does not logically break DB consistency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7697
Test Plan: watch existing tests to pass
Reviewed By: pdillinger
Differential Revision: D25077194
Pulled By: cheng-chang
fbshipit-source-id: 01917b57234b92b6063925f2ee9452c5732bdc03
Summary:
The logic for computing min_log_number_to_keep in atomic flush was incorrect.
For example, when all column families are flushed, the min_log_number_to_keep should be the latest new log. But the incorrect logic calls `PrecomputeMinLogNumberToKeepNon2PC` for each column family, and returns the minimum of them. However, `PrecomputeMinLogNumberToKeepNon2PC(cf)` assumes column families other than `cf` are flushed, but in case all column families are flushed, this assumption is incorrect.
Without this fix, the WAL referenced by the computed min_log_number_to_keep may actually contain no unflushed data, so the WAL might have actually been deleted from disk on recovery, then an incorrect error `Corruption: missing WAL` will be reported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7660
Test Plan:
run `make crash_test_with_atomic_flush` on devserver
added a unit test in `db_flush_test`
Reviewed By: riversand963
Differential Revision: D24906265
Pulled By: cheng-chang
fbshipit-source-id: 08deda62e71f67f59e3b7925cdd86dd09bd4f430
Summary:
In db_basic_test.cc, there are two tests that rely on the underlying
system's `LockFile` support to function correctly:
DBBasicTest.OpenWhenOpen and DBBasicTest.CheckLock. In both tests,
re-opening a db using `DB::Open` is expected to fail because the second
open cannot lock the LOCK file. Some distributed file systems, e.g. HDFS
do not support the POSIX-style file lock. Therefore, these unit tests will cause
assertion failure and the second `Open` will create a db instance.
Currently, these db instances are not closed after the assertion
failure. Since these db instances are registered with some process-wide, static
data structures, e.g. `PeriodicWorkScheduler::Default()`, they can still be
accessed after the unit tests. However, the `Env` object created for this db
instance is destroyed when the test finishes in `~DBTestBase()`. Consequently,
it causes illegal memory access.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7682
Test Plan:
Run the following on a distrubited file system:
```
make check
```
Reviewed By: anand1976
Differential Revision: D25004215
Pulled By: riversand963
fbshipit-source-id: f4327d7716c0e72b13bb43737ec9a5d156da4d52
Summary:
An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667
Reviewed By: riversand963
Differential Revision: D24933360
Pulled By: ajkr
fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f
Summary:
If fsync is disabled in a unit test, then do not track WAL in MANIFEST, because on DB recovery, the WAL might be missing because the directory is not fsynced.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7669
Test Plan: Tests with fsync enabled should pass.
Reviewed By: riversand963
Differential Revision: D24941431
Pulled By: cheng-chang
fbshipit-source-id: ab3ff0f90769795cfb4e4d6dcf084ea5545d1975
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)
Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.
### Benchmarking
```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 18.7508
Random filter net ns/op: 258.246
Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 58.4523
Random filter net ns/op: 363.717
Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```
168.166 / 238.488 = 0.705 -> 29.5% space reduction
130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction)
### Working around a hashing "flaw"
bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate. The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant. Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))
As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)
TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.
### Other related changes
* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658
Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.
Reviewed By: jay-zhuang
Differential Revision: D24899349
Pulled By: pdillinger
fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
Summary:
https://github.com/facebook/rocksdb/issues/7556 enables `CompactionIterator` to perform garbage collection during compaction according
to a lower bound (user-defined) timestamp `full_history_ts_low_`.
This PR adds a data member `full_history_ts_low_` of type `std::string` to `FlushJob`, and
`full_history_ts_low_` does not change during flush. `FlushJob` will pass a pointer to this data member
to the `CompactionIterator` used during flush.
Also refactored flush_job_test.cc to re-use some existing code, which is actually the majority of this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7655
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D24933340
Pulled By: riversand963
fbshipit-source-id: 2e584bfd0cf6e5c295ab1af264e68e9d6a12fca3
Summary:
https://github.com/facebook/rocksdb/pull/7461 accidentally broke
`InternalStats::DumpCFStats` by making `DumpCFFileHistogram` overwrite
the output of `DumpCFStatsNoFileHistogram` instead of appending to it,
resulting in only the file histogram related information getting logged.
The patch fixes this by reverting to appending in `DumpCFFileHistogram`.
Fixes https://github.com/facebook/rocksdb/issues/7664 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7666
Test Plan: Ran `make check` and checked the info log of `db_bench`.
Reviewed By: riversand963
Differential Revision: D24929051
Pulled By: ltamasi
fbshipit-source-id: 636a3d5ebb5ce23de4f3fe4f03ad3f16cb2858f8
Summary:
https://github.com/facebook/rocksdb/issues/7556 enables `CompactionIterator` to perform garbage collection during compaction according
to a lower bound (user-defined) timestamp `full_history_ts_low_`.
This PR adds a data member `full_history_ts_low_` of type `std::string` to `CompactionJob`, and
`full_history_ts_low_` does not change during compaction. `CompactionJob` will pass a pointer to this
data member to the `CompactionIterator` used during compaction.
Also refactored compaction_job_test.cc to re-use some existing code, which is actually the majority of this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7657
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D24913803
Pulled By: riversand963
fbshipit-source-id: 11ad5329ddac365667152e7b3b02f84182c0ca8e
Summary:
`CompactionProxy` is currently both a concrete class used for actual `Compaction`s
and a base class that `FakeCompaction` (which is used in `compaction_iterator_test`)
is derived from. This is bad from an OO design standpoint, and also results in
`FakeCompaction` containing an (uninitialized and unused) `Compaction*` member.
The patch fixes this by making `CompactionProxy` a pure interface and introducing
a separate concrete class `RealCompaction` for non-test/non-fake compactions. It
also removes an unused parameter from the virtual method `level`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7662
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24907680
Pulled By: ltamasi
fbshipit-source-id: c100ecb1beef4b0ada35e799116c5bda71719ee7
Summary:
Previously, even when `bottommost_compression_opts`'s `enabled` flag was set, it only took effect when
`bottommost_compression` was also set to something other than `kDisableCompressionOption`.
This wasn't documented and, if we kept the old behavior, it'd make
things complicated like the migration instructions in https://github.com/facebook/rocksdb/issues/7619. We can
simplify the API by making `bottommost_compression_opts` always take
effect when its `enabled` flag is set.
Fixes https://github.com/facebook/rocksdb/issues/7631.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7633
Reviewed By: ltamasi
Differential Revision: D24710358
Pulled By: ajkr
fbshipit-source-id: bbbdf9c1b53c63a4239d902cc3f5a11da1874647
Summary:
Added a few classes in the same class hierarchy to remove code duplication and
refactor the logic of reading and processing MANIFEST files.
New classes are as follows.
```
class VersionEditHandlerBase;
class ListColumnFamiliesHandler : VersionEditHandlerBase;
class FileChecksumRetriever : VersionEditHandlerBase;
class DumpManifestHandler : VersionEditHandler;
```
Classes that already existed before this PR are as follows.
```
class VersionEditHandler : VersionEditHandlerBase;
```
With these classes, refactored functions: `VersionSet::Recover()`,
`VersionSet::ListColumnFamilies()`, `VersionSet::DumpManifest()`,
`GetFileChecksumFromManifest()`.
Test Plan (devserver):
```
make check
COMPILE_WITH_ASAN=1 make check
```
These refactored code, especially recovery-related logic, will be tested intensively by
all existing unit tests and stress tests. For example, run
```
make crash_test
```
Verified 3 successful runs on devserver.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6581
Reviewed By: ajkr
Differential Revision: D20616217
Pulled By: riversand963
fbshipit-source-id: 048c7743aa4be2623ccd0cc3e61c0027e604e78b
Summary:
Since the hashes should not be persisted in output_validator
nor mock_env.
Also updated NPHash64 to use 64-bit seed, and comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7632
Test Plan:
make check, and new build setting that enables modification
to NPHash64, to check for behavior depending on specific values. Added
that setting to one of the CircleCI configurations.
Reviewed By: jay-zhuang
Differential Revision: D24833780
Pulled By: pdillinger
fbshipit-source-id: 02a57652ccf1ac105fbca79e77875bb7bf7c071f
Summary:
There is an undocumented behavior about a certain combination of options and operations.
- inplace_update_support = true, and
- call `SeekForPrev()`, `SeekToLast()`, and/or `Prev()` on unflushed data.
We should stop the backward iteration and report an error of `Status::NotSupported`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7618
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D24769619
Pulled By: riversand963
fbshipit-source-id: 81d199fa55ed4739ab10e719cc345a992238ccbb
Summary:
https://github.com/facebook/rocksdb/issues/7556 introduced support for compaction iterator to perform timestamp-aware garbage collection.
However, there was a bug. The comparison between `ikey_.user_key` and `current_user_key_` should happen
before `key_ = current_key_.SetInternalKey(key_, &ikey_);` (line 336 of compaction_iterator.cc).
Otherwise, after this line, `current_key_` is always the same as `ikey_.user_key`.
This PR also re-arranged the order of some data members because some of them are state variables of `CompactionIterator` while others are inputs from callers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7645
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D24845028
Pulled By: riversand963
fbshipit-source-id: c7e79914832701462b86867e8463cd463b6c0c25
Summary:
After replaying the WALs, the memtables are flushed synchronously to L0 instead of being flushed in background. Currently, we only track WAL obsoletion events in the code path of background flush jobs. This PR tracks these events in RecoverLogFiles.
After this change, we can enable `track_and_verify_wal_in_manifest` in `db_stress`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7649
Test Plan: `python tools/db_crashtest.py whitebox`
Reviewed By: riversand963
Differential Revision: D24824501
Pulled By: cheng-chang
fbshipit-source-id: 207129f7b845c50b333680ce6818a68a2fad54b9
Summary:
Consider the following sequence of events:
1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.
It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.
After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621
Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.
Reviewed By: riversand963
Differential Revision: D24668144
Pulled By: cheng-chang
fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
Summary:
When a WAL is synced, an edit is written to MANIFEST.
After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601
Test Plan:
`track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test.
Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated.
Watch all tests to pass.
Reviewed By: ltamasi
Differential Revision: D24553957
Pulled By: cheng-chang
fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040
Summary:
The test often times out in internal test infra.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7640
Test Plan: watch test to pass internally
Reviewed By: anand1976
Differential Revision: D24764928
Pulled By: cheng-chang
fbshipit-source-id: 587f2afc97f52909837943fd938a86ca94544b2c
Summary:
In this test, after flushing memtable, it will read directly from the sst files, so `env_do_fsync` was `true` to ensure that the flushed sst files can be read afterwards. Considering that the test does not last long, the data should be available in os buffer even without fsync, so this PR tries to disable fsync to reduce test time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7639
Test Plan: watch the test to pass in internal infra
Reviewed By: anand1976
Differential Revision: D24764689
Pulled By: cheng-chang
fbshipit-source-id: ef827611a3eaca04201e4280ae801d6c8e60c138
Summary:
This test often times out in internal test infra.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7637
Test Plan: watch test to pass
Reviewed By: ajkr
Differential Revision: D24763939
Pulled By: cheng-chang
fbshipit-source-id: 6564ee2ef637e9faf6688d4b6a5d74a72a51c5e8
Summary:
In `BuildTable()`, we call `builder->Finish()` before evaluating `builder->NeedCompact()`.
However, we call `builder->NeedCompact()` before `builder->Finish()` in compaction job. This can be wrong because the table properties collectors may rely on the success of `Finish()` to provide correct result for `NeedCompact()`.
Test plan (on devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7627
Reviewed By: ajkr
Differential Revision: D24728741
Pulled By: riversand963
fbshipit-source-id: 5a0dce244e14eb1106c4f87021e6bebca82b486e
Summary:
Existing API `VerifyChecksum()` allows application to verify sst files' block checksums.
Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new
API to verify sst files' file checksums.
```
// Compute table file checksums if applicable and compare with MANIFEST.
// Returns OK if no file has mismatching whole-file checksum.
Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/);
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D24436783
Pulled By: riversand963
fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3
Summary:
Recent test report shows that some tests have been skipped.
For DBWALTest that inherits from DBTestBase, the following will always be
true, since `env_` is an instance of `SpecialEnv`, not `Env::Default()`. Thus the test
will always be skipped.
```
if (options.env != Env::Default()) {
ROCKSDB_GTEST_SKIP("Test requires default environment");
return;
}
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7628
Test Plan:
./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
MEM_ENV=1 ./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
make check
Reviewed By: jay-zhuang
Differential Revision: D24693006
Pulled By: riversand963
fbshipit-source-id: 7f2a772492a0f11bff17bbf5e9f493e9e9a1c125
Summary:
The filter query key should not contain timestamp. The timestamp is
stripped for Get(), but not MultiGet().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589
Reviewed By: riversand963
Differential Revision: D24494661
Pulled By: jay-zhuang
fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317
Summary:
Test report shows that this test has been skipped recently due to
a condition that will never meet. `env_` is not equal to
`Env::Default()` for DBTest2 that inherits from DBTestBase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7629
Test Plan:
make check
./db_test2 --gtest_filter=DBTest2.PinnableSliceAndMmapReads
Reviewed By: jay-zhuang
Differential Revision: D24693317
Pulled By: riversand963
fbshipit-source-id: b1bbd5c1e05a6fa57c1de0d74462b69e3c2d5215
Summary:
In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it
for bottom level under the assumption that the vast majority of data is
stored there. At that time, there was no
such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just
hardcoded disabling dictionary compression in flush and compactions to
non-bottommost level. Now, we have users who generate all their files
through flush and are considering using dictionary compression.
To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to
additionally include flushed files and files generated by compaction to
a non-bottommost level. Users can still get the old behavior by moving
their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts`
and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619
Reviewed By: ltamasi
Differential Revision: D24665610
Pulled By: ajkr
fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D24579392
Pulled By: riversand963
fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
Summary:
CreateFileChecksumGenerator may uses requested_checksum_func_name in generator context to decide which generator will be used. GenerateOneFileChecksum has not being updated to use it, which will always get the generator when the name is empty. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7586
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D24491989
Pulled By: zhichao-cao
fbshipit-source-id: d9fdfdd431240f0a9a2e781ddbd48a7d6c609aad
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497
When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`
Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.
Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test
Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515
Reviewed By: ajkr
Differential Revision: D24240264
Pulled By: ramvadiv
fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
Summary:
Currently, the following interleaving of events can lead to SuperVersion containing both immutable memtables as well as the resulting L0. This can cause Get to return incorrect result if there are merge operands. This may also affect other operations such as single deletes.
```
time main_thr bg_flush_thr bg_compact_thr compact_thr set_opts_thr
0 | WriteManifest:0
1 | issue compact
2 | wait
3 | Merge(counter)
4 | issue flush
5 | wait
6 | WriteManifest:1
7 | wake up
8 | write manifest
9 | wake up
10 | Get(counter)
11 | remove imm
V
```
The reason behind is that: one bg flush thread's installing new `Version` can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from current super version only after `LogAndApply` returns. After the leader MANIFEST writer signals (releasing mutex) this bg flush thread, it is possible that another thread sees this cf with both memtables (whose data have been flushed to the newest L0) and the L0 before this bg flush thread removes the memtables.
To address this issue, each bg flush thread can pass a callback function to `LogAndApply`. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex.
Test plan (devserver)
```
$make merge_test
$./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6069
Reviewed By: cheng-chang
Differential Revision: D18790894
Pulled By: riversand963
fbshipit-source-id: e41bd600c0448b4f4b2deb3f7677f95e3076b4ed
Summary:
Similarly to how https://github.com/facebook/rocksdb/issues/7345
integrated blob file writing into the flush process,
the patch adds support for writing blob files to the compaction logic.
Namely, if `enable_blob_files` is set, large values encountered during
compaction are extracted to blob files and replaced with blob indexes.
The resulting blob files are then logged to the MANIFEST as part of the
compaction job's `VersionEdit` and added to the `Version` alongside any
table files written by the compaction. Any errors during blob file building fail
the compaction job.
There will be a separate follow-up patch to perform blob garbage collection
during compactions.
In addition, the patch continues to chip away at the mess around computing
various compaction related statistics by eliminating some code duplication
and by making the `num_output_files` and `bytes_written` stats more consistent
for flushes, compactions, and recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7573
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24404696
Pulled By: ltamasi
fbshipit-source-id: 21216af3a172ad3ce8f85d11cd30923784ae426c
Summary:
Add a threshold timestamp, full_history_ts_low_ of type `std::string*` to
`CompactionIterator`, so that RocksDB can also perform garbage collection during
compaction.
* If full_history_ts_low_ is nullptr, then compaction iterator does not perform
GC, preserving all timestamp history for all keys. Compaction iterator will
treat user key with different timestamps as different user keys.
* If full_history_ts_low_ is not nullptr, then compaction iterator performs
GC. GC will look at keys older than `*full_history_ts_low_` and determine their
eligibility based on factors including snapshots.
Current rules of GC:
* If an internal key is in the same snapshot as a previous counterpart
with the same user key, and this key is eligible for GC, and the key is
not single-delete or merge operand, then this key can be dropped. Note
that the previous internal key cannot be a merge operand either.
* If a tombstone is the most recent one in the earliest snapshot and it
is eligible for GC, and keyNotExistsBeyondLevel() is true, then this
tombstone can be dropped.
* If a tombstone is the most recent one in a snapshot and it is eligible
for GC, and the compaction is at bottommost level, then all other older
internal keys of the same user key must also be eligible for GC, thus
can be dropped
* Single-delete, delete-range and merge are not currently supported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7556
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D24507728
Pulled By: riversand963
fbshipit-source-id: 3c09c7301f41eed76dfcf4d1527e68cf6e0a8bb3
Summary:
This PR makes it able to `LogAndApply` `VersionEdit`s related to WALs, and also be able to `Recover` from MANIFEST with WAL related `VersionEdit`s.
The `VersionEdit`s related to WAL are treated similarly as those related to column family operations, they are not applied to versions, but can be in a commit group. Mixing WAL related `VersionEdit`s with other types of edits will make logic in `ProcessManifestWrite` more complicated, so `VersionEdit`s related to WAL can either be WAL additions or deletions, like column family add and drop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7256
Test Plan: a set of unit tests are added in `version_set_test.cc`
Reviewed By: riversand963
Differential Revision: D23123238
Pulled By: cheng-chang
fbshipit-source-id: 246be2ed4744fd03fa2738aba408aaa611d0379c
Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D24485420
Pulled By: zhichao-cao
fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
Summary:
Remove function calling in assert statement as assert is a no
op in opt build and that function might not be called. This causes hang
in closing RocksDB when refit level is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7581
Test Plan: make check -j64
Reviewed By: riversand963
Differential Revision: D24466420
Pulled By: akankshamahajan15
fbshipit-source-id: 97db4ec5a95ae693c3290e176a3c12a9b1ad2f6d
Summary:
This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479
Test Plan: Add new unit tests
Reviewed By: pdillinger
Differential Revision: D24340671
Pulled By: anand1976
fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.
TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24260219
Pulled By: ltamasi
fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist. Added tests for the LoadOptions related methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554
Reviewed By: akankshamahajan15
Differential Revision: D24298985
Pulled By: zhichao-cao
fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
Summary:
Added unit tests that have paranoid_check = true that perform range deletions. At the moment, the deleted ranges do not appear to be checked as part of the paranoid checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7521
Reviewed By: zhichao-cao
Differential Revision: D24262175
Pulled By: ajkr
fbshipit-source-id: 1035e968f7ab8ccaa7af086b835a4e72c7e56743
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520
Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`
Reviewed By: pdillinger
Differential Revision: D24200034
Pulled By: ajkr
fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
Summary:
This option determines whether WALs will be tracked in MANIFEST and verified on recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7275
Test Plan:
db_options_test
options_test
Reviewed By: pdillinger
Differential Revision: D23181418
Pulled By: cheng-chang
fbshipit-source-id: 5dd1cdc166f3dfc1c93c094df4a2f7734e3b4547
Summary:
The `std::pair(const T1& x, const T2& y);` constructor is `constexpr`
only starting from C++14; relying on this breaks compilation on certain
compilers/platforms we need to support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7519
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D24195747
Pulled By: ltamasi
fbshipit-source-id: 665e8fbc9747675bb49c5d895aad3dcf2714750f
Summary:
The patch does some cleanup in and around the legacy `BlobLogReader` class:
* It renames the class to `BlobLogSequentialReader` to emphasize that it is for
sequentially iterating through blobs in a blob file, as opposed to doing random
point reads using `BlobIndex`es (which is `BlobFileReader`'s jurisdiction).
* It removes some dead code from the old BlobDB implementation that references
`BlobLogReader` (namely the method `BlobFile::OpenRandomAccessReader`).
* It cleans up some `#include`s and forward declarations.
* It fixes some incorrect/outdated comments related to the reader class.
* It adds a few assertions to the `Read` methods of the class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7517
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24172611
Pulled By: ltamasi
fbshipit-source-id: 43e2ae1eba5c3dd30c1070cb00f217edc45bd64f
Summary:
The patch adds a class called `BlobFileReader` that can be used to retrieve blobs
using the information available in blob references (e.g. blob file number, offset, and
size). This will come in handy when implementing blob support for `Get`, `MultiGet`,
and iterators, and also for compaction/garbage collection.
When a `BlobFileReader` object is created (using the factory method `Create`),
it first checks whether the specified file is potentially valid by comparing the file
size against the combined size of the blob file header and footer (files smaller than
the threshold are considered malformed). Then, it opens the file, and reads and verifies
the header and footer. The verification involves magic number/CRC checks
as well as checking for unexpected header/footer fields, e.g. incorrect column family ID
or TTL blob files.
Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression
type passed by the caller (because of the presence of the header and footer, the
specified offset cannot be too close to the start/end of the file; also, the compression type
has to match the one in the blob file header), and retrieves and potentially verifies and
uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set,
`BlobFileReader` reads the blob record header as well (as opposed to just the blob itself)
and verifies the key/value size, the key itself, as well as the CRC of the blob record header
and the key/value pair.
In addition, the patch exposes the compression type from `BlobIndex` (both using an
accessor and via `DebugString`), and adds a blob file read latency histogram to
`InternalStats` that can be used with `BlobFileReader`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23999219
Pulled By: ltamasi
fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
1. Number of index and filter blocks read from file as part of MultiGet
request per level.
2. Number of data blocks read from file per level.
3. Number of SST files loaded from file system per level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366
Reviewed By: anand1976
Differential Revision: D24127040
Pulled By: akankshamahajan15
fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
Summary:
If `BottommostLevelCompaction.kForce*` is set, compaction should avoid
trivial move and always compact the sst to the target size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7368
Reviewed By: ajkr
Differential Revision: D23629525
Pulled By: jay-zhuang
fbshipit-source-id: 79f23c79ecb31587e0593b28cce43131107bbcd0
Summary:
In opt mode, assertions are just no-ops. Therefore, we need to report errors instead of just doing an `assert(false)`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7483
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D24142725
Pulled By: riversand963
fbshipit-source-id: 5629556dbe29f00dd09e30a7d5df5e6cf09ee435
Summary:
`BeginWriteStall()` removes no_slowdown write from the write
list and updates `link_newer`, which makes `CreateMissingNewerLinks()`
thought all write list has valid `link_newer` and failed to create link
for all writers.
It caused flaky test and SegFault for release build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7508
Test Plan: Add unittest to reproduce the issue.
Reviewed By: anand1976
Differential Revision: D24126601
Pulled By: jay-zhuang
fbshipit-source-id: f8ac5dba653f7ee1b0950296427d4f5f8ee34a06
Summary:
Fix prefix_test so that it passes when ASSERT_STATUS_CHECKED=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7495
Test Plan: Run the test with the option
Reviewed By: anand1976
Differential Revision: D24069715
fbshipit-source-id: 54f74b58575a1b49dbdee9ea2d24751fa956b620
Summary:
This PR schedules a background thread (shared across all DB instances)
to flush info log every ten seconds. This improves debuggability in case
of RocksDB hanging since it ensures the log messages leading up to the hang
will eventually become visible in the log.
The bulk of this PR is moving monitoring/stats_dump_scheduler* to db/periodic_work_scheduler*
and making the corresponding name changes since now the scheduler handles info
log flushing, not just stats dumping.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7488
Reviewed By: riversand963
Differential Revision: D24065165
Pulled By: ajkr
fbshipit-source-id: 339c47a0ff43b79fdbd055fbd9fefbb6f9d8d3b5
Summary:
Add all status handling in db_properties_test so that it can pass ASSERT_STATUS_CHECKED.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7490
Test Plan: Run the test with ASSERT_STATUS_CHECKED
Reviewed By: jay-zhuang
Differential Revision: D24065382
fbshipit-source-id: e008916155196891478c964df0226545308ca71d
Summary:
This exposes to the listener interface whether a compaction was
full or not. Also cleaned up API comment for CompactionJobInfo::stats,
which is not of a nullable type. And since CompactionJob is always
created with non-null CompactionJobStats, removed conditionals on it
being nullptr and instead assert non-null.
TODO later: update C and Java interfaces
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451
Test Plan: updated existing unit tests to check new field, make check
Reviewed By: ltamasi
Differential Revision: D23977796
Pulled By: pdillinger
fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d
Summary:
`DBTest.FileCreationRandomFailure` frequently times out during our
continuous test runs. (It's a case of "stress test posing as unit test.")
The patch reduces the number of iterations to avoid this. Note that
the lower numbers are still sufficient to trigger both flushes and
compactions, so test coverage is still the same.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7481
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24034712
Pulled By: ltamasi
fbshipit-source-id: 8731a9446e5a121a1041b00f0df473b9f714935a
Summary:
Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated.
Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467
Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled.
Reviewed By: riversand963
Differential Revision: D24010683
fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c
Summary:
This PR addresses some build and functional issues on MSVC targets, as a step towards an eventual goal of having RocksDB build successfully for Windows on ARM64.
Addressed issues include:
- BitsSetToOne and CountTrailingZeroBits do not compile on non-x64 MSVC targets. A fallback implementation of BitsSetToOne when Intel intrinsics are not available is added, based on the C++20 `<bit>` popcount implementation in Microsoft's STL.
- The implementation of FloorLog2 for MSVC targets (including x64) gives incorrect results. The unit test easily detects this, but CircleCI is currently configured to only run a specific set of tests for Windows CMake builds, so this seems to have been unnoticed.
- AsmVolatilePause does not use YieldProcessor on Windows ARM64 targets, even though it is available.
- When CondVar::TimedWait calls Microsoft STL's condition_variable::wait_for, it can potentially trigger a bug (just recently fixed in the upcoming VS 16.8's STL) that deadlocks various tests that wait for a timer to execute, since `Timer::Run` doesn't get a chance to execute before being blocked by the test function acquiring the mutex.
- In c_test, `GetTempDir` assumes a POSIX-style temp path.
- `NormalizePath` did not eliminate consecutive POSIX-style path separators on Windows, resulting in test failures in e.g., wal_manager_test.
- Various other test failures.
In a followup PR I hope to modify CircleCI's config.yml to invoke all RocksDB unit tests in Windows CMake builds with CTest, instead of the current use of `run_ci_db_test.ps1` which requires individual tests to be specified and is missing many of the existing tests.
Notes from peterd: FloorLog2 is not yet used in production code (it's for something in progress). I also added a few more inexpensive platform-dependent tests to Windows CircleCI runs. And included facebook/folly#1461 as requested
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7439
Reviewed By: jay-zhuang
Differential Revision: D24021563
Pulled By: pdillinger
fbshipit-source-id: 0ec2027c0d6a494d8a0fe38d9667fc2f7e29f7e7
Summary:
Do not assert the number of files after intra-L0 compaction is eligible to run since it could complete (and reduce the number of files) before the assertion executes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7477
Reviewed By: pdillinger
Differential Revision: D24032049
Pulled By: ajkr
fbshipit-source-id: e838ac7a24651ebd643b9e5a9d39d2e789c46929
Summary:
This has been running in production on some key workloads, so
we believe it to be safe and extremely low cost. Nevertheless, I've
added code to ensure that "force_consistency_checks" is mentioned in
any corruption reports so that people know how to disable in case of
false positive corruption reports.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7446
Test Plan:
make check, CI, temporary debug print new message with
./version_builder_test
Reviewed By: ajkr
Differential Revision: D23972101
Pulled By: pdillinger
fbshipit-source-id: 9623e400f3752577c0ecf977e6d0915562cf9968
Summary:
Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data.
By default value is set false to prevent users data to be exposed in the messages.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420
Test Plan:
1. make check -j64
2. Add a new test case
Reviewed By: ajkr
Differential Revision: D23835028
Pulled By: akankshamahajan15
fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1
Summary:
The assertion checks that there is no overlap in sequence numbers across levels in universal compaction. However, this assumption doesn't hold when there is a delete triggered compaction or a trivial move, as they operate on a subset of a level.
Tests -
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7421
Reviewed By: ajkr
Differential Revision: D23872672
Pulled By: anand1976
fbshipit-source-id: c386deab8e01a5746ca996ff1f4ebcae3b15b7d2
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452
Test Plan: See CI tests pass.
Reviewed By: zhichao-cao
Differential Revision: D23979764
fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
Summary:
A generic algorithm in progress depends on a templatized
version of fastrange, so this change generalizes it and renames
it to fit our style guidelines, FastRange32, FastRange64, and now
FastRangeGeneric.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436
Test Plan: added a few more test cases
Reviewed By: jay-zhuang
Differential Revision: D23958153
Pulled By: pdillinger
fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8
Summary:
Possible fix for a TSAN issue reported in EnableFileDeletions.
disable_delete_obsolete_files_ should only be accessed holding the db
mutex, but for logging it was being accessed outside holding the mutex,
now fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7435
Test Plan: existing tests, watch for recurrence
Reviewed By: ltamasi
Differential Revision: D23917578
Pulled By: pdillinger
fbshipit-source-id: 8573025bca3f6fe169b24b87bbfc4ce9667b0482
Summary:
Add a method `CheckWals` in `WalSet` to check the logs on disk. See `CheckWals`'s comments.
This method will be used to check consistency of WALs during DB recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7236
Test Plan: a set of tests are added to wal_edit_test.cc.
Reviewed By: riversand963
Differential Revision: D23036505
Pulled By: cheng-chang
fbshipit-source-id: 5b1d6857ac173429b00f950c32c4a5b8d063a732
Summary:
Fix few test cases and add them in ASSERT_STATUS_CHECKED build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7427
Test Plan:
1. ASSERT_STATUS_CHECKED=1 make -j48 check,
2. travis build for ASSERT_STATUS_CHECKED,
3. Without ASSERT_STATUS_CHECKED: make check -j64, CircleCI build and travis build
Reviewed By: pdillinger
Differential Revision: D23909983
Pulled By: akankshamahajan15
fbshipit-source-id: 42d7e4aea972acb9fcddb7ca73fcb82f93272434
Summary:
Implement a parsing tool io_tracer_parser that takes IO trace file (binary file) with command line argument --io_trace_file and output file with --output_file and dumps the IO trace records in outputfile in human readable form.
Also added unit test cases that generates IO trace records and calls io_tracer_parse to parse those records.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7333
Test Plan:
make check -j64,
Add unit test cases.
Reviewed By: anand1976
Differential Revision: D23772360
Pulled By: akankshamahajan15
fbshipit-source-id: 9c20519c189362e6663352d08863326f3e496271
Summary:
This option is apparently used by some teams within Facebook
(internal ref T75998621)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7431
Test Plan: USE_CLANG=1 make check before (fails) and after
Reviewed By: jay-zhuang
Differential Revision: D23876584
Pulled By: pdillinger
fbshipit-source-id: abb8b67a1f1aac75327944d266e284b2b6727191
Summary:
There are some tricky behaviors related to WAL sync:
- When creating a WAL, the WAL might not be synced, if the WAL directory is not synced, the WAL file's metadata may not even be synced to disk, so during recovery, when listing the WAL directory, the WAL may not even show up.
- During each DB::Write, the WriteOption can control whether the WAL should be synced, so a WAL previously not synced on creation can be synced during Write.
For each `SyncWAL`, we'll track the synced status and the current WAL size. Previously, we only track the WAL size on closing.
During recovery, we check that the on-disk WAL size is >= the last synced size.
So this PR introduces `synced_size` and `closed` to `WalMetadata` for the above design update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7414
Test Plan:
- updated wal_edit_test
- updated version_edit_test
Reviewed By: riversand963
Differential Revision: D23796127
Pulled By: cheng-chang
fbshipit-source-id: 5498ab80f537c48a10157e71a4745716aef5cf30
Summary:
Current implementation holds db mutex while calling
`GetAggregatedIntProperty()`. For property kEstimateTableReadersMem,
this can be expensive, especially if the number of table readers is
high.
We can release and re-acquire db mutex if
property_info.need_out_of_mutex is true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7412
Test Plan:
make check
COMPILE_WITH_ASAN=1 make check
COMPILE_WITH_TSAN=1 make check
Also test internally on a shadow host. Used bpf to verify the
excessively long db mutex holding no longer exists when applications
call GetApproximateMemoryUsageByType().
Reviewed By: jay-zhuang
Differential Revision: D23794824
Pulled By: riversand963
fbshipit-source-id: 6bc02a59fd25613d343a62cf817467c7122c9721
Summary:
Make the test robust to spurious wakeups on condition variable,
and clear sync points to ensure no use-after-free.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7418
Test Plan: repeated runs on updated test, watch CircleCI for recurrence
Reviewed By: jay-zhuang
Differential Revision: D23828823
Pulled By: pdillinger
fbshipit-source-id: af85117d9c02602541a90252840e0e5a6996de5b
Summary:
Fix the flaky test failure in error_handler_fs_test. Add the sync point, solve the dependency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7415
Test Plan: make asan_check, ~/gtest-parallel/gtest-parallel -r 100 ./error_handler_fs_test
Reviewed By: siying
Differential Revision: D23804330
Pulled By: zhichao-cao
fbshipit-source-id: 5175108651f7652e47e15978f2a9c1669ef59d80
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310
Test Plan: adding new unit test, pass make check.
Reviewed By: anand1976
Differential Revision: D23710892
Pulled By: zhichao-cao
fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.
This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.
We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).
We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.
Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`
Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.
Actual file name patterns and priorities, as regexes:
kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
[0-9]+_[0-9]+_[0-9]+[.]sst
kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
[0-9]+_[0-9a-fA-F-]+[.]sst
kUseDbSessionId AND NOT kFlagIncludeFileSize ->
[0-9]+_s[0-9A-Z]{20}[.]sst
kUseDbSessionId AND kFlagIncludeFileSize (default) ->
[0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst
We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.
This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400
Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.
Reviewed By: ajkr
Differential Revision: D23759587
Pulled By: pdillinger
fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
Summary:
`IntraL0CompactionAfterFlushCheckConsistencyFail` was flaky by sometimes failing due to no intra-L0 compactions happening. I was able to repro it by putting a `sleep(1)` in the compaction thread before it grabs the lock and picks a compaction. This also showed other intra-L0 tests are affected too, although some of them exhibit hanging forever rather than failing.
The problem was that all the flushes/ingestions could finish before any compaction got picked, so it would end up simply picking all the files that the test generates for L0->L1. But, these tests intend only the first few files to be picked for L0->L1, and the subsequent files to be picked for intra-L0. This PR adjusts the sync points of all the intra-L0 tests to enforce this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7382
Test Plan: run all the `db_compaction_test`s with and without the artificial `sleep()`
Reviewed By: jay-zhuang
Differential Revision: D23684985
Pulled By: ajkr
fbshipit-source-id: 6508399030dddec7738e9853a7b3dc53ef77a584
Summary:
The patch adds support for extracting large values into blob files when
performing a flush during recovery (when `avoid_flush_during_recovery` is
`false`). Blob files are built and added to the `Version` similarly to flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7388
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23709912
Pulled By: ltamasi
fbshipit-source-id: ce48b4227849cf25429ae98574e72b0e1cb9c67d
Summary:
Cleaned up the public API to use the EncryptedEnv. This change will allow providers to be developed and added to the system easier in the future. It will also allow better integration in the future with the OPTIONS file.
- The internal classes were moved out of the public API into an internal "env_encryption_ctr.h" header. Short-cut constructors were added to provide the original API functionality.
- The APIs to the constructors were changed to take shared_ptr, rather than raw pointers or references to allow better memory management and alternative implementations.
- CreateFromString methods were added to allow future expansion to other provider and cipher implementations through a standard API.
Additionally, there was a code duplication in the NewXXXFile methods. This common code was moved under a templatized function.
A first-pass at structuring the code was made to potentially allow multiple EncryptionProviders in a single EncryptedEnv. The idea was that different providers may use different cipher keys or different versions/algorithms. The EncryptedEnv should have some means of picking different providers based on information. The groundwork was started for this (the use of the provider_ member variable was localized) but the work has not been completed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7279
Reviewed By: jay-zhuang
Differential Revision: D23709440
Pulled By: zhichao-cao
fbshipit-source-id: 0e845fff0e03a52603eb9672b4ade32d063ff2f2
Summary:
The patch adds support for writing blob files during flush by integrating
`BlobFileBuilder` with the flush logic, most importantly, `BuildTable` and
`CompactionIterator`. If `enable_blob_files` is set, large values are extracted
to blob files and replaced with references. The resulting blob files are then
logged to the MANIFEST as part of the flush job's `VersionEdit` and
added to the `Version`, similarly to table files. Errors related to writing
blob files fail the flush, and any blob files written by such jobs are immediately
deleted (again, similarly to how SST files are handled). In addition, the patch
extends the logging and statistics around flushes to account for the presence
of blob files (e.g. `InternalStats::CompactionStats::bytes_written`, which is
used for calculating write amplification, now considers the blob files as well).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7345
Test Plan: Tested using `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D23506369
Pulled By: ltamasi
fbshipit-source-id: 646885f22dfbe063f650d38a1fedc132f499a159
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
In a distributed file system, directory ownership is enforced by fencing
off the previous owner once they've been preempted by a new owner. This
PR adds a IOStatus subcode for ```StatusCode::IOError``` to indicate this.
Once this error is returned for a file write, the DB is put in read-only
mode and not allowed to resume in read-write mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7374
Test Plan: Add new unit tests in ```error_handler_fs_test```
Reviewed By: riversand963
Differential Revision: D23687777
Pulled By: anand1976
fbshipit-source-id: bef948642089dc0af399057864d9a8ca339e8b2f
Summary:
During bottommost compaction, RocksDB cannot simply drop a tombstone if
this tombstone is not in the earliest snapshot. The current behavior is: RocksDB
skips other internal keys (of the same user key) in the same snapshot range. In
the meantime, RocksDB should check for the `shutting_down` flag. Otherwise, it
is possible for a bottommost compaction that has already started running to take
a long time to finish, even if the application has tried to cancel all background jobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7356
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D23663241
Pulled By: riversand963
fbshipit-source-id: 25f8e9b51bc3bfa3353cdf87557800f9d90ee0b5
Summary:
We've seen some segfaults in db_write_test, with at least one
suggesting corruption of a write group linked list. Adding an assertion
to have this fail in a more specific way if that is the broken
invariant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7375
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D23638477
Pulled By: pdillinger
fbshipit-source-id: a76fd677cad60a3a516bd363947bfd9ce418edc1
Summary:
https://github.com/facebook/rocksdb/issues/3341 guaranteed that upon return of `GetSortedWalFiles` after
`DisableFileDeletions`, all pending purges of previously obsolete WAL
files will have finished. However, the addition of
avoid_unnecessary_blocking_io in https://github.com/facebook/rocksdb/issues/5043 opened a hole in the code making
that assurance, which can lead to files to be copied for checkpoint or
backup going missing before being copied, with that option enabled.
This change patches the hole.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7369
Test Plan:
apparent fix to backups in crash test observed. Will work
on a unit test for another commit
Reviewed By: ajkr
Differential Revision: D23620258
Pulled By: pdillinger
fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614
Summary:
When multiple background jobs are generating blob files in parallel, it is actually
possible for a blob file to be added with a file number that is lower than the
highest one in the base version. (This is a harmless race condition.) The patch
fixes the handling of this case and adds a unit test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7349
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23542453
Pulled By: ltamasi
fbshipit-source-id: 4ff6f3654bc58c391d10b9870e1cc40b5e3fa8e4
Summary:
Replace FSRandomRWFile pointer with FSRandomRWFilePtr object in the rocksdb internal code.
This new object wraps FSRandomRWFile pointer.
Objective: If tracing is enabled, FSRandomRWFile object returns FSRandomRWFileTracingWrapper pointer that includes all necessary information in IORecord and calls underlying FileSystem and invokes IOTracer to dump that record in a binary file. If tracing is disabled then, underlying FileSystem pointer is returned directly.
FSRandomRWFilePtr wrapper class is added to bypass the FSRandomRWFileWrapper when
tracing is disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7198
Test Plan: make check -j64
Reviewed By: anand1976
Differential Revision: D23421116
Pulled By: akankshamahajan15
fbshipit-source-id: 8a5ba0e7d9c1ba34c3a6f29829b107c5f09ab6a3
Summary:
Replace FSWritableFile pointer with FSWritableFilePtr
object in WritableFileWriter.
This new object wraps FSWritableFile pointer.
Objective: If tracing is enabled, FSWritableFile Ptr returns
FSWritableFileTracingWrapper pointer that includes all necessary
information in IORecord and calls underlying FileSystem and invokes
IOTracer to dump that record in a binary file. If tracing is disabled
then, underlying FileSystem pointer is returned directly.
FSWritableFilePtr wrapper class is added to bypass the
FSWritableFileWrapper when
tracing is disabled.
Test Plan: make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193
Reviewed By: anand1976
Differential Revision: D23355915
Pulled By: akankshamahajan15
fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf
Summary:
The patch cleans up a few things in `CompactionJob::SubcompactionState`:
* Instead of using both the member initializer list and in-class initializers (and
sometimes both at the same time for the same member), the struct now uniformly
uses the latter to initialize integer members.
* The default parameter value for the constructor parameter `size` is removed.
* The explicitly deleted copy operations are removed, since they are implicitly deleted
anyways because of the `unique_ptr` members.
* The handwritten move operations, which did not move the member `c_iter` and
were not declared `nothrow`, are removed. Note that with the user-declared copy
operations gone (see the previous item), we can rely on the compiler to (correctly)
generate these methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7322
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D23382408
Pulled By: ltamasi
fbshipit-source-id: a4ae5af150161c50ff7bdc07fa145482d0150bfe
Summary:
Currently, application may pass a statistics object to db but later
wants to reduce stats tracking overhead by setting stats level to
kExceptHistogramOrTimers (the current lowest level). Tickers will still
be incremented, causing up to 1% CPU. We can add a new lowest stats
level `kExceptTickers` to disable ticker incrementing as well, thus
reducing CPU cycles spent on tickers.
Test Plan (devserver):
```
make check
make clean
DEBUG_LEVEL=0 make db_bench
./db_bench -perf_level=1 -stats_level=0 -statistics -benchmarks=fillseq,readrandom -duration=120
```
Measure CPU util (%) before and after change:
CPU util by rocksdb::RecordTick: 1.1 vs (<0.1)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7329
Reviewed By: pdillinger
Differential Revision: D23434014
Pulled By: riversand963
fbshipit-source-id: 72ff0f02a192ac476d4b0044b9f37fd4a22ff0d4
Summary:
This PR creates `rocksdb_open_column_families_with_ttl` which allows C API users to open a DBWithTLL with column families.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7314
Reviewed By: cheng-chang
Differential Revision: D23430287
Pulled By: ajkr
fbshipit-source-id: 307aa21d170d1402653263a91f6f832ef76afba0
Summary:
- Closes https://github.com/facebook/rocksdb/issues/6490
- Currently MERGEs are converted to PUTs at bottom or compaction has reached the beginning of the key, this can wrongly cover a PUT future base case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166
Test Plan:
- Automated: `make all check`
- Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then run ingesting external files to make sure the base case is probably compacted with existing Merges.
Reviewed By: cheng-chang
Differential Revision: D23325425
Pulled By: ajkr
fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68
Summary:
L0 score is based on size target and number of files. The size target
used is `max_bytes_for_level_base`. However, the base level's size can
dynamically expand in write burst mode. In fact, it can expand so much
that L0->Lbase becomes the highest fanout in target sizes. This doesn't
make sense from an efficiency perspective, so this PR bounds the
L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based
on file count remains unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325
Test Plan:
contrived benchmark that exhibits the problem:
```
$ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000
```
Results:
- "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark
- "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction
Branch | Burst W-Amp | Total W-Amp | fillrandom (MB/s)
-- | -- | -- | --
master | 20.2 | 21.5 | 4.7
dynamic-l0-score | 12.6 | 14.1 | 7.2
Reviewed By: siying
Differential Revision: D23412935
Pulled By: ajkr
fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e
Summary:
The patch adds a log message to `BlobFileBuilder` that is logged upon
generating a blob file, similarly to how we log the generation of table files
during flush and compaction. The log message contains the column family
name, job id, blob file number, and the number and total size of blobs in
the new file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7324
Test Plan: Ran `make check` and checked the actual log messages using a custom `db_bench`.
Reviewed By: riversand963
Differential Revision: D23402229
Pulled By: ltamasi
fbshipit-source-id: ca42beb4db284b783d1eb2651f321032a45d0c5f
Summary:
Add a unit test case to check memory usage when
max_write_buffer_size_to_maintain is set if flushed immutable memtables are
trimmed timely or not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7311
Test Plan: Compared the results with before bug fix.
Reviewed By: ltamasi
Differential Revision: D23321702
Pulled By: akankshamahajan15
fbshipit-source-id: da04ee21137d641a07fd499a9e2749eb036fcb1e
Summary:
The patch adds a class called `BlobFileBuilder` that can be used to build
and cut blob files in background jobs (flushes/compactions). The class
enforces a value size threshold (`min_blob_size`; smaller blobs will be inlined
in the LSM tree itself), and supports specifying a blob file size limit (`blob_file_size`),
as well as compression (`blob_compression_type`) and checksums for blob files.
It also keeps track of the generated blob files and their associated `BlobFileAddition`
metadata, which can be applied as part of the background job's `VersionEdit`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7306
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23298817
Pulled By: ltamasi
fbshipit-source-id: 38f35d81dab1ba81f15236240612ec173d7f21b5
Summary:
Replace FSRandomAccessFile pointer with FSRandomAccessFilePtr
object in RandomAccessFileReader.
This new object wraps FSRandomAccessFile pointer.
Objective: If tracing is enabled, FSRandomAccessFile Ptr returns
FSRandomAccessFileTracingWrapper pointer that includes all necessary
information in IORecord and calls underlying FileSystem and invokes
IOTracer to dump that record in a binary file. If tracing is disabled
then, underlying FileSystem pointer is returned directly.
FSRandomAccessFilePtr wrapper class is added to bypass the FSRandomAccessFileWrapper when
tracing is disabled.
Test Plan: make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7192
Reviewed By: anand1976
Differential Revision: D23356867
Pulled By: akankshamahajan15
fbshipit-source-id: 48f31168166a17a7444b40be44a9a9d4a5c7182c
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.
This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.
Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309
Test Plan: updated unit test, before and after
Reviewed By: ajkr
Differential Revision: D23311994
Pulled By: pdillinger
fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
Summary:
Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297
Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it.
Reviewed By: anand1976
Differential Revision: D23257950
fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
Summary:
DBBasicTest.CompactBetweenSnapshots can time-out in some slow-I/O hosts. Parameterize it so that single test runs shorter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7301
Test Plan: Run the test and see see different runs are of different configerations in a hacky way.
Reviewed By: ltamasi
Differential Revision: D23277733
fbshipit-source-id: 1f717b4131322d175abf9e211131fe7e9b1ef758
Summary:
When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108
Test Plan: make check, listener_test.
Reviewed By: ajkr
Differential Revision: D22470240
Pulled By: zhichao-cao
fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
Summary:
After releasing a snapshot, it checks whether it is suitable to trigger bottom compactions.
When disabling auto compactions, it may still schedule compaction when releasing a snapshot. Whereas no compaction job will be actually handled, so the state of LSM is not changed and compaction will be triggered again and again every time releasing a snapshot.
Too frequent compactions lead to high CPU usage and high db_mutex lock contention which affects foreground write duration finally.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7267
Test Plan:
- make check
- manual test
Reviewed By: akankshamahajan15
Differential Revision: D23252880
Pulled By: ajkr
fbshipit-source-id: 4431e071a35d9912a2a3592875db27bae521434b
Summary:
More tests now pass. When in doubt, I added a TODO comment to check what should happen with an ignored error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7305
Reviewed By: akankshamahajan15
Differential Revision: D23301262
Pulled By: ajkr
fbshipit-source-id: 5f120edc7393560aefc0633250277bbc7e8de9e6
Summary:
Some ExternalSSTFileTest runs very long on some places. Disable fsync in some tests to speed them up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7303
Test Plan: Run these tests.
Reviewed By: riversand963
Differential Revision: D23280261
fbshipit-source-id: 0dca862e462f9e6d807f393320a1f82aa5b87e59
Summary:
When a memtable is trimmed in MemTableListVersion, the memtable
is only added to delete list if it is
the last reference. However it is not the last reference as it is held
by the super version. But the super version would not be switched if the
delete list is empty. So the memtable is never destroyed and memory
usage increases beyond write_buffer_size +
max_write_buffer_size_to_maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296
Test Plan:
1. ./db_bench -benchmarks=randomtransaction
-optimistic_transaction_db=1 -statistics -stats_interval_seconds=1
-duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000
--transaction_set_snapshot
Reviewed By: ltamasi
Differential Revision: D23267395
Pulled By: akankshamahajan15
fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests. The DB functionality required for this test now passes the check.
When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283
Reviewed By: akankshamahajan15
Differential Revision: D23251497
Pulled By: ajkr
fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
Summary:
There's a potential deadlock caused by MockTimeEnv time value get to a large number, which causes TimedWait() wait forever. The test misuses the microseconds as seconds, making it more likely to happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7277
Reviewed By: pdillinger
Differential Revision: D23183873
Pulled By: jay-zhuang
fbshipit-source-id: 6fc38ebd40b4125a99551204b271f91a27e70086
Summary:
Seems it's only causing assert failure during compaction pick, but in production code, the problematic compactions are excluded at a later step.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7281
Reviewed By: akankshamahajan15
Differential Revision: D23228000
Pulled By: jay-zhuang
fbshipit-source-id: 2e4055aeebe0f5a2b07e299e0a2d51a1ad2e216d
Summary:
This diff contains following changes:
1. Replace `FSSequentialFile` pointer with `FSSequentialFilePtr` object that wraps `FSSequentialFile` pointer in `SequenceFileReader`.
Objective: If tracing is enabled, `FSSequentialFilePtr` returns `FSSequentialFileTracingWrapper` pointer that includes all necessary information in `IORecord` and calls underlying FileSystem and invokes `IOTracer` to dump that record in a binary file. If tracing is disabled then, underlying `FileSystem` pointer is returned directly. `FSSequentialFilePtr` wrapper class is added to bypass the `FSSequentialFileTracingWrapper` when tracing is disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7190
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
Reviewed By: anand1976
Differential Revision: D23059616
Pulled By: akankshamahajan15
fbshipit-source-id: 1564b94dd1297cd0fbfe2ed5c9cc3e20f7395301
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic
infrastructure mis-merge.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274
Test Plan: Run all existing files.
Reviewed By: pdillinger
Differential Revision: D23177444
fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33
Summary:
The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271
Reviewed By: riversand963
Differential Revision: D23169923
Pulled By: ajkr
fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50
Summary:
Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D22914704
Pulled By: riversand963
fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39
Summary:
The patch makes sure that the functionality required for the new integrated
BlobDB implementation (most importantly, the classes related to reading and
writing blob files) is also built in LITE mode by removing the corresponding
`#ifndef`s.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7272
Test Plan: Ran `make check` in both regular and LITE mode.
Reviewed By: zhichao-cao
Differential Revision: D23173280
Pulled By: ltamasi
fbshipit-source-id: 1596bd1a76409a8a6d83d8f1dbfe08bfdea7ffe6
Summary:
There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261
Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds.
Reviewed By: ajkr
Differential Revision: D23142269
fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91
Summary:
Looks like somebody simply missed initializing a member variable. The column family ID, cf_id, is not set during OnCompactionBegin. But it is set properly in the next function for OnCompactionCompleted. Need this cf_id for tracking progress of a Stardog optimize since there may be multiple compactions required for a given column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6938
Reviewed By: siying
Differential Revision: D23153235
Pulled By: ajkr
fbshipit-source-id: 932938de3a4ebbc7ac89702f655583862587d251
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223
Reviewed By: riversand963
Differential Revision: D23056737
Pulled By: jay-zhuang
fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
Summary:
If user-defined timestamp is enabled, current implementation can expose
newer data to queries even if an older sequence number is specified via
read_options.snapshot. This PR makes Get() respect sequence-number-based
snapshot.
Solution is simple. Besides using <ukey, ts, seq> to search the index for the key,
we also verify that the candidate result's seq is smaller than or equal to seq. This
requires passing a seq via `GetContext`, which results in the majority of code
change caused by this PR.
Also added a few unit tests to demonstrate standard visibility during point lookup
and range scan when timestamp and snapshot are both present.
Test plan (devserver):
```
make check
$./db_bench --benchmarks=fillseq,readrandom -cache_size=$[64*1024*1024]
```
Result
this PR: readrandom : 4.827 micros/op 207180 ops/sec; 22.9 MB/s (1000000 of 1000000 found)
master: readrandom : 4.936 micros/op 202610 ops/sec; 22.4 MB/s (1000000 of 1000000 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7227
Reviewed By: ltamasi
Differential Revision: D23015242
Pulled By: riversand963
fbshipit-source-id: ea7b85a728654553ba357d2e6a207b5e40f7376a
Summary:
Manual compaction with `CompactRangeOptions::change_levels` set could
refit to a level targeted by another manual compaction. If
force_consistency_checks were disabled, it could be possible for
overlapping files to be written at that target level.
This PR prevents the possibility by calling `DisableManualCompaction()`
prior to `ReFitLevel()`. It also improves the manual compaction disabling
mechanism to wait for pending manual compactions to complete before
returning, and support disabling from multiple threads.
Fixes https://github.com/facebook/rocksdb/issues/6432.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250
Test Plan:
crash test command that repro'd the bug reliably:
```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
```
Reviewed By: ltamasi
Differential Revision: D23090800
Pulled By: ajkr
fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
Summary:
Upgrade tool chain to the latest. It is done mostly manually as build_tools/build_detect_platform fails to update many of them.
Try to fix a new clang analyze warning with the new tool chain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7251
Test Plan: "make all", "USE_CLANG=1 make all"
Reviewed By: riversand963
Differential Revision: D23091090
fbshipit-source-id: 732e5a30137837431438f85f36296406b641f975
Summary:
The patch cleans up and refactors `CompressBlock` and `CompressBlockInternal` a bit.
In particular, it does the following:
* It renames `CompressBlockInternal` to `CompressData` and moves it to `util/compression.h`,
where other general compression-related utilities are located. This will facilitate reuse in the
BlobDB write path.
* The signature of the method is changed so it now takes `compression_format_version`
(similarly to the compression library specific methods) instead of `format_version` (which is
specific to the block based table).
* `GetCompressionFormatForVersion` no longer takes `compression_type` as a parameter.
This parameter was only used in a (not entirely up-to-date) assertion; also, removing it
eliminates the need to ensure this precondition holds at all call sites.
* Does some minor cleanup in `CompressBlock`, for instance, it is now possible to pass
only one of `sampled_output_fast` and `sampled_output_slow`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7249
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23087278
Pulled By: ltamasi
fbshipit-source-id: e6316e45baed8b4e7de7c1780c90501c2a3439b3
Summary:
As part of the IOTracing project, this PR
1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
2. FileSystemPtr object is created using FileSystem pointer and IOTracer
pointer.
3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
returns FileSystemTracingWrapper pointer for tracing purpose and when
it is disabled underlying FileSystem pointer is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
Reviewed By: anand1976
Differential Revision: D22987117
Pulled By: akankshamahajan15
fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.
Tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085
Test Plan: Passed make check
Reviewed By: pdillinger
Differential Revision: D22390756
Pulled By: gg814
fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.
This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.
More specifics:
Removed some unused test classes, and updated comments on the general
problem.
Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.
Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env
Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc
stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)
Intended follow-up:
Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)
With that change, we can simplify/consolidate the scattered work-arounds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101
Test Plan: make check on Linux and MacOS
Reviewed By: zhichao-cao
Differential Revision: D23032815
Pulled By: pdillinger
fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
Summary:
`VersionStorageInfo::AddFile` currently has a debug-mode consistency
check to make sure the newly added file does not overlap with the
previous one (for levels below L0). Considering that
`VersionBuilder::CheckConsistency` also performs similar checks (in
fact, those checks are more comprehensive and cover L0 as well), this
check is redundant. The patch removes it and also cleans up `AddFile` a
little.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7237
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23041937
Pulled By: ltamasi
fbshipit-source-id: e00665f3b83bfd17f86c54c238800f3d77d739bd
Summary:
Delete triggered compaction (DTC) for universal compaction style with ```num_levels = 1``` has been disabled for sometime due to a data correctness bug. This PR re-enables it with a bug fix. A file marked for compaction can be picked, along with all L0 files after it as the compaction input. We stop adding files to the input once we encounter a file already being compacted (the original bug failed to check the compaction status of the files).
Tests:
Add unit tests to ```compaction_picker_test.cc```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7224
Reviewed By: ajkr
Differential Revision: D23031845
Pulled By: anand1976
fbshipit-source-id: 9de3cab5f9774cede666c2c48d309a7d9b88a505
Summary:
The debug is supposed to print out two keys to show the value mismatch, which was compared just a few lines above.
However, the actual print-out is the same values (so they obviously won't be mismatched)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6587
Reviewed By: riversand963
Differential Revision: D23025279
Pulled By: ajkr
fbshipit-source-id: 4c6c35bc60b273f13c08b5464b6f690d8a5cfe41
Summary:
Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for
Iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161
Test Plan: New unit tests in db_basic_test
Reviewed By: riversand963
Differential Revision: D22687352
Pulled By: anand1976
fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec
Summary:
Pointed out by https://github.com/facebook/rocksdb/issues/7197 , there is a double lock in WriteImplWALOnly.
Also find another deadlock in UnorderedWriteMemtable. Move the check after switch_all_.notify_all().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7199
Test Plan: pass make check
Reviewed By: anand1976
Differential Revision: D22961714
Pulled By: zhichao-cao
fbshipit-source-id: 0707922dc50d28ea141a15a8cdcbd1c8993ea0d8
Summary:
A colon will be added after 'msg' automatically when invoke function Status(Code _code, const Slice& msg, const Slice& msg2),
it's not needed to append a colon explicitly to 'msg'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7041
Reviewed By: ajkr
Differential Revision: D22292801
fbshipit-source-id: 8f2d69065bb779d2613468bf9fc9169f32c3f1ec
Summary:
`WalAddition`, `WalDeletion` are defined in `wal_version.h` and used in `VersionEdit`.
`WalAddition` is used to represent events of creating a new WAL (no size, just log number), or closing a WAL (with size).
`WalDeletion` is used to represent events of deleting or archiving a WAL, it means the WAL is no longer alive (won't be replayed during recovery).
`WalSet` is the set of alive WALs kept in `VersionSet`.
1. Why use `WalDeletion` instead of relying on `MinLogNumber` to identify outdated WALs
On recovery, we can compute `MinLogNumber()` based on the log numbers kept in MANIFEST, any log with number < MinLogNumber can be ignored. So it seems that we don't need to persist `WalDeletion` to MANIFEST, since we can ignore the WALs based on MinLogNumber.
But the `MinLogNumber()` is actually a lower bound, it does not exactly mean that logs starting from MinLogNumber must exist. This is because in a corner case, when a column family is empty and never flushed, its log number is set to the largest log number, but not persisted in MANIFEST. So let's say there are 2 column families, when creating the DB, the first WAL has log number 1, so it's persisted to MANIFEST for both column families. Then CF 0 is empty and never flushed, CF 1 is updated and flushed, so a new WAL with log number 2 is created and persisted to MANIFEST for CF 1. But CF 0's log number in MANIFEST is still 1. So on recovery, MinLogNumber is 1, but since log 1 only contains data for CF 1, and CF 1 is flushed, log 1 might have already been deleted from disk.
We can make `MinLogNumber()` be the exactly minimum log number that must exist, by persisting the most recent log number for empty column families that are not flushed. But if there are N such column families, then every time a new WAL is created, we need to add N records to MANIFEST.
In current design, a record is persisted to MANIFEST only when WAL is created, closed, or deleted/archived, so the number of WAL related records are bounded to 3x number of WALs.
2. Why keep `WalSet` in `VersionSet` instead of applying the `VersionEdit`s to `VersionStorageInfo`
`VersionEdit`s are originally designed to track the addition and deletion of SST files. The SST files are related to column families, each column family has a list of `Version`s, and each `Version` keeps the set of active SST files in `VersionStorageInfo`.
But WALs are a concept of DB, they are not bounded to specific column families. So logically it does not make sense to store WALs in a column family's `Version`s.
Also, `Version`'s purpose is to keep reference to SST / blob files, so that they are not deleted until there is no version referencing them. But a WAL is deleted regardless of version references.
So we keep the WALs in `VersionSet` for the purpose of writing out the DB state's snapshot when creating new MANIFESTs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7164
Test Plan:
make version_edit_test && ./version_edit_test
make wal_edit_test && ./wal_edit_test
Reviewed By: ltamasi
Differential Revision: D22677936
Pulled By: cheng-chang
fbshipit-source-id: 5a3b6890140e572ffd79eb37e6e4c3c32361a859
Summary:
IteratorIterator::IsOutOfBound() and IteratorIterator::MayBeOutOfUpperBound() are two functions that related to upper bound check. It is hard for users to reason about this complexity. Consolidate the two functions into one and assign an enum as results to improve readability.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7200
Test Plan: Run all existing test. Would run crash test with atomic for a while.
Reviewed By: anand1976
Differential Revision: D22833181
fbshipit-source-id: a0c724267056adbd0476bde74650e6c7226077e6
Summary:
as title.
When ReadOptions.iter_start_ts is not nullptr, DBIter::key() should
return internal keys including value type.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7178
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D22935879
Pulled By: riversand963
fbshipit-source-id: 7508d962cf11ebcfa6386d2529b4f3606b47ccfd
Summary:
Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
and every `LevelIterator`. This redundancy consumes extra memory,
resulting in the `Arena` making more allocations, and iteration
observing worse cache performance.
This PR migrates callers of `NewInternalIterator()` and
`MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
outlive the returned iterator. When the iterator's lifetime will be managed by the
user, this lifetime guarantee is achieved by storing the `ReadOptions`
value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
`MakeInputIterator()` can hold a reference-to-const `ReadOptions`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210
Test Plan:
- `make check` under ASAN and valgrind
- benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.
Reviewed By: anand1976
Differential Revision: D22861323
Pulled By: ajkr
fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165
Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:
/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1
Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.
Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking
Reviewed By: ajkr
Differential Revision: D22693050
Pulled By: akabcenell
fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
Summary:
NextAndGetResult() is not implemented in memtable and is very simply implemented in level iterator. The result is that for a normal leveled iterator, performance regression will be observed for calling PrepareValue() for most iterator Next(). Mitigate the problem by implementing the function for both iterators. In level iterator, the implementation cannot be perfect as when calling file iterator's SeekToFirst() we don't have information about whether the value is prepared. Fortunately, the first key should not cause a big portion of the CPu.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7179
Test Plan: Run normal crash test for a while.
Reviewed By: anand1976
Differential Revision: D22783840
fbshipit-source-id: c19f45cdf21b756190adef97a3b66ccde3936e05
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.
One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176
Reviewed By: cheng-chang
Differential Revision: D22799278
Pulled By: ajkr
fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
Summary:
SST Partitioner interface that allows to split SST files during compactions.
It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957
Reviewed By: ajkr
Differential Revision: D22461239
fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
Summary:
BlobDB currently syncs each blob file periodically after writing a certain amount of
data (as specified by the configuration option `BlobDBOptions::bytes_per_sync`)
and all open blob files when the base DB's memtables are flushed. With the patch,
in addition to the above, blob files are also synced right before being closed, after
the footer has been written. This will be beneficial for the new integrated blob file
write path as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7160
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D22672646
Pulled By: ltamasi
fbshipit-source-id: 62b34263543a7e74abcbb7adf011daa1e699998f
Summary:
`DBTest::SnapshotFiles` runs the tests in a `while` loop.
Currently, the snapshot directory is not cleaned up in each loop, so previous snapshot files may remain in the next loop's snapshot.
When I'm working on https://github.com/facebook/rocksdb/pull/7129, when checking the tracked WALs in MANIFEST, I find that this test always fails because it reads some unknown WAL. It turns out that the unknown WAL is left from previous loops.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7156
Test Plan: make db_test && ./db_test --gtest_filters=*SnapshotFiles
Reviewed By: siying
Differential Revision: D22668360
Pulled By: cheng-chang
fbshipit-source-id: 69d4aa3506038ba30e218e8ae966357935a99c6c
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file. If the values do not match, the SST file is rejected.
Code put in place for the check for both flush and compaction jobs. Corresponding test added to corruption_test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134
Reviewed By: cheng-chang
Differential Revision: D22646149
fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
Summary:
Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem.
`steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153
Test Plan: make check.
Reviewed By: riversand963
Differential Revision: D22654136
Pulled By: roghnin
fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85
Summary:
TSAN reports warning in one column_family_test:
WARNING: ThreadSanitizer: data race (pid=16352)
Write of size 8 at 0x7ffcdf042158 by main thread:
#0 pthread_cond_destroy <null> (column_family_test+0x471f65)
https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::~CondVar() /home/circleci/project/port/port_posix.cc:101:49 (column_family_test+0x8a627a)
https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::~SleepingBackgroundTask() /home/circleci/project/./test_util/testutil.h:397:7 (column_family_test+0x54b6e2)
https://github.com/facebook/rocksdb/issues/3 rocksdb::ColumnFamilyTest_FlushCloseWALFiles_Test::TestBody() /home/circleci/project/db/column_family_test.cc:3008:1 (column_family_test+0x54b6e2)
......
Previous read of size 8 at 0x7ffcdf042158 by thread T2 (mutexes: write M0):
#0 pthread_cond_broadcast <null> (column_family_test+0x471dd2)
https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::SignalAll() /home/circleci/project/port/port_posix.cc:139:28 (column_family_test+0x8a651a)
https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::DoSleep() /home/circleci/project/./test_util/testutil.h:412:12 (column_family_test+0x58574b)
......
Likely, SleepingBackgroundTask::DoSleep() started to execute after the main thread has finished everything, cancelled and waited for sleeping tasks to finish. At this time, although DoSlee() will not sleep, but it also accesses the mutex, creating a data race with destructor of the test. Fix this bug by waiting for the sleeping task to start sleeping after it is scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7150
Test Plan: Run these modified tests and make sure it doesn't break.
Reviewed By: riversand963
Differential Revision: D22630716
fbshipit-source-id: cc5781cf69083685de406490438898238bdfc2d3
Summary:
Remove the 3 testing cases that cause the time out in linux build by https://github.com/facebook/rocksdb/issues/6765 . Will fix them later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7141
Test Plan: make asan_check, buck run
Reviewed By: ajkr
Differential Revision: D22593831
Pulled By: zhichao-cao
fbshipit-source-id: 14956c36476ecc3393f613178c22e13df843126e
Summary:
Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145
Reviewed By: riversand963
Differential Revision: D22593031
Pulled By: ajkr
fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6
Summary:
TSAN shows warning with clang with warning similar to this:
WARNING: ThreadSanitizer: data race (pid=10159)
Atomic write of size 8 at 0x7b5000002890 by thread T33:
#0 __tsan_atomic64_store <null> (db_test+0x4ca2b5)
https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<unsigned long>::store(unsigned long, std::memory_order) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:374:2 (db_test+0x774fde)
https://github.com/facebook/rocksdb/issues/2 rocksdb::VersionSet::SetLastSequence(unsigned long) /home/circleci/project/./db/version_set.h:1057:20 (db_test+0x774fde)
https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*) /home/circleci/project/db/db_impl/db_impl_write.cc:449:18 (db_test+0x774fde)
......
Previous read of size 8 at 0x7b5000002890 by thread T5 (mutexes: write M1044689462619020832):
#0 rocksdb::DBImpl::ReleaseSnapshot(rocksdb::Snapshot const*) /home/circleci/project/db/db_impl/db_impl.cc (db_test+0x6f4ae7)
https://github.com/facebook/rocksdb/issues/1 rocksdb::(anonymous namespace)::MTThreadBody(void*) /home/circleci/project/db/db_test.cc:2514:13 (db_test+0x56ac59)
https://github.com/facebook/rocksdb/issues/2 rocksdb::(anonymous namespace)::StartThreadWrapper(void*) /home/circleci/project/env/env_posix.cc:443:3 (db_test+0x88c4cd)
It is not limited to ReleaseSnapshot() and rocksdb::DBImpl::MultiCFSnapshot().
While we are not 100% sure it doesn't indicate any correctness violation, we suppress them for now to keep TSAN clean with more tests so that we can cover more bugs with CI.
In the gcc runs we have been running, this warning rarely shows up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7126
Test Plan: See the mini-TSAN test to pass with reasonable run time.
Reviewed By: ajkr
Differential Revision: D22552375
fbshipit-source-id: ebdd3854cb3becec3403970326a1ca961db2ab00
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
Reviewed By: anand1976
Differential Revision: D21916789
Pulled By: zhichao-cao
fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
Summary:
Currently, RocksDB lets compaction to go through even in case of
corrupted keys, the number of which is reported in CompactionJobStats.
However, RocksDB does not check this value. We should let compaction run
in a stricter mode.
Temporarily disable two tests that allow corrupted keys in compaction.
With this PR, the two tests will assert(false) and terminate. Still need
to investigate what is the recommended google-test way of doing it.
Death test (EXPECT_DEATH) in gtest has warnings now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D22530722
Pulled By: riversand963
fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6
Summary:
Periodic syncing of blob files is performed by `WritableFileWriter`;
`bytes_per_sync_` and `next_sync_offset_` in `BlobLogWriter` are
actually unused (or more precisely, only used by methods that are
themselves unused). The patch removes all this dead code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7125
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D22531021
Pulled By: ltamasi
fbshipit-source-id: 6b293ad5a79d3e6bf15c5c68f7aedd7ce7a15f10
Summary:
During memtable lookup, an unrecognized value type should be reported as
Status::Corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7121
Test Plan: make check
Reviewed By: cheng-chang
Differential Revision: D22512124
Pulled By: riversand963
fbshipit-source-id: 9b97be7d9b230c5aae9205f96054420e5ea09066
Summary:
There currently exist multiple `GetChildren()` calls in `DBImpl::Recover()`, which can be expensive in cases of distributed file systems.
This pull request try to call `DBImpl::Recover()` of each necessary directory only _once_ and reuse the results in the places of repeated calls in current code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7044
Test Plan:
Run `make check` and use the default test suite. The modified code should be semantically identical to the current code. As a proof of this solution, we may optionally deploy the system onto a (real or simulated) distributed system and expect reduced latency caused by manifest fetching.
(WIP)
Reviewed By: riversand963
Differential Revision: D22419925
Pulled By: roghnin
fbshipit-source-id: d3774fbfbc246c5527101bc16747eb5c90919886
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env
These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies. By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.
Tested both release and debug builds via Make and CMake for both static and shared libraries.
More work remains to clean up how the tools are built and remove some unnecessary dependencies. There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097
Reviewed By: riversand963
Differential Revision: D22463160
Pulled By: pdillinger
fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
Summary:
As title. The goal is to shorten the execution time of several tests
when they are combined together in a single TEST_F.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7105
Test Plan:
make db_wal_test
./db_wal_test
Reviewed By: ltamasi
Differential Revision: D22442705
Pulled By: riversand963
fbshipit-source-id: 0ad49b8f21fa86dcd5a4d3c9a06af313735ac217
Summary:
When format_version is high enough to support user-key and
there are index entries for same user key that spans multiple data
blocks then it changes from user-key mode to internal-key mode. But the
flush policy is not reset to point to Block Builder of internal-keys.
After this switch, no entries are added to user key index partition
result, thus it never triggers flushing the block.
Fix: 1. After adding the entry in sub_builder_index_, if there is a switch
from user-key to internal-key, then flush policy is updated to point to
Block Builder of internal-keys index partition.
2. Set sub_builder_index_->seperator_is_key_plus_seq_ = true if
seperator_is_key_plus_seq_ is set to true so that subsequent partitions
can also use internal key mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7096
Test Plan: make check -j64
Reviewed By: ajkr
Differential Revision: D22416598
Pulled By: akankshamahajan15
fbshipit-source-id: 01fc2dc07ea1b32f8fb803995ebe6e9a3fbe67ac
Summary:
Currently, `EventListener` in listner.h only have callback functions for file read and write. One may favor extended callback functions for more file I/O operations like flush, sync and close. This PR tries to add those interface and have them called when appropriate throughout the code base.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7055
Test Plan:
Write an experimental listener with those new callback functions with log output in them; run experiments and check logs to see those functions are actually called.
Default test suits `make check` should also be included.
Reviewed By: riversand963
Differential Revision: D22380624
Pulled By: roghnin
fbshipit-source-id: 4121491d45c2c2aae8c255e7998090559a241c6a
Summary:
When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory.
After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process.
No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function.
In addition, If `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files.
Corresponding tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015
Test Plan: Passed make check
Reviewed By: pdillinger
Differential Revision: D22165732
Pulled By: gg814
fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7049
Test Plan: Run all existing files.
Reviewed By: pdillinger
Differential Revision: D22301700
fbshipit-source-id: f9a9e3b3b26ce640665a47cb8bff33ba0c89b565
Summary:
When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block.
Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022
Test Plan:
1. make check -j64
2. Added one unit test case
Reviewed By: ajkr
Differential Revision: D22197734
Pulled By: akankshamahajan15
fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e
Summary:
This check is flaky because compaction could run between the `Flush()` and the `TestGetTickerCount()`, which would increase the `BLOCK_CACHE_INDEX_MISS` count beyond what the test expects. Verified by adding a `sleep(1)` between those two lines and observing the counter is too high every time. The solution is just to remove this check as it doesn't have any use anyways. The latter check of index miss is sufficient to conclude the newest L0 file (i.e., the one generated by intra-L0) does not have its index block pinned in cache. It'd be nice to simultaneously check the L0 files generated by flush do have their index blocks pinned in cache, but that's not what the line deleted in this PR was checking..
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7065
Reviewed By: pdillinger
Differential Revision: D22340327
Pulled By: ajkr
fbshipit-source-id: e076b2c7228b7fa763dd0c0cb13828e176c1abee
Summary:
Random memtable layouts could cause random failure,
reproducible with command below running for a while. Test now using
deterministic behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7064
Test Plan: while ./db_test --gtest_filter=*SizesMemTable*; do true; done
Reviewed By: siying
Differential Revision: D22339442
Pulled By: pdillinger
fbshipit-source-id: 8e74e5a9b5e88f7030854045a22c12cf561d5de6
Summary:
Change the linking of tests/tools to be against a library rather than a list of objects. This change substantially reduces the size of the objects produced.
peterd clean repo size: 264M
Before this change, with make all: 40G
After this change, with make all: 28G
With make LIB_MODE=shared all: 7.0G
The list of TESTS was changed from being hard-coded to generated from the test sources variable. Note that there are some test sources that are not built as tests (though the set of tests is identical to the previous version).
Added OBJ_DIR option to Makefile to allow objects to be placed in an alternative location. By default, OBJ_DIR is the same as before ("./").
This change is a precursor to being able to build/run the tests/tools linked against static libraries. Additionally, it should be possible to clean up and merge some of the rules for building tests and the like if so desired.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6660
Reviewed By: riversand963
Differential Revision: D22244463
Pulled By: pdillinger
fbshipit-source-id: db9c6341d81ed62c2270374f4ede02fb9604c754
Summary:
The earlier `VersionBuilder` code only cleaned up blob files that were
marked as entirely consisting of garbage using `VersionEdits` with
`BlobFileGarbage`. This covers the cases when table files go through
regular compaction, where we iterate through the KVs and thus have an
opportunity to calculate the amount of garbage (that is, most cases).
However, it does not help when table files are simply dropped (e.g. deletion
compactions or the `DeleteFile` API). To deal with such cases, the patch
adds logic that cleans up all blob files at the head of the list until the first
one with linked SSTs is found. (As an example, let's assume we have blob files
with numbers 1..10, and the first one with any linked SSTs is number 8.
This means that SSTs in the `Version` only rely on blob files with numbers >= 8,
and thus 1..7 are no longer needed.)
The code change itself is pretty small; however, changing the logic like this
necessitated changes to some tests that have been added recently (namely
to the ones that use blob files in isolation, i.e. without any table files referring
to them). Some of these cases were fixed by bypassing `VersionBuilder` altogether
in order to keep the tests simple (which actually makes them more proper unit tests
as well), while the `VersionBuilder` unit tests were fixed by adding dummy table
files to the test cases as needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7001
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D22119474
Pulled By: ltamasi
fbshipit-source-id: c6547141355667d4291d9661d6518eb741e7b54a
Summary:
WriteCallbackTest.WriteWithCallbackTest has a deep for-loop and in some cases runs very long. Parameterimized it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7037
Test Plan: Run the test and see it passes.
Reviewed By: ltamasi
Differential Revision: D22269259
fbshipit-source-id: a1b6687b5bf4609754833d14cf383d68bc7ab27a
Summary:
Added compaction filter support for BlobDB non-TTL values. Same as vanilla RocksDB, user compaction filter applies to all k/v pairs of the compaction for non-TTL values. It honors `min_blob_size`, which potentially results value transitions between inlined data and stored-in-blob data when size of value is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6850
Reviewed By: siying
Differential Revision: D22263487
Pulled By: ltamasi
fbshipit-source-id: 8fc03f8cde2a5c831e63b436b3dbf1b7f90939e8
Summary:
Fsyncing files is not providing more test coverage in many tests. Provide an option in SpecialEnv to turn it off to speed it up and enable this option in some tests with relatively long run time.
Most of those tests can be divided as parameterized gtest too. This two speed up approaches are orthogonal and we can do both if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7036
Test Plan: Run all tests and make sure they pass.
Reviewed By: ltamasi
Differential Revision: D22268084
fbshipit-source-id: 6d4a838a1b7328c13931a2a5d93de57aa02afaab
Summary:
Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.
The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.
Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982
Test Plan: Add new unit tests in db_basic_test
Reviewed By: riversand963
Differential Revision: D22219515
Pulled By: anand1976
fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
Summary:
After https://github.com/facebook/rocksdb/issues/6949 , VersionSet::io_status_ can be concurrently accessed by multiple
threads without lock, causing tsan test to fail. For example, a bg flush thread
resets io_status_ before calling LogAndApply(), while another thread already in
the process of LogAndApply() reads io_status_. This is a bug.
We do not have to reset io_status_ each time we call LogAndApply(). io_status_
is part of the state of VersionSet, and it indicates the outcome of preceding
MANIFEST/CURRENT files IO operations. Its value should be updated only when:
1. MANIFEST/CURRENT files IO fail for the first time.
2. MANIFEST/CURRENT files IO succeed as part of recovering from a prior
failure without process restart, e.g. calling Resume().
Test Plan (devserver):
COMPILE_WITH_TSAN=1 make check
COMPILE_WITH_TSAN=1 make db_test2
./db_test2 --gtest_filter=DBTest2.CompactionStall
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7034
Reviewed By: zhichao-cao
Differential Revision: D22247137
Pulled By: riversand963
fbshipit-source-id: 77b83e05390f3ee3cd2d96d3fdd6fe4f225e3216
Summary:
We are still keeping unity build working. So it's a good idea to add to a pre-commit CI.
A latest GCC docker image just to get a little bit more coverage. Fix three small issues to make it pass.
Also make unity_test to run db_basic_test rather than db_test to cut the test time. There is no point to run expensive tests here. It was set to run db_test before db_basic_test was separated out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7026
Test Plan: watch tests to pass.
Reviewed By: zhichao-cao
Differential Revision: D22223197
fbshipit-source-id: baa3b6cbb623bf359829b63ce35715c75bcb0ed4
Summary:
Some tests directly uses TmpDir() as temporary directory without adding any randomize factor. This would cause failures when tests run in parallel. Fix it by moving some of them to test::PerThreadDBPath()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7030
Test Plan: Watch existing tests pass
Reviewed By: zhichao-cao
Differential Revision: D22224710
fbshipit-source-id: 28c9932fede0a4a64670e5b5fdb08f4fb5dccdd0
Summary:
`BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`.
Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.
Three new test cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997
Test Plan: Passed make check.
Reviewed By: ajkr
Differential Revision: D22098895
Pulled By: gg814
fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.
Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
Reviewed By: anand1976
Differential Revision: D22026020
Pulled By: riversand963
fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
Summary:
It's useful to build RocksDB using a more recent clang version in CI. Add a CircleCI build and fix some issues with it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7025
Test Plan: See all tests pass.
Reviewed By: pdillinger
Differential Revision: D22215700
fbshipit-source-id: 914a729c2cd3f3ac4a627cc0ac58d4691dca2168
Summary:
The constant `kNoExpiration` is currently defined in an
internal/implementation header (`blob_log_format.h`); the patch moves it
to the public header `blob_db.h` so it is accessible to users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7018
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D22191354
Pulled By: ltamasi
fbshipit-source-id: 98c8012a83b999a3f1a30e955ce6bb71ba29dc5c
Summary:
FlushAfterIntraL0CompactionCheckConsistencyFail is flakey. It sometimes fails with:
db/db_compaction_test.cc:5186: Failure
Expected equality of these values:
10
NumTableFilesAtLevel(0)
Which is: 3
I don't see a clear reason why the assertion would always be true. The necessarily of the assertion is not clear either. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7003
Test Plan: See the test still builds.
Reviewed By: riversand963
Differential Revision: D22129753
fbshipit-source-id: 42f0bb05e32b369e8d726bfd3e35c29cf52fe008
Summary:
Although RocksDB falls over in various other ways with KVs
around 4GB or more, this change fixes how XXH32 and XXH64 were being
called by the block checksum code to support >= 4GB in case that should
ever happen, or the code copied for other uses.
This change is not a schema compatibility issue because the checksum
verification code would checksum the first (block_size + 1) mod 2^32
bytes while the checksum construction code would checksum the first
block_size mod 2^32 plus the compression type byte, meaning the
XXH32/64 checksums for >=4GB block would not match about 255/256 times.
While touching this code, I refactored to consolidate redundant
implementations, improving diagnostics and performance tracking in some
cases. Also used less confusing language in those diagnostics.
Makes https://github.com/facebook/rocksdb/issues/6875 obsolete.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6978
Test Plan:
I was able to write a test for this using an SST file writer
and VerifyChecksum in a reader. The test fails before the fix, though
I'm leaving the test disabled because I don't think it's worth the
expense of running regularly.
Reviewed By: gg814
Differential Revision: D22143260
Pulled By: pdillinger
fbshipit-source-id: 982993d16134e8c50bea2269047f901c1783726e
Summary:
The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ is now applicable to iterator too. This was not an issue but https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and DB flush happens just between iterator to get latest sequence number and getting super version, empty result for the key or an older value can be returned, which is wrong.
Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is to get the sequence number after referencing the super version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973
Test Plan: Will run stress tests for a while to make sure there is no general regression.
Reviewed By: ajkr
Differential Revision: D22029348
fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1
Summary:
https://github.com/facebook/rocksdb/issues/5411 refactored `VersionSet::Recover` but introduced a bug, explained as follows.
Before, once a checksum mismatch happens, `reporter` will set `s` to be non-ok. Therefore, Recover will stop processing the MANIFEST any further.
```
// Correct
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
while (reader.ReadRecord() && s.ok()) {
...
}
```
The bug is that, the local variable `s` in `ReadAndRecover` won't be updated by `reporter` while reading the MANIFEST. It is possible that the reader sees a checksum mismatch in a record, but `ReadRecord` retries internally read and finds the next valid record. The mismatched record will be ignored and no error is reported.
```
// Incorrect
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
s = ReadAndRecover(reader, ...);
// Inside ReadAndRecover
Status s; // Shadows the s in Recover.
while (reader.ReadRecord() && s.ok()) {
...
}
```
`LogReporter` can use a separate `log_read_status` to track the errors while reading the MANIFEST. RocksDB can process more MANIFEST entries only if `log_read_status.ok()`.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6996
Reviewed By: ajkr
Differential Revision: D22105746
Pulled By: riversand963
fbshipit-source-id: b22f717a423457a41ca152a242abbb64cf91fc38
Summary:
Compressed block cache is disabled in https://github.com/facebook/rocksdb/pull/4650 for no good reason. Re-enable it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6990
Test Plan: Add a unit test to make sure a general function works with read-only DB + compressed block cache.
Reviewed By: ltamasi
Differential Revision: D22072755
fbshipit-source-id: 2a55df6363de23a78979cf6c747526359e5dc7a1
Summary:
`db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.
The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.
In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.
A table property test is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983
Test Plan: make check and some manual tests.
Reviewed By: zhichao-cao
Differential Revision: D22048826
Pulled By: gg814
fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
Summary:
In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound will be overwritten silently. Fix this by checking all non-ok cases and return early.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6989
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D22071418
Pulled By: riversand963
fbshipit-source-id: 5a4ea5dfb1a41f41c7a3fdaf62b163007b42f04b
Summary:
Best-efforts recovery does not check the content of CURRENT file to determine which MANIFEST to recover from. However, it still checks the presence of CURRENT file to determine whether to create a new DB during `open()`. Therefore, we can tweak the logic in `open()` a little bit so that best-efforts recovery does not rely on CURRENT file at all.
Test plan (dev server):
make check
./db_basic_test --gtest_filter=DBBasicTest.RecoverWithNoCurrentFile
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6970
Reviewed By: anand1976
Differential Revision: D22013990
Pulled By: riversand963
fbshipit-source-id: db552a1868c60ed70e1f7cd252a3a076eb8ea58f
Summary:
Added DB::GetDbSessionId by using the same format and machinery as DB::GetDbIdentity.
The DB Session ID is generated (and therefore, updated) each time a DB object is opened. It is written to the LOG file right after the line of “DB SUMMARY”.
A test for the uniqueness, for different openings and during the same opening, is also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6959
Test Plan: Passed make check
Reviewed By: zhichao-cao
Differential Revision: D21951721
Pulled By: gg814
fbshipit-source-id: 958a48a612db49a39998ea703cded45987d3fa8b
Summary:
Persistent cache feature caused rocks db crash on windows. I posted a issue for it, https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for persistent cache. Looking repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix is adding "NewId()" function in "persistent_cache" and using it to generate prefix for persistent cache. In this PR, i also re-enable related test cases defined in "db_test2" and "persistent_cache_test" for windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932
Test Plan:
1. run related test cases in "db_test2" and "persistent_cache_test" on windows and see it passed.
2. manually run db_bench.exe with "read_cache_path" and verified.
Reviewed By: riversand963
Differential Revision: D21911608
Pulled By: cheng-chang
fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2
Summary:
The `FileMetaData` objects associated with table files already contain the
number of the oldest blob file referenced by the SST in question. This patch
adds the inverse mapping to `BlobFileMetaData`, namely the set of table file
numbers for which the oldest blob file link points to the given blob file (these
are referred to as *linked SSTs*). This mapping will be used by the GC logic.
Implementation-wise, the patch builds on the `BlobFileMetaDataDelta`
functionality introduced in https://github.com/facebook/rocksdb/pull/6835: newly linked/unlinked SSTs are
accumulated in `BlobFileMetaDataDelta`, and the changes to the linked SST set
are applied in one shot when the new `Version` is saved. The patch also reworks
the blob file related consistency checks in `VersionBuilder` so they validate the
consistency of the forward table file -> blob file links and the backward blob file ->
table file links for blob files that are part of the `Version`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6945
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D21912228
Pulled By: ltamasi
fbshipit-source-id: c5bc7acf6e729a8fccbb12672dd5cd00f6f000f8
Summary:
If `options.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery`, RocksDB stops replaying WAL once hitting an error and discards the rest of the WAL. This can lead to data loss if the error occurs at an offset smaller than the last sync'ed offset.
Ideally, RocksDB point-in-time recovery should permit recovery if the error occurs after last synced offset while fail recovery if error occurs before the last synced offset. However, RocksDB does not track the synced offset of WALs. Consequently, RocksDB does not know whether an error occurs before or after the last synced offset. An error can be one of the following.
- WAL record checksum mismatch. This can result from both corruption of synced data and dropping of unsynced data during shutdown. We cannot be sure which one. In order not to defeat the original motivation to permit the latter case, we keep the original behavior of point-in-time WAL recovery.
- IOError. This means the WAL can be bad, an indicator of whole file becoming unavailable, not to mention synced part of the WAL. Therefore, we choose to modify the behavior of point-in-time recovery and fail the database recovery.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6963
Reviewed By: ajkr
Differential Revision: D22011083
Pulled By: riversand963
fbshipit-source-id: f9cbf29a37dc5cc40d3fa62f89eed1ad67ca1536
Summary:
https://github.com/facebook/rocksdb/pull/6901 subtly changed the handling of the corner case
when a table file is deleted from a level, then re-added to the same level. (Note: this
should be extremely rare; one scenario that comes to mind is a trivial move followed by
a call to `ReFitLevel` that moves the file back to the original level.) Before that change,
a new `FileMetaData` object was created as a result of this sequence; after the change,
the original `FileMetaData` was essentially resurrected (since the deletion and the addition
simply cancel each other out with the change). This patch restores the original behavior,
which is more intuitive considering the interface, and in sync with how trivial moves are handled.
(Also note that `FileMetaData` contains some mutable data members, the values of which
might be different in the resurrected object and the freshly created one.)
The PR also fixes a bug in this area: with the original pre-6901 code, `VersionBuilder`
would add the same file twice to the same level in the scenario described above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6939
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D21905580
Pulled By: ltamasi
fbshipit-source-id: da07ae45384ecf3c6c53506d106432d88a7ec9df
Summary:
`DBTest2.CompressionFailures` currently tests many configurations
sequentially using nested loops, which often leads to timeouts
in our test system. The patch turns it into a parameterized test
instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6968
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D22006954
Pulled By: ltamasi
fbshipit-source-id: f71f2f7108086b7651ecfce3d79a7fab24620b2c
Summary:
Application can ingest SST files with file checksum information, such that during ingestion, DB is able to check data integrity and identify of the SST file. The PR introduces generate_and_verify_file_checksum to IngestExternalFileOption to control if the ingested checksum information should be verified with the generated checksum.
1. If generate_and_verify_file_checksum options is *FALSE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enables the SST file checksum and the checksum function name matches the checksum function name in DB, we trust the ingested checksum, store it in Manifest. If the checksum function name does not match, we treat that as an error and fail the IngestExternalFile() call.
2. If generate_and_verify_file_checksum options is *TRUE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enable the SST file checksum, we will use the checksum generator from DB to calculate the checksum for each ingested SST files after they are copied or moved. Then, compare the checksum results with the ingested checksum information: _A)_ if the checksum function name does not match, _verification always report true_ and we store the DB generated checksum information in Manifest. _B)_ if the checksum function name mach, and checksum match, ingestion continues and stores the checksum information in the Manifest. Otherwise, terminate file ingestion and report file corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6891
Test Plan: added unit test, pass make asan_check
Reviewed By: pdillinger
Differential Revision: D21935988
Pulled By: zhichao-cao
fbshipit-source-id: 7b55f486632db467e76d72602218d0658aa7f6ed
Summary:
This is required so that the test cases can safely be run in parallel.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6962
Test Plan: `make check`
Reviewed By: zhichao-cao
Differential Revision: D21980060
Pulled By: ltamasi
fbshipit-source-id: 616b7a0b686155d3874848b9098c67ad3f47efcc
Summary:
Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure
we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911
Test Plan: unit test
Reviewed By: siying
Differential Revision: D21835818
Pulled By: ajkr
fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4
Summary:
When MultiGet is called with duplicate keys, and the key matches the
largest key in an SST file and the value type is merge, only the first
instance of the duplicate key is returned with correct results. This is
due to the incorrect assumption that if a key in a batch is equal to the
largest key in the file, the next key cannot be present in that file.
Tests:
Add a new unit test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6953
Reviewed By: cheng-chang
Differential Revision: D21935898
Pulled By: anand1976
fbshipit-source-id: a2cc327a15150e23fd997546ca64d1c33021cb4c
Summary:
The patch adds a convenience method `GetFileMetaDataByNumber` that
builds on the `FileLocation` functionality introduced recently (see
https://github.com/facebook/rocksdb/pull/6862). This method makes it possible to
retrieve the `FileMetaData` directly as opposed to having to go through
`LevelFiles` and friends.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6940
Test Plan: `make check`
Reviewed By: cheng-chang
Differential Revision: D21905946
Pulled By: ltamasi
fbshipit-source-id: af99e19de21242b2b4a87594a535c6028d16ee72
Summary:
In db_options.c, we should avoid including header files in the `db` directory to avoid introducing unnecessary dependency. The reason why `version_edit.h` has been included in `db_options.cc` is because we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952
Reviewed By: zhichao-cao
Differential Revision: D21925341
Pulled By: riversand963
fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2
Summary:
RocksDB is an embedded library; we should not write to the application's
console. Note: in each case, the same information is returned in the form of a
`Status::Corruption` object.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6948
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D21914965
Pulled By: ltamasi
fbshipit-source-id: ae4b66789aa6b659eb8cc2ed4a048187962c86cc
Summary:
We currently do not have any validation that would ensure that the `FileMetaData`
objects are equivalent when a file gets deleted from the LSM tree and then re-added
(think trivial moves); however, if we did, this test case would be in violation. The patch
changes the values used in the test case so they are consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6942
Test Plan: `make check`
Reviewed By: zhichao-cao
Differential Revision: D21911366
Pulled By: ltamasi
fbshipit-source-id: 2f0486f8337373a6a111b6f28433d70507857104
Summary:
The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()``` which could be ```false``` either due to reaching the end of the index or, in case of partitioned index, it could be due to a checksum mismatch error when reading a 2nd level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller.
Tests:
Add a test in block_based_table_reader_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909
Reviewed By: pdillinger
Differential Revision: D21833922
Pulled By: anand1976
fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae
Summary:
Confusing checks for null that are never null
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6933
Test Plan: make check
Reviewed By: cheng-chang
Differential Revision: D21885466
Pulled By: pdillinger
fbshipit-source-id: 4b48e03c2a33727f2702b0d12292f9fda5a3c475
Summary:
As title. The prior change to the line is a typo. Fixing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6928
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D21873587
Pulled By: riversand963
fbshipit-source-id: f4837fc8792d7106bc230b7b499dfbb7a2847430
Summary:
DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true.
This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly.
Two tests for this updated behavior of DB::OpenForReadOnly are also added.
Other minor changes:
1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly;
2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing];
3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900
Test Plan: passed make check; also manually tested a few ldb subcommands
Reviewed By: pdillinger
Differential Revision: D21822188
Pulled By: gg814
fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e
Summary:
This reverts commit 8d87e9cea1.
Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923
Reviewed By: pdillinger
Differential Revision: D21864799
fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
Summary:
Added setting of zstd_max_train_bytes compression option parameter to c interop.
rocksdb_options_set_bottommost_compression_options was using bool parameter and thus not exported, updated it to unsigned char and added to c.h as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6796
Reviewed By: cheng-chang
Differential Revision: D21611471
Pulled By: ajkr
fbshipit-source-id: caaaf153de934837ad9af283c7f8c025ff0b0cf5
Summary:
People keep breaking the gcc 4.8 compilation due to different
warnings for shadowing member functions with locals. Adding to Travis
to keep compatibility. (gcc 4.8 is default on CentOS 7.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6915
Test Plan: local and Travis
Reviewed By: siying
Differential Revision: D21842894
Pulled By: pdillinger
fbshipit-source-id: bdcd4385127ee5d1cc222d87e53fb3695c32a9d4
Summary:
The patch cleans up the code and improves the consistency checks around
adding/deleting table files in `VersionBuilder`. Namely, it makes the checks
stricter and improves them in the following ways:
1) A table file can now only be deleted from the LSM tree using the level it
resides on. Earlier, there was some unnecessary wiggle room for
trivially moved files (they could be deleted using a lower level number than
the actual one).
2) A table file cannot be added to the tree if it is already present in the tree
on any level (not just the target level). The earlier code only had an assertion
(which is a no-op in release builds) that the newly added file is not already
present on the target level.
3) The above consistency checks around state transitions are now mandatory,
as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op
in release mode unless `force_consistency_checks` was set to `true`. The rationale
here is that assuming that the initial state is consistent, a valid transition leads to a
next state that is also consistent; however, an *invalid* transition offers no such
guarantee. Hence it makes sense to validate the transitions unconditionally,
and save `force_consistency_checks` for the paranoid checks that re-validate
the entire state.
4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862,
which enables us to efficiently look up the location (level and position within level)
of files in a `Version` by file number. This makes the consistency checks much more
efficient than the earlier `CheckConsistencyForDeletes`, which essentially
performed a linear search.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901
Test Plan:
Extended the unit tests and ran:
`make check`
`make whitebox_crash_test`
Reviewed By: ajkr
Differential Revision: D21822714
Pulled By: ltamasi
fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
Summary:
Because ARM and some other platforms have a larger cache line
size, they have a larger minimum filter size, which causes recently
added PartitionedMultiGet test in db_bloom_filter_test to fail on those
platforms. The code would actually end up using larger partitions,
because keys_per_partition_ would be 0 and never == number of keys
added.
The code now attempts to get as close as possible to the small target
size, while fully utilizing that filter size, if the target partition
size is smaller than the minimum filter size.
Also updated the test to break more uniformly across platforms
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6905
Test Plan: updated test, tested on ARM
Reviewed By: anand1976
Differential Revision: D21840639
Pulled By: pdillinger
fbshipit-source-id: 11684b6d35f43d2e98b85ddb2c8dcfd59d670817
Summary:
x.size() -1 or y - 1 can overflow to an extremely large value when x.size() pr y is 0 when they are unsigned type. The end condition of i in the for loop will be extremely large, potentially causes segment fault. Fix them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902
Test Plan: pass make asan_check
Reviewed By: ajkr
Differential Revision: D21843767
Pulled By: zhichao-cao
fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.
It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.
So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.
Also includes miscellaneous comment improvements / clarifications.
Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784
Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...
[ RUN ] DBTest.ApproximateSizesFilesWithErrorMargin
db/db_test.cc:1562: Failure
Expected: (size) <= (11 * 100), actual: 9478 vs 1100
Other tests updated to reflect consistent accounting of metadata.
Reviewed By: siying
Differential Revision: D21334706
Pulled By: pdillinger
fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
Summary:
`IterKey::UpdateInternalKey()` is an error-prone API as it's
incompatible with `IterKey::TrimAppend()`, which is used for
decoding delta-encoded internal keys. This PR stops using it in
`BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s
buffer when needed. The logic for safely getting a Slice with global
seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`.
`BinarySeek()` is also migrated to use this API (previously it ignored
global seqno entirely).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843
Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.
```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```
benchmark run command:
```
TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000
```
results:
| DB | code | throughput |
|---|---|---|
| normal_db | master | 267.9 |
| normal_db | PR6843 | 254.2 (-5.1%) |
| ingestion_db | master | 259.6 |
| ingestion_db | PR6843 | 250.5 (-3.5%) |
Reviewed By: pdillinger
Differential Revision: D21562604
Pulled By: ajkr
fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264
Summary:
Preliminary user-timestamp support for delete.
If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100.
Implementation
A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp.
The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, should we still reuse `kTypeDeletion`, we may drop the tombstone with a more recent timestamp, causing deleted keys to re-appear.
Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253
Reviewed By: ltamasi
Differential Revision: D20995328
Pulled By: riversand963
fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619
Summary:
Does what it says on the can: the patch adds a hash map to `VersionStorageInfo`
that maps file numbers to file locations, i.e. (level, position in level) pairs. This
will enable stricter consistency checks in `VersionBuilder`. The patch also fixes
all the unit tests that used duplicate file numbers in a version (which would trigger
an assertion with the new code).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6862
Test Plan:
`make check`
`make whitebox_crash_test`
Reviewed By: riversand963
Differential Revision: D21670446
Pulled By: ltamasi
fbshipit-source-id: 2eac249945cf33d8fb8597b26bfff5221e1a861a
Summary:
1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key.
2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826
Test Plan:
1. make check -j64
2. Add a new unit test case
Reviewed By: anand1976
Differential Revision: D21471483
Pulled By: akankshamahajan15
fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
Summary:
If Option.file_checksum_gen_factory is set, rocksdb generates the file checksum during flush and compaction based on the checksum generator created by the factory and store the checksum and function name in vstorage and Manifest.
This PR enable file checksum generation in SstFileWrite and store the checksum and checksum function name in the ExternalSstFileInfo, such that application can use them for other purpose, for example, ingest the file checksum with files in IngestExternalFile().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6859
Test Plan: add unit test and pass make asan_check.
Reviewed By: ajkr
Differential Revision: D21656247
Pulled By: zhichao-cao
fbshipit-source-id: 78a3570c76031d8832e3d2de3d6c79cdf2b675d0
Summary:
This patch is groundwork for an upcoming change to store the set of
linked SSTs in `BlobFileMetaData`. With the current code, a new
`BlobFileMetaData` object is created each time a `VersionEdit` touches
a certain blob file. This is fine as long as these objects are lightweight
and cheap to create; however, with the addition of the linked SST set, it would
be very inefficient since the set would have to be copied over and over again.
Note that this is the same kind of problem that `VersionBuilder` is solving
w/r/t `Version`s and files, and we can apply the same solution; that is, we can
accumulate the changes in a different mutable object, and apply the delta in
one shot when the changes are committed. The patch does exactly that by
adding a new `BlobFileMetaDataDelta` class to `VersionBuilder`. In addition,
it turns the existing `GetBlobFileMetaData` helper into `IsBlobFileInVersion`
(which is fine since that's the only thing the method's clients care about now),
and adds a couple of helper methods that can create a `BlobFileMetaData`
object from the `BlobFileMetaData` in the base (if applicable) and the delta
when the `Version` is saved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6835
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D21505187
Pulled By: ltamasi
fbshipit-source-id: d81a48c5f2ca7b79d7124c935332a6bcf3d5d988
Summary:
sst_dump can issue many file reads from the file system. This doesn't work well with file systems without a OS cache, especially remote file systems. In order to mitigate this problem, several improvements are done:
1. --readahead_size is added, so that users can specify readahead size when scanning the data.
2. Force a 512KB tail readahead, which prevents three I/Os for footer, meta index and property blocks and hopefully index and filter blocks too.
3. Consoldiate SSTDump's I/Os before opening the file for read. Use the same file prefetch buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836
Test Plan: Add a test that covers this new feature.
Reviewed By: pdillinger
Differential Revision: D21516607
fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
Summary:
Currently there is no check for whether BlockBasedTableBuilder will expose
compression error status if compression fails during the table building.
This commit adds fake faulting compressors and a unit test to test such
cases.
This check finds 5 bugs, and this commit also fixes them:
1. Not handling compression failure well in
BlockBasedTableBuilder::BGWorkWriteRawBlock.
2. verify_compression failing in BlockBasedTableBuilder when used with ZSTD.
3. Wrongly passing the same reference of block contents to
BlockBasedTableBuilder::CompressAndVerifyBlock in parallel compression.
4. Wrongly setting block_rep->first_key_in_next_block to nullptr in
BlockBasedTableBuilder::EnterUnbuffered when there are still incoming data
blocks.
5. Not maintaining variables for compression ratio estimation and first_block
in BlockBasedTableBuilder::EnterUnbuffered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6709
Reviewed By: ajkr
Differential Revision: D21236254
fbshipit-source-id: 101f6e62b2bac2b7be72be198adf93cd32a1ff46
Summary:
When using ldb, users cannot turn on force consistency check in most commands, while they cannot use checksonsistnecy with --try_load_options. The change fixes both by:
1. checkconsistency now calls OpenDB() so that it gets all the options loading and sanitized options logic
2. use options.check_consistency_checks = true by default, and add a --disable_consistency_checks to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802
Test Plan: Add a new unit test. Some manual tests with corrupted DBs.
Reviewed By: pdillinger
Differential Revision: D21388051
fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
Summary:
1. Update column_family_memtables_ to point to latest column_family_set in
version_set after recovery.
2. Normalize file paths passed by application so that directories end with '/'
or '\\'.
3. In addition to missing files, corrupted files are also ignored in
best-efforts recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6824
Test Plan: COMPILE_WITH_ASAN=1 make check
Reviewed By: anand1976
Differential Revision: D21463905
Pulled By: riversand963
fbshipit-source-id: c48db8843cc93c8c1c7139c474b64e6f775307d2
Summary:
Delete triggered compaction in universal compaction mode was causing a corruption when scheduled in parallel with other compactions.
1. When num_levels = 1, a file marked for compaction may be picked along with all older files in L0, without checking if any of them are already being compaction. This can cause unpredictable results like resurrection of older versions of keys or deleted keys.
2. When num_levels > 1, a delete triggered compaction would not get scheduled if it overlaps with a running regular compaction. However, the reverse is not true. This is due to the fact that in ```UniversalCompactionBuilder::CalculateSortedRuns```, it assumes that entire sorted runs are picked for compaction and only checks the first file in a sorted run to determine conflicts. This is violated by a delete triggered compaction as it works on a subset of a sorted run.
Fix the bug for num_levels > 1, and disable the feature for now when num_levels = 1. After disabling this feature, files would still get marked for compaction, but no compaction would get scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6799
Reviewed By: pdillinger
Differential Revision: D21431286
Pulled By: anand1976
fbshipit-source-id: ae9f0bdb1d6ae2f10284847db731c23f43af164a
Summary:
The error is assigning KeyContext::s to NotFound status in a
table reader for a "not found in this table" case, which skips searching
in later tables, like only a delete should. (The hash search index iterator
is the only one that can return status NotFound even if Valid() == false.)
This was detected by intermittent failure in
MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821
Test Plan: modified existing unit test to reproduce problem
Reviewed By: anand1976
Differential Revision: D21450469
Pulled By: pdillinger
fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c
Summary:
We found some files containing nothing but negative range tombstones,
and unsurprisingly their metadata specified a negative range, which made
things crash. Time to add a bit of user input validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6788
Reviewed By: zhichao-cao
Differential Revision: D21343719
Pulled By: ajkr
fbshipit-source-id: f1c16e4c3e9fa150958c8c866176632a3206fb74
Summary:
The patch extends `FindObsoleteFiles` and `PurgeObsoleteFiles` with
support for blob files. The behavior is analogous to SST files: obsolete
blob files are put on the "candidates for deletion" list, while live (and pending)
files are preserved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6807
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D21406249
Pulled By: ltamasi
fbshipit-source-id: 1948f71c31927564b61e8af394f50ca3964880d9
Summary:
Current DBCompactionTest.ConsistencyFailTest checks DB fails after L0 inconsitency is found. Add slightly more coverage by introducing DBCompactionTest.ConsistencyFailTest2 which checks non-L0 files too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6800
Test Plan: Run the new test.
Reviewed By: riversand963
Differential Revision: D21384806
fbshipit-source-id: 36db7b657eed42115283fe2f6afa4c3a31a3b510
Summary:
Add methods and constructors for handling enums to the OptionTypeInfo. This change allows enums to be converted/compared without adding a special "type" to the OptionType.
This change addresses a couple of issues:
- It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods)
- It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes
- It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values
- It reduces the number of types and cases that need to be handled in the various OptionType methods
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423
Reviewed By: siying
Differential Revision: D21408713
Pulled By: zhichao-cao
fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626
Summary:
This very old test code bug was causing a new valgrind failure
in MultiGetDeadlineExceeded
Also fix hang in MultiGetDeadlineExceeded by unifying with some logic from another test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6803
Test Plan: run that unit test under valgrind, make check
Reviewed By: siying
Differential Revision: D21388470
Pulled By: pdillinger
fbshipit-source-id: 0ce99d6d5eb8cd3195b17406892c8c5cff5fa5dd
Summary:
With consistency check enabled, VersionBuilder::SaveTo() may return error once
corruption is detected while building versions. We should handle these errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6801
Test Plan: make check
Reviewed By: siying
Differential Revision: D21385045
Pulled By: riversand963
fbshipit-source-id: 98f6424e2a4699b62befa21e9fe00e70a771118e
Summary:
Current impl. of db_stress will abort verification and report failure if
GetLiveFiles() causes a dropped column family to be flushed. This is not
desired.
To fix, this PR makes the following change:
In GetLiveFiles, if flush is triggered and returns
Status::IsColumnFamilyDropped(), then set status to Status::OK().
This is OK because dropped column families will be skipped during the rest of
this function, and valid column families will have their live files returned to
caller.
Test plan (dev server):
make check
./db_stress -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
./db_stress -disable_wal=1 -reopen=0 -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6805
Reviewed By: ltamasi
Differential Revision: D21390044
Pulled By: riversand963
fbshipit-source-id: de67846b95a4f1b88aa0a30c3d70c43cc68625b9
Summary:
The patch adds logic that returns the set of live blob files from
`Version::AddLiveFiles` and `VersionSet::AddLiveFiles` (in addition to
live table files), and also cleans up the code a bit, for example, by
exposing only the numbers of table files as opposed to the earlier
`FileDescriptor`s that no clients used. Moreover, the patch extends
the `GetLiveFiles` API so that it also exposes blob files in the current version.
Similarly to https://github.com/facebook/rocksdb/pull/6755,
this is a building block for identifying and purging obsolete blob files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6785
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D21336210
Pulled By: ltamasi
fbshipit-source-id: fc1aede8a49eacd03caafbc5f6f9ce43b6270821
Summary:
We are swallowing some file consistency checking failures. This is not expected. We are fixing two cases: DB reopen and manifest dump.
More places are not fixed and need follow-up.
Error from CheckConsistencyForDeletes() is also swallowed, which is not fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6793
Test Plan: Add a unit test to cover the reopen case.
Reviewed By: riversand963
Differential Revision: D21366525
fbshipit-source-id: eb438a322237814e8d5125f916a3c6de97f39ded
Summary:
The current way of implementing CompressionOptions.parallel_threads introduces a format change. We plan to change CompressionOptions's serailization format to a new JSON-like format, which would be another format change. We would like to consolidate the two format changes into one, rather than making some users to change twice. Hold CompressionOptions.parallel_threads from being supported by option string for now. Will add it back after the general CompressionOptions's format change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6782
Test Plan: Run all existing tests.
Reviewed By: zhichao-cao
Differential Revision: D21338614
fbshipit-source-id: bca2dac3cb37d4e6e64b52cbbe8ea749cd848685
Summary:
Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.
For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR.
Tests:
Update existing unit tests to verify the correct timeout value is being passed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751
Reviewed By: riversand963
Differential Revision: D21285631
Pulled By: anand1976
fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
Summary:
The patch adds logic to keep track of obsolete blob files. A blob file becomes
obsolete when the last `shared_ptr` that points to the corresponding
`SharedBlobFileMetaData` object goes away, which, in turn, happens when the
last `Version` that contains the blob file is destroyed. No longer needed blob
files are added to the obsolete list in `VersionSet` using a custom deleter to
avoid unnecessary coupling between `SharedBlobFileMetaData` and `VersionSet`.
Obsolete blob files are returned by `VersionSet::GetObsoleteFiles` and stored
in `JobContext`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6755
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D21233155
Pulled By: ltamasi
fbshipit-source-id: 47757e06fdc0127f27ed57f51abd27893d9a7b7a
Summary:
The issue is reported in https://github.com/facebook/rocksdb/issues/6753 . size_t is unsigned and if sorted_file.size() is 0, the end condition of i will be extremely large, cause segment fault in sorted_files[i] and sorted_files[i+1]. Added condition to fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6762
Test Plan: make asan_check
Reviewed By: pdillinger
Differential Revision: D21323063
Pulled By: zhichao-cao
fbshipit-source-id: 56ce59201949ed319448228553202b8642c2cc3a
Summary:
The dynamic_cast in the filter benchmark causes release mode to fail due to
no-rtti. Replace with static_cast_with_check.
Signed-off-by: Derrick Pallas <derrick@pallas.us>
Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732
Reviewed By: ltamasi
Differential Revision: D21304260
Pulled By: pdillinger
fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c
Summary:
Added functions for parsing, serializing, and comparing elements to OptionTypeInfo. These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map. Using these functions, every type can be handled via the map rather than special cased.
By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422
Test Plan: pass make check
Reviewed By: siying
Differential Revision: D21269005
Pulled By: zhichao-cao
fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077
Summary:
Since read threads do not coordinate on loading data into block
cache, two threads between Lookup and Insert can end up loading and
inserting the same data. This is particularly concerning with
cache_index_and_filter_blocks since those are hot and more likely to
be race targets if ejected from (or not pre-populated in) the cache.
Particularly with moves toward disaggregated / network storage, the cost
of redundant retrieval might be high, and we should at least have some
hard statistics from which we can estimate impact.
Example with full filter thrashing "cliff":
$ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
...
$ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
rocksdb.block.cache.add COUNT : 14181
rocksdb.block.cache.add.failures COUNT : 0
rocksdb.block.cache.add.redundant COUNT : 476
rocksdb.block.cache.data.add COUNT : 12749
rocksdb.block.cache.data.add.redundant COUNT : 18
rocksdb.block.cache.filter.add COUNT : 1003
rocksdb.block.cache.filter.add.redundant COUNT : 217
rocksdb.block.cache.index.add COUNT : 429
rocksdb.block.cache.index.add.redundant COUNT : 241
$ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
rocksdb.block.cache.add COUNT : 1182223
rocksdb.block.cache.add.failures COUNT : 0
rocksdb.block.cache.add.redundant COUNT : 302728
rocksdb.block.cache.data.add COUNT : 31425
rocksdb.block.cache.data.add.redundant COUNT : 12
rocksdb.block.cache.filter.add COUNT : 795455
rocksdb.block.cache.filter.add.redundant COUNT : 130238
rocksdb.block.cache.index.add COUNT : 355343
rocksdb.block.cache.index.add.redundant COUNT : 172478
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681
Test Plan: Some manual testing (above) and unit test covering key metrics is included
Reviewed By: ltamasi
Differential Revision: D21134113
Pulled By: pdillinger
fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
Summary:
In crash test, the db directory might be set to /dev/shm or /tmp, in certain environments such as internal testing infrastructure, neither of these directories support direct IO, so direct IO is never enabled in crash test.
This PR sets up SyncPoints in direct IO related code paths to disable O_DIRECT flag in calls to `open`, so the direct IO code paths will be executed, all direct IO related assertions will be checked, but no real direct IO request will be issued to the file system.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6727
Test Plan:
export CRASH_TEST_EXT_ARGS="--use_direct_reads=1 --mmap_read=0"
make -j24 crash_test
Reviewed By: zhichao-cao
Differential Revision: D21139250
Pulled By: cheng-chang
fbshipit-source-id: db9adfe78d91aa4759835b1af91c5db7b27b62ee
Summary:
After a successful recovery, the CURRENT file should be updated to point to the valid MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6746
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D21189876
Pulled By: riversand963
fbshipit-source-id: 7537b49988c5c425ebe9505a5cc260de351ad79b
Summary:
The patch makes a couple of small cleanups to `SharedBlobFileMetaData` and `BlobFileMetaData`:
* It makes the constructors private and introduces factory methods to ensure these objects are always owned by `shared_ptr`s. Note that `SharedBlobFileMetaData` has an additional factory that takes a deleter object; we can utilize this to e.g. notify `VersionSet` when a blob file becomes obsolete (which is exactly when `SharedBlobFileMetaData` is destroyed).
* It disables move operations explicitly instead of relying on them being suppressed because of a user-declared destructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6749
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D21206947
Pulled By: ltamasi
fbshipit-source-id: 9094c14cc335b3e226f883e5a0df4f87a5cdeb95
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389
Reviewed By: siying
Differential Revision: D21163707
Pulled By: zhichao-cao
fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
Summary:
Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This
implementation enforces the deadline in DBImpl, which is fairly high
level. Its best effort and may not check the deadline after every key
lookup, but may do so after a batch of keys.
In subsequent stages, we will extend this to passing a timeout down to the FileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710
Test Plan: Add new unit tests
Reviewed By: riversand963
Differential Revision: D21149158
Pulled By: anand1976
fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e
Summary:
1. Add changes so that max_background_flushes can be set dynamically.
2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the
max_background_flushes dynamically using SetDBOptions.
TestPlan: 1. make -j64 check
2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701
Reviewed By: ajkr
Differential Revision: D21028010
Pulled By: akankshamahajan15
fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:
* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697
Test Plan: make check and CI
Reviewed By: cheng-chang
Differential Revision: D21020318
Pulled By: pdillinger
fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
Summary:
Invariant checking should use internal key comparator rather than
`sstableKeyCompare()`. The latter was intended for checking whether a
compaction input file's neighboring files need to be included in the
same compaction. Using it for invariant checking was leading to false
positives for files with overlapping endpoints.
Fixes https://github.com/facebook/rocksdb/issues/6647.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6663
Test Plan: regression test
Reviewed By: zhichao-cao
Differential Revision: D20910466
Pulled By: ajkr
fbshipit-source-id: f0b70dad7c4096fce635cab7a36f16e14f74ae3f
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
Summary:
Towards making compaction logic compatible with user timestamp.
When computing boundaries and overlapping ranges for inputs of compaction, We need to compare SSTs by user key without timestamp.
Test plan (devserver):
```
make check
```
Several individual tests:
```
./version_set_test --gtest_filter=VersionStorageInfoTimestampTest.GetOverlappingInputs
./db_with_timestamp_compaction_test
./db_with_timestamp_basic_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6645
Reviewed By: ltamasi
Differential Revision: D20960012
Pulled By: riversand963
fbshipit-source-id: ad377fa9eb481bf7a8a3e1824aaade48cdc653a4
Summary:
Summary : Set manual_compaction false in case of DeleteTriggeredCompaction object so that kFilesMarkedForComapaction can be reported.
Added a DeletionTriggeredUniversalCompactionMarking test case for Deletion Triggered compaction in case of Universal Compaction.
Test Plan : make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6680
Test Plan: make check -j64
Reviewed By: anand1976
Differential Revision: D20945946
Pulled By: akankshamahajan15
fbshipit-source-id: af84e417bd7127652aaae9143c560d1ab3815d25
Summary:
(Based on Yanqin's idea) Add a new field in readoptions as lower timestamp bound for iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serves as a bounded time window. The iterator returns all versions of a record falling in the window.
SeekRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit e860f8840):
seekrandom : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found)
This PR:
seekrandom : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found)
db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544
Reviewed By: ltamasi
Differential Revision: D20844069
Pulled By: riversand963
fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74
Summary:
On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key.
The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669
Test Plan:
New unit test
Signed-off-by: Yi Wu <yiwu@pingcap.com>
Reviewed By: cheng-chang
Differential Revision: D20931808
Pulled By: ajkr
fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
Summary:
The raw key bytes are currently dumped directly into the log messages,
which is not ideal if the keys aren't ASCII strings. Null bytes in
particular can cut off bits of the message early.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6616
Reviewed By: ajkr
Differential Revision: D20879218
Pulled By: anand1976
fbshipit-source-id: 825a20715fe6d8012c0163c6e7b8159f7926a1a7
Summary:
This is a predecessor to the Configurable PR. This change moves the OptionTypeInfo maps closer to where they will be used.
When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198
Reviewed By: siying
Differential Revision: D20778108
Pulled By: zhichao-cao
fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa
Summary:
I suspect LRUCache could use some optimization, and to support
such an effort, a good benchmarking tool is needed. The existing
cache_bench was heavily skewed toward insertion and lookup misses, and
did not saturate memory with other work. This change should improve
those things to better resemble a real workload.
(All below using clang compiler, for some consistency, but not
necessarily same version and settings.)
The real workload is from production MySQL on RocksDB, filtering stacks
containing "LRU", "ShardedCache" or "CacheShard."
Lookup inclusive: 66%
Insert inclusive: 17%
Release inclusive: 15%
An alternate simulated workload is MySQL running a LinkBench read test:
Lookup inclusive: 54%
Insert inclusive: 24%
Release inclusive: 21%
cache_bench default settings, prior to this change:
Lookup inclusive: 35.8%
Insert inclusive: 63.6%
Release inclusive: 0%
cache_bench after this change (intended as somewhat "tighter" workload
than average production, more like LinkBench):
Lookup inclusive: 52%
Insert inclusive: 20%
Release inclusive: 26%
And top exclusive stacks (portion of stack samples as filtered above):
Production MySQL:
LRUHandleTable::FindPointer: 25.3%
rocksdb::operator==: 15.1% <-- Slice ==
LRUCacheShard::LRU_Remove: 13.8%
ShardedCache::Lookup: 8.9%
__pthread_mutex_lock: 7.1%
LRUCacheShard::LRU_Insert: 6.3%
MurmurHash64A: 4.8% <-- Since upgraded to XXH3p
...
Old cache_bench:
LRUHandleTable::FindPointer: 23.6%
__pthread_mutex_lock: 15.0%
__pthread_mutex_unlock_usercnt: 11.7%
__lll_lock_wait: 8.6%
__lll_unlock_wake: 6.8%
LRUCacheShard::LRU_Insert: 6.0%
ShardedCache::Lookup: 4.4%
LRUCacheShard::LRU_Remove: 2.8%
...
rocksdb::operator==: 0.2% <-- Slice ==
...
New cache_bench:
LRUHandleTable::FindPointer: 22.8%
__pthread_mutex_unlock_usercnt: 14.3%
rocksdb::operator==: 10.5% <-- Slice ==
LRUCacheShard::LRU_Insert: 9.0%
__pthread_mutex_lock: 5.9%
LRUCacheShard::LRU_Remove: 5.0%
...
ShardedCache::Lookup: 2.9%
...
So there's a bit more lock contention in the benchmark than in
production, but otherwise looks similar enough to me. At least it's a
big improvement over the existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6629
Test Plan: No production code changes, ran cache_bench with ASAN
Reviewed By: ltamasi
Differential Revision: D20824318
Pulled By: pdillinger
fbshipit-source-id: 6f8dc5891ead0f87edbed3a615ecd5289d9abe12
Summary:
In CompactionManifestWriteRetryableError in error_handler_fs_test, the manifest write of flush should pass with no fs error. After flush, fs is set to error status and the manifest write of compaction should fail due to the IO Error. Currently, the manifest write of flush is not synced with the compaction in order, which might cause manifest write fails, which will cause test failure. Fixed by adding the LoadDependency of sync-point after flush and before compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6637
Test Plan: pass error_hanlder_fs_tes. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20826969
Pulled By: zhichao-cao
fbshipit-source-id: fb2e702caa19bd63c82570320536b7acda870ff1
Summary:
With https://github.com/facebook/rocksdb/issues/6262, TSAN complains about data race of some variables. Those variables are used to estimate file size and are accessed in writer and background threads. Since file size estimation doesn't have to be 100% accurate, we make some variables atomic and use relaxed memory order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6636
Test Plan: Run all tests with TSAN.
Reviewed By: anand1976
Differential Revision: D20820635
fbshipit-source-id: 1ea45ff38be15e33674ffe06b7d42fc9fe161ea5
Summary:
Does what it says on the can. Similarly to table files, we need to re-persist
the metadata of live blob files whenever a new manifest file is opened.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6630
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D20802126
Pulled By: ltamasi
fbshipit-source-id: 5738692d898790293bf09d66e9997369bbf89566