Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D24485420
Pulled By: zhichao-cao
fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
Summary:
Remove function calling in assert statement as assert is a no
op in opt build and that function might not be called. This causes hang
in closing RocksDB when refit level is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7581
Test Plan: make check -j64
Reviewed By: riversand963
Differential Revision: D24466420
Pulled By: akankshamahajan15
fbshipit-source-id: 97db4ec5a95ae693c3290e176a3c12a9b1ad2f6d
Summary:
This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479
Test Plan: Add new unit tests
Reviewed By: pdillinger
Differential Revision: D24340671
Pulled By: anand1976
fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.
TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24260219
Pulled By: ltamasi
fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist. Added tests for the LoadOptions related methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554
Reviewed By: akankshamahajan15
Differential Revision: D24298985
Pulled By: zhichao-cao
fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
Summary:
Added unit tests that have paranoid_check = true that perform range deletions. At the moment, the deleted ranges do not appear to be checked as part of the paranoid checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7521
Reviewed By: zhichao-cao
Differential Revision: D24262175
Pulled By: ajkr
fbshipit-source-id: 1035e968f7ab8ccaa7af086b835a4e72c7e56743
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520
Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`
Reviewed By: pdillinger
Differential Revision: D24200034
Pulled By: ajkr
fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
Summary:
This option determines whether WALs will be tracked in MANIFEST and verified on recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7275
Test Plan:
db_options_test
options_test
Reviewed By: pdillinger
Differential Revision: D23181418
Pulled By: cheng-chang
fbshipit-source-id: 5dd1cdc166f3dfc1c93c094df4a2f7734e3b4547
Summary:
The `std::pair(const T1& x, const T2& y);` constructor is `constexpr`
only starting from C++14; relying on this breaks compilation on certain
compilers/platforms we need to support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7519
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D24195747
Pulled By: ltamasi
fbshipit-source-id: 665e8fbc9747675bb49c5d895aad3dcf2714750f
Summary:
The patch does some cleanup in and around the legacy `BlobLogReader` class:
* It renames the class to `BlobLogSequentialReader` to emphasize that it is for
sequentially iterating through blobs in a blob file, as opposed to doing random
point reads using `BlobIndex`es (which is `BlobFileReader`'s jurisdiction).
* It removes some dead code from the old BlobDB implementation that references
`BlobLogReader` (namely the method `BlobFile::OpenRandomAccessReader`).
* It cleans up some `#include`s and forward declarations.
* It fixes some incorrect/outdated comments related to the reader class.
* It adds a few assertions to the `Read` methods of the class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7517
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24172611
Pulled By: ltamasi
fbshipit-source-id: 43e2ae1eba5c3dd30c1070cb00f217edc45bd64f
Summary:
The patch adds a class called `BlobFileReader` that can be used to retrieve blobs
using the information available in blob references (e.g. blob file number, offset, and
size). This will come in handy when implementing blob support for `Get`, `MultiGet`,
and iterators, and also for compaction/garbage collection.
When a `BlobFileReader` object is created (using the factory method `Create`),
it first checks whether the specified file is potentially valid by comparing the file
size against the combined size of the blob file header and footer (files smaller than
the threshold are considered malformed). Then, it opens the file, and reads and verifies
the header and footer. The verification involves magic number/CRC checks
as well as checking for unexpected header/footer fields, e.g. incorrect column family ID
or TTL blob files.
Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression
type passed by the caller (because of the presence of the header and footer, the
specified offset cannot be too close to the start/end of the file; also, the compression type
has to match the one in the blob file header), and retrieves and potentially verifies and
uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set,
`BlobFileReader` reads the blob record header as well (as opposed to just the blob itself)
and verifies the key/value size, the key itself, as well as the CRC of the blob record header
and the key/value pair.
In addition, the patch exposes the compression type from `BlobIndex` (both using an
accessor and via `DebugString`), and adds a blob file read latency histogram to
`InternalStats` that can be used with `BlobFileReader`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23999219
Pulled By: ltamasi
fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
1. Number of index and filter blocks read from file as part of MultiGet
request per level.
2. Number of data blocks read from file per level.
3. Number of SST files loaded from file system per level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366
Reviewed By: anand1976
Differential Revision: D24127040
Pulled By: akankshamahajan15
fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
Summary:
If `BottommostLevelCompaction.kForce*` is set, compaction should avoid
trivial move and always compact the sst to the target size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7368
Reviewed By: ajkr
Differential Revision: D23629525
Pulled By: jay-zhuang
fbshipit-source-id: 79f23c79ecb31587e0593b28cce43131107bbcd0
Summary:
In opt mode, assertions are just no-ops. Therefore, we need to report errors instead of just doing an `assert(false)`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7483
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D24142725
Pulled By: riversand963
fbshipit-source-id: 5629556dbe29f00dd09e30a7d5df5e6cf09ee435
Summary:
`BeginWriteStall()` removes no_slowdown write from the write
list and updates `link_newer`, which makes `CreateMissingNewerLinks()`
thought all write list has valid `link_newer` and failed to create link
for all writers.
It caused flaky test and SegFault for release build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7508
Test Plan: Add unittest to reproduce the issue.
Reviewed By: anand1976
Differential Revision: D24126601
Pulled By: jay-zhuang
fbshipit-source-id: f8ac5dba653f7ee1b0950296427d4f5f8ee34a06
Summary:
Fix prefix_test so that it passes when ASSERT_STATUS_CHECKED=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7495
Test Plan: Run the test with the option
Reviewed By: anand1976
Differential Revision: D24069715
fbshipit-source-id: 54f74b58575a1b49dbdee9ea2d24751fa956b620
Summary:
This PR schedules a background thread (shared across all DB instances)
to flush info log every ten seconds. This improves debuggability in case
of RocksDB hanging since it ensures the log messages leading up to the hang
will eventually become visible in the log.
The bulk of this PR is moving monitoring/stats_dump_scheduler* to db/periodic_work_scheduler*
and making the corresponding name changes since now the scheduler handles info
log flushing, not just stats dumping.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7488
Reviewed By: riversand963
Differential Revision: D24065165
Pulled By: ajkr
fbshipit-source-id: 339c47a0ff43b79fdbd055fbd9fefbb6f9d8d3b5
Summary:
Add all status handling in db_properties_test so that it can pass ASSERT_STATUS_CHECKED.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7490
Test Plan: Run the test with ASSERT_STATUS_CHECKED
Reviewed By: jay-zhuang
Differential Revision: D24065382
fbshipit-source-id: e008916155196891478c964df0226545308ca71d
Summary:
This exposes to the listener interface whether a compaction was
full or not. Also cleaned up API comment for CompactionJobInfo::stats,
which is not of a nullable type. And since CompactionJob is always
created with non-null CompactionJobStats, removed conditionals on it
being nullptr and instead assert non-null.
TODO later: update C and Java interfaces
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451
Test Plan: updated existing unit tests to check new field, make check
Reviewed By: ltamasi
Differential Revision: D23977796
Pulled By: pdillinger
fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d
Summary:
`DBTest.FileCreationRandomFailure` frequently times out during our
continuous test runs. (It's a case of "stress test posing as unit test.")
The patch reduces the number of iterations to avoid this. Note that
the lower numbers are still sufficient to trigger both flushes and
compactions, so test coverage is still the same.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7481
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24034712
Pulled By: ltamasi
fbshipit-source-id: 8731a9446e5a121a1041b00f0df473b9f714935a
Summary:
Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated.
Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467
Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled.
Reviewed By: riversand963
Differential Revision: D24010683
fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c
Summary:
This PR addresses some build and functional issues on MSVC targets, as a step towards an eventual goal of having RocksDB build successfully for Windows on ARM64.
Addressed issues include:
- BitsSetToOne and CountTrailingZeroBits do not compile on non-x64 MSVC targets. A fallback implementation of BitsSetToOne when Intel intrinsics are not available is added, based on the C++20 `<bit>` popcount implementation in Microsoft's STL.
- The implementation of FloorLog2 for MSVC targets (including x64) gives incorrect results. The unit test easily detects this, but CircleCI is currently configured to only run a specific set of tests for Windows CMake builds, so this seems to have been unnoticed.
- AsmVolatilePause does not use YieldProcessor on Windows ARM64 targets, even though it is available.
- When CondVar::TimedWait calls Microsoft STL's condition_variable::wait_for, it can potentially trigger a bug (just recently fixed in the upcoming VS 16.8's STL) that deadlocks various tests that wait for a timer to execute, since `Timer::Run` doesn't get a chance to execute before being blocked by the test function acquiring the mutex.
- In c_test, `GetTempDir` assumes a POSIX-style temp path.
- `NormalizePath` did not eliminate consecutive POSIX-style path separators on Windows, resulting in test failures in e.g., wal_manager_test.
- Various other test failures.
In a followup PR I hope to modify CircleCI's config.yml to invoke all RocksDB unit tests in Windows CMake builds with CTest, instead of the current use of `run_ci_db_test.ps1` which requires individual tests to be specified and is missing many of the existing tests.
Notes from peterd: FloorLog2 is not yet used in production code (it's for something in progress). I also added a few more inexpensive platform-dependent tests to Windows CircleCI runs. And included facebook/folly#1461 as requested
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7439
Reviewed By: jay-zhuang
Differential Revision: D24021563
Pulled By: pdillinger
fbshipit-source-id: 0ec2027c0d6a494d8a0fe38d9667fc2f7e29f7e7
Summary:
Do not assert the number of files after intra-L0 compaction is eligible to run since it could complete (and reduce the number of files) before the assertion executes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7477
Reviewed By: pdillinger
Differential Revision: D24032049
Pulled By: ajkr
fbshipit-source-id: e838ac7a24651ebd643b9e5a9d39d2e789c46929
Summary:
This has been running in production on some key workloads, so
we believe it to be safe and extremely low cost. Nevertheless, I've
added code to ensure that "force_consistency_checks" is mentioned in
any corruption reports so that people know how to disable in case of
false positive corruption reports.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7446
Test Plan:
make check, CI, temporary debug print new message with
./version_builder_test
Reviewed By: ajkr
Differential Revision: D23972101
Pulled By: pdillinger
fbshipit-source-id: 9623e400f3752577c0ecf977e6d0915562cf9968
Summary:
Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data.
By default value is set false to prevent users data to be exposed in the messages.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420
Test Plan:
1. make check -j64
2. Add a new test case
Reviewed By: ajkr
Differential Revision: D23835028
Pulled By: akankshamahajan15
fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1
Summary:
The assertion checks that there is no overlap in sequence numbers across levels in universal compaction. However, this assumption doesn't hold when there is a delete triggered compaction or a trivial move, as they operate on a subset of a level.
Tests -
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7421
Reviewed By: ajkr
Differential Revision: D23872672
Pulled By: anand1976
fbshipit-source-id: c386deab8e01a5746ca996ff1f4ebcae3b15b7d2
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452
Test Plan: See CI tests pass.
Reviewed By: zhichao-cao
Differential Revision: D23979764
fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
Summary:
A generic algorithm in progress depends on a templatized
version of fastrange, so this change generalizes it and renames
it to fit our style guidelines, FastRange32, FastRange64, and now
FastRangeGeneric.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436
Test Plan: added a few more test cases
Reviewed By: jay-zhuang
Differential Revision: D23958153
Pulled By: pdillinger
fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8
Summary:
Possible fix for a TSAN issue reported in EnableFileDeletions.
disable_delete_obsolete_files_ should only be accessed holding the db
mutex, but for logging it was being accessed outside holding the mutex,
now fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7435
Test Plan: existing tests, watch for recurrence
Reviewed By: ltamasi
Differential Revision: D23917578
Pulled By: pdillinger
fbshipit-source-id: 8573025bca3f6fe169b24b87bbfc4ce9667b0482
Summary:
Add a method `CheckWals` in `WalSet` to check the logs on disk. See `CheckWals`'s comments.
This method will be used to check consistency of WALs during DB recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7236
Test Plan: a set of tests are added to wal_edit_test.cc.
Reviewed By: riversand963
Differential Revision: D23036505
Pulled By: cheng-chang
fbshipit-source-id: 5b1d6857ac173429b00f950c32c4a5b8d063a732
Summary:
Fix few test cases and add them in ASSERT_STATUS_CHECKED build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7427
Test Plan:
1. ASSERT_STATUS_CHECKED=1 make -j48 check,
2. travis build for ASSERT_STATUS_CHECKED,
3. Without ASSERT_STATUS_CHECKED: make check -j64, CircleCI build and travis build
Reviewed By: pdillinger
Differential Revision: D23909983
Pulled By: akankshamahajan15
fbshipit-source-id: 42d7e4aea972acb9fcddb7ca73fcb82f93272434
Summary:
Implement a parsing tool io_tracer_parser that takes IO trace file (binary file) with command line argument --io_trace_file and output file with --output_file and dumps the IO trace records in outputfile in human readable form.
Also added unit test cases that generates IO trace records and calls io_tracer_parse to parse those records.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7333
Test Plan:
make check -j64,
Add unit test cases.
Reviewed By: anand1976
Differential Revision: D23772360
Pulled By: akankshamahajan15
fbshipit-source-id: 9c20519c189362e6663352d08863326f3e496271
Summary:
This option is apparently used by some teams within Facebook
(internal ref T75998621)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7431
Test Plan: USE_CLANG=1 make check before (fails) and after
Reviewed By: jay-zhuang
Differential Revision: D23876584
Pulled By: pdillinger
fbshipit-source-id: abb8b67a1f1aac75327944d266e284b2b6727191
Summary:
There are some tricky behaviors related to WAL sync:
- When creating a WAL, the WAL might not be synced, if the WAL directory is not synced, the WAL file's metadata may not even be synced to disk, so during recovery, when listing the WAL directory, the WAL may not even show up.
- During each DB::Write, the WriteOption can control whether the WAL should be synced, so a WAL previously not synced on creation can be synced during Write.
For each `SyncWAL`, we'll track the synced status and the current WAL size. Previously, we only track the WAL size on closing.
During recovery, we check that the on-disk WAL size is >= the last synced size.
So this PR introduces `synced_size` and `closed` to `WalMetadata` for the above design update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7414
Test Plan:
- updated wal_edit_test
- updated version_edit_test
Reviewed By: riversand963
Differential Revision: D23796127
Pulled By: cheng-chang
fbshipit-source-id: 5498ab80f537c48a10157e71a4745716aef5cf30
Summary:
Current implementation holds db mutex while calling
`GetAggregatedIntProperty()`. For property kEstimateTableReadersMem,
this can be expensive, especially if the number of table readers is
high.
We can release and re-acquire db mutex if
property_info.need_out_of_mutex is true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7412
Test Plan:
make check
COMPILE_WITH_ASAN=1 make check
COMPILE_WITH_TSAN=1 make check
Also test internally on a shadow host. Used bpf to verify the
excessively long db mutex holding no longer exists when applications
call GetApproximateMemoryUsageByType().
Reviewed By: jay-zhuang
Differential Revision: D23794824
Pulled By: riversand963
fbshipit-source-id: 6bc02a59fd25613d343a62cf817467c7122c9721
Summary:
Make the test robust to spurious wakeups on condition variable,
and clear sync points to ensure no use-after-free.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7418
Test Plan: repeated runs on updated test, watch CircleCI for recurrence
Reviewed By: jay-zhuang
Differential Revision: D23828823
Pulled By: pdillinger
fbshipit-source-id: af85117d9c02602541a90252840e0e5a6996de5b
Summary:
Fix the flaky test failure in error_handler_fs_test. Add the sync point, solve the dependency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7415
Test Plan: make asan_check, ~/gtest-parallel/gtest-parallel -r 100 ./error_handler_fs_test
Reviewed By: siying
Differential Revision: D23804330
Pulled By: zhichao-cao
fbshipit-source-id: 5175108651f7652e47e15978f2a9c1669ef59d80
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310
Test Plan: adding new unit test, pass make check.
Reviewed By: anand1976
Differential Revision: D23710892
Pulled By: zhichao-cao
fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.
This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.
We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).
We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.
Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`
Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.
Actual file name patterns and priorities, as regexes:
kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
[0-9]+_[0-9]+_[0-9]+[.]sst
kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
[0-9]+_[0-9a-fA-F-]+[.]sst
kUseDbSessionId AND NOT kFlagIncludeFileSize ->
[0-9]+_s[0-9A-Z]{20}[.]sst
kUseDbSessionId AND kFlagIncludeFileSize (default) ->
[0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst
We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.
This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400
Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.
Reviewed By: ajkr
Differential Revision: D23759587
Pulled By: pdillinger
fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
Summary:
`IntraL0CompactionAfterFlushCheckConsistencyFail` was flaky by sometimes failing due to no intra-L0 compactions happening. I was able to repro it by putting a `sleep(1)` in the compaction thread before it grabs the lock and picks a compaction. This also showed other intra-L0 tests are affected too, although some of them exhibit hanging forever rather than failing.
The problem was that all the flushes/ingestions could finish before any compaction got picked, so it would end up simply picking all the files that the test generates for L0->L1. But, these tests intend only the first few files to be picked for L0->L1, and the subsequent files to be picked for intra-L0. This PR adjusts the sync points of all the intra-L0 tests to enforce this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7382
Test Plan: run all the `db_compaction_test`s with and without the artificial `sleep()`
Reviewed By: jay-zhuang
Differential Revision: D23684985
Pulled By: ajkr
fbshipit-source-id: 6508399030dddec7738e9853a7b3dc53ef77a584
Summary:
The patch adds support for extracting large values into blob files when
performing a flush during recovery (when `avoid_flush_during_recovery` is
`false`). Blob files are built and added to the `Version` similarly to flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7388
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23709912
Pulled By: ltamasi
fbshipit-source-id: ce48b4227849cf25429ae98574e72b0e1cb9c67d
Summary:
Cleaned up the public API to use the EncryptedEnv. This change will allow providers to be developed and added to the system easier in the future. It will also allow better integration in the future with the OPTIONS file.
- The internal classes were moved out of the public API into an internal "env_encryption_ctr.h" header. Short-cut constructors were added to provide the original API functionality.
- The APIs to the constructors were changed to take shared_ptr, rather than raw pointers or references to allow better memory management and alternative implementations.
- CreateFromString methods were added to allow future expansion to other provider and cipher implementations through a standard API.
Additionally, there was a code duplication in the NewXXXFile methods. This common code was moved under a templatized function.
A first-pass at structuring the code was made to potentially allow multiple EncryptionProviders in a single EncryptedEnv. The idea was that different providers may use different cipher keys or different versions/algorithms. The EncryptedEnv should have some means of picking different providers based on information. The groundwork was started for this (the use of the provider_ member variable was localized) but the work has not been completed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7279
Reviewed By: jay-zhuang
Differential Revision: D23709440
Pulled By: zhichao-cao
fbshipit-source-id: 0e845fff0e03a52603eb9672b4ade32d063ff2f2
Summary:
The patch adds support for writing blob files during flush by integrating
`BlobFileBuilder` with the flush logic, most importantly, `BuildTable` and
`CompactionIterator`. If `enable_blob_files` is set, large values are extracted
to blob files and replaced with references. The resulting blob files are then
logged to the MANIFEST as part of the flush job's `VersionEdit` and
added to the `Version`, similarly to table files. Errors related to writing
blob files fail the flush, and any blob files written by such jobs are immediately
deleted (again, similarly to how SST files are handled). In addition, the patch
extends the logging and statistics around flushes to account for the presence
of blob files (e.g. `InternalStats::CompactionStats::bytes_written`, which is
used for calculating write amplification, now considers the blob files as well).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7345
Test Plan: Tested using `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D23506369
Pulled By: ltamasi
fbshipit-source-id: 646885f22dfbe063f650d38a1fedc132f499a159
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
In a distributed file system, directory ownership is enforced by fencing
off the previous owner once they've been preempted by a new owner. This
PR adds a IOStatus subcode for ```StatusCode::IOError``` to indicate this.
Once this error is returned for a file write, the DB is put in read-only
mode and not allowed to resume in read-write mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7374
Test Plan: Add new unit tests in ```error_handler_fs_test```
Reviewed By: riversand963
Differential Revision: D23687777
Pulled By: anand1976
fbshipit-source-id: bef948642089dc0af399057864d9a8ca339e8b2f
Summary:
During bottommost compaction, RocksDB cannot simply drop a tombstone if
this tombstone is not in the earliest snapshot. The current behavior is: RocksDB
skips other internal keys (of the same user key) in the same snapshot range. In
the meantime, RocksDB should check for the `shutting_down` flag. Otherwise, it
is possible for a bottommost compaction that has already started running to take
a long time to finish, even if the application has tried to cancel all background jobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7356
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D23663241
Pulled By: riversand963
fbshipit-source-id: 25f8e9b51bc3bfa3353cdf87557800f9d90ee0b5
Summary:
We've seen some segfaults in db_write_test, with at least one
suggesting corruption of a write group linked list. Adding an assertion
to have this fail in a more specific way if that is the broken
invariant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7375
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D23638477
Pulled By: pdillinger
fbshipit-source-id: a76fd677cad60a3a516bd363947bfd9ce418edc1
Summary:
https://github.com/facebook/rocksdb/issues/3341 guaranteed that upon return of `GetSortedWalFiles` after
`DisableFileDeletions`, all pending purges of previously obsolete WAL
files will have finished. However, the addition of
avoid_unnecessary_blocking_io in https://github.com/facebook/rocksdb/issues/5043 opened a hole in the code making
that assurance, which can lead to files to be copied for checkpoint or
backup going missing before being copied, with that option enabled.
This change patches the hole.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7369
Test Plan:
apparent fix to backups in crash test observed. Will work
on a unit test for another commit
Reviewed By: ajkr
Differential Revision: D23620258
Pulled By: pdillinger
fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614
Summary:
When multiple background jobs are generating blob files in parallel, it is actually
possible for a blob file to be added with a file number that is lower than the
highest one in the base version. (This is a harmless race condition.) The patch
fixes the handling of this case and adds a unit test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7349
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23542453
Pulled By: ltamasi
fbshipit-source-id: 4ff6f3654bc58c391d10b9870e1cc40b5e3fa8e4
Summary:
Replace FSRandomRWFile pointer with FSRandomRWFilePtr object in the rocksdb internal code.
This new object wraps FSRandomRWFile pointer.
Objective: If tracing is enabled, FSRandomRWFile object returns FSRandomRWFileTracingWrapper pointer that includes all necessary information in IORecord and calls underlying FileSystem and invokes IOTracer to dump that record in a binary file. If tracing is disabled then, underlying FileSystem pointer is returned directly.
FSRandomRWFilePtr wrapper class is added to bypass the FSRandomRWFileWrapper when
tracing is disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7198
Test Plan: make check -j64
Reviewed By: anand1976
Differential Revision: D23421116
Pulled By: akankshamahajan15
fbshipit-source-id: 8a5ba0e7d9c1ba34c3a6f29829b107c5f09ab6a3
Summary:
Replace FSWritableFile pointer with FSWritableFilePtr
object in WritableFileWriter.
This new object wraps FSWritableFile pointer.
Objective: If tracing is enabled, FSWritableFile Ptr returns
FSWritableFileTracingWrapper pointer that includes all necessary
information in IORecord and calls underlying FileSystem and invokes
IOTracer to dump that record in a binary file. If tracing is disabled
then, underlying FileSystem pointer is returned directly.
FSWritableFilePtr wrapper class is added to bypass the
FSWritableFileWrapper when
tracing is disabled.
Test Plan: make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193
Reviewed By: anand1976
Differential Revision: D23355915
Pulled By: akankshamahajan15
fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf
Summary:
The patch cleans up a few things in `CompactionJob::SubcompactionState`:
* Instead of using both the member initializer list and in-class initializers (and
sometimes both at the same time for the same member), the struct now uniformly
uses the latter to initialize integer members.
* The default parameter value for the constructor parameter `size` is removed.
* The explicitly deleted copy operations are removed, since they are implicitly deleted
anyways because of the `unique_ptr` members.
* The handwritten move operations, which did not move the member `c_iter` and
were not declared `nothrow`, are removed. Note that with the user-declared copy
operations gone (see the previous item), we can rely on the compiler to (correctly)
generate these methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7322
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D23382408
Pulled By: ltamasi
fbshipit-source-id: a4ae5af150161c50ff7bdc07fa145482d0150bfe
Summary:
Currently, application may pass a statistics object to db but later
wants to reduce stats tracking overhead by setting stats level to
kExceptHistogramOrTimers (the current lowest level). Tickers will still
be incremented, causing up to 1% CPU. We can add a new lowest stats
level `kExceptTickers` to disable ticker incrementing as well, thus
reducing CPU cycles spent on tickers.
Test Plan (devserver):
```
make check
make clean
DEBUG_LEVEL=0 make db_bench
./db_bench -perf_level=1 -stats_level=0 -statistics -benchmarks=fillseq,readrandom -duration=120
```
Measure CPU util (%) before and after change:
CPU util by rocksdb::RecordTick: 1.1 vs (<0.1)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7329
Reviewed By: pdillinger
Differential Revision: D23434014
Pulled By: riversand963
fbshipit-source-id: 72ff0f02a192ac476d4b0044b9f37fd4a22ff0d4
Summary:
This PR creates `rocksdb_open_column_families_with_ttl` which allows C API users to open a DBWithTLL with column families.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7314
Reviewed By: cheng-chang
Differential Revision: D23430287
Pulled By: ajkr
fbshipit-source-id: 307aa21d170d1402653263a91f6f832ef76afba0
Summary:
- Closes https://github.com/facebook/rocksdb/issues/6490
- Currently MERGEs are converted to PUTs at bottom or compaction has reached the beginning of the key, this can wrongly cover a PUT future base case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166
Test Plan:
- Automated: `make all check`
- Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then run ingesting external files to make sure the base case is probably compacted with existing Merges.
Reviewed By: cheng-chang
Differential Revision: D23325425
Pulled By: ajkr
fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68
Summary:
L0 score is based on size target and number of files. The size target
used is `max_bytes_for_level_base`. However, the base level's size can
dynamically expand in write burst mode. In fact, it can expand so much
that L0->Lbase becomes the highest fanout in target sizes. This doesn't
make sense from an efficiency perspective, so this PR bounds the
L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based
on file count remains unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325
Test Plan:
contrived benchmark that exhibits the problem:
```
$ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000
```
Results:
- "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark
- "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction
Branch | Burst W-Amp | Total W-Amp | fillrandom (MB/s)
-- | -- | -- | --
master | 20.2 | 21.5 | 4.7
dynamic-l0-score | 12.6 | 14.1 | 7.2
Reviewed By: siying
Differential Revision: D23412935
Pulled By: ajkr
fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e
Summary:
The patch adds a log message to `BlobFileBuilder` that is logged upon
generating a blob file, similarly to how we log the generation of table files
during flush and compaction. The log message contains the column family
name, job id, blob file number, and the number and total size of blobs in
the new file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7324
Test Plan: Ran `make check` and checked the actual log messages using a custom `db_bench`.
Reviewed By: riversand963
Differential Revision: D23402229
Pulled By: ltamasi
fbshipit-source-id: ca42beb4db284b783d1eb2651f321032a45d0c5f
Summary:
Add a unit test case to check memory usage when
max_write_buffer_size_to_maintain is set if flushed immutable memtables are
trimmed timely or not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7311
Test Plan: Compared the results with before bug fix.
Reviewed By: ltamasi
Differential Revision: D23321702
Pulled By: akankshamahajan15
fbshipit-source-id: da04ee21137d641a07fd499a9e2749eb036fcb1e
Summary:
The patch adds a class called `BlobFileBuilder` that can be used to build
and cut blob files in background jobs (flushes/compactions). The class
enforces a value size threshold (`min_blob_size`; smaller blobs will be inlined
in the LSM tree itself), and supports specifying a blob file size limit (`blob_file_size`),
as well as compression (`blob_compression_type`) and checksums for blob files.
It also keeps track of the generated blob files and their associated `BlobFileAddition`
metadata, which can be applied as part of the background job's `VersionEdit`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7306
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23298817
Pulled By: ltamasi
fbshipit-source-id: 38f35d81dab1ba81f15236240612ec173d7f21b5
Summary:
Replace FSRandomAccessFile pointer with FSRandomAccessFilePtr
object in RandomAccessFileReader.
This new object wraps FSRandomAccessFile pointer.
Objective: If tracing is enabled, FSRandomAccessFile Ptr returns
FSRandomAccessFileTracingWrapper pointer that includes all necessary
information in IORecord and calls underlying FileSystem and invokes
IOTracer to dump that record in a binary file. If tracing is disabled
then, underlying FileSystem pointer is returned directly.
FSRandomAccessFilePtr wrapper class is added to bypass the FSRandomAccessFileWrapper when
tracing is disabled.
Test Plan: make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7192
Reviewed By: anand1976
Differential Revision: D23356867
Pulled By: akankshamahajan15
fbshipit-source-id: 48f31168166a17a7444b40be44a9a9d4a5c7182c
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.
This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.
Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309
Test Plan: updated unit test, before and after
Reviewed By: ajkr
Differential Revision: D23311994
Pulled By: pdillinger
fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
Summary:
Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297
Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it.
Reviewed By: anand1976
Differential Revision: D23257950
fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
Summary:
DBBasicTest.CompactBetweenSnapshots can time-out in some slow-I/O hosts. Parameterize it so that single test runs shorter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7301
Test Plan: Run the test and see see different runs are of different configerations in a hacky way.
Reviewed By: ltamasi
Differential Revision: D23277733
fbshipit-source-id: 1f717b4131322d175abf9e211131fe7e9b1ef758
Summary:
When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108
Test Plan: make check, listener_test.
Reviewed By: ajkr
Differential Revision: D22470240
Pulled By: zhichao-cao
fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
Summary:
After releasing a snapshot, it checks whether it is suitable to trigger bottom compactions.
When disabling auto compactions, it may still schedule compaction when releasing a snapshot. Whereas no compaction job will be actually handled, so the state of LSM is not changed and compaction will be triggered again and again every time releasing a snapshot.
Too frequent compactions lead to high CPU usage and high db_mutex lock contention which affects foreground write duration finally.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7267
Test Plan:
- make check
- manual test
Reviewed By: akankshamahajan15
Differential Revision: D23252880
Pulled By: ajkr
fbshipit-source-id: 4431e071a35d9912a2a3592875db27bae521434b
Summary:
More tests now pass. When in doubt, I added a TODO comment to check what should happen with an ignored error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7305
Reviewed By: akankshamahajan15
Differential Revision: D23301262
Pulled By: ajkr
fbshipit-source-id: 5f120edc7393560aefc0633250277bbc7e8de9e6