Summary:
Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles.
When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed.
In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749
Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.
Reviewed By: jay-zhuang
Differential Revision: D30735261
fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
Summary:
currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9026
Reviewed By: ajkr
Differential Revision: D31668241
Pulled By: jay-zhuang
fbshipit-source-id: c02f6c4a5df9fbf0b7510036594811152e8738af
Summary:
This PR fix wrong ticker `WRITE_WITH_WAL`.
`RecordTick(WRITE_WITH_WAL)` will be called later in `WriteToWAL` and `ConcurrentWriteToWAL`.
Fixes:
1. Delete these two extra `RecordTick(WRITE_WITH_WAL)`
2. Fix corresponding test case
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9064
Reviewed By: ajkr
Differential Revision: D31944459
Pulled By: riversand963
fbshipit-source-id: f1aa8d2a4320456bc357bc5b0902032f7dcad086
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061
In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:
```
earliest_snap < earliest_write_conflict_snap
```
If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.
Reviewed By: ltamasi
Differential Revision: D31813017
fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.
Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument
This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.
DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.
Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.
### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
As you can see, especially once warmed up, xxh3 is fastest.
### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
Test
for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
Results (ops/sec)
for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3
Reviewed By: mrambacher
Differential Revision: D31905249
Pulled By: pdillinger
fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.
Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.
The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.
Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.
Thanks to siying for finding the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
Reviewed By: siying
Differential Revision: D31921908
Pulled By: ajkr
fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.
When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
Summary:
This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057
Test Plan: new unit test
Reviewed By: zhichao-cao
Differential Revision: D31781495
Pulled By: ajkr
fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b
Summary:
This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty.
In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control.
Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655
Test Plan: Add unit tests to the functionality.
Reviewed By: ajkr
Differential Revision: D31787034
fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
Summary:
Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By set it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050
Test Plan: added new tests
Reviewed By: anand1976
Differential Revision: D31744769
Pulled By: zhichao-cao
fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6
Summary:
... by bypassing tracking of last_key in BlockBuilder when
last_key is already known (for BlockBasedTableBuilder::data_block).
I tried extracting a base class of BlockBuilder without the last_key
tracking at all, but that became complicated by NewFlushBlockPolicy() in
the public API referencing BlockBuilder, which would need to be the base
class, and I don't want to replace nearly all the internal references to
BlockBuilder.
Possible follow-up:
* Investigate / consider using AddWithLastKey in more places
This improvement should stack with https://github.com/facebook/rocksdb/issues/9039
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9040
Test Plan:
TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000
Compiled with DEBUG_LEVEL=0
Test vs. control runs simulaneous for better accuracy, units = ops/sec
Run 1: 278929 vs. 267799 (+4.2%)
Run 2: 281836 vs. 267432 (+5.4%)
Run 3: 278279 vs. 270454 (+2.9%)
(This benchmark is chosen to have detectable signal-to-noise, not to
represent expected improvement percent on real workloads.)
Reviewed By: mrambacher
Differential Revision: D31706033
Pulled By: pdillinger
fbshipit-source-id: 8a50fe6fefdd67b6d7665ffa687bbdcf5ad0d5ec
Summary:
Implementation of https://github.com/facebook/rocksdb/issues/8221, plus/including extension of Java options API to allow the get() of options from RocksDB. The extension allows more comprehensive testing of options at the Java side, by validating that the options are set at the C++ side.
Variations on methods:
MutableColumnFamilyOptions.MutableColumnFamilyOptionsBuilder getOptions()
MutableDBOptions.MutableDBOptionsBuilder getDBOptions()
retrieve the options via RocksDB C++ interfaces, and parse the resulting string into one of the Java-style option objects.
This necessitated generalising the parsing of option strings in Java, which now parses the full range of option strings returned by the C++ interface, rather than a useful subset. This necessitates the list-separator being changed to :(colon) from , (comma).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8999
Reviewed By: jay-zhuang
Differential Revision: D31655487
Pulled By: ltamasi
fbshipit-source-id: c38e98145c81c61dc38238b0df580db176ce4efd
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).
Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990
Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D31582865
Pulled By: pdillinger
fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
Summary:
If `DB::Close()` is called in multi-thread env, the resource
could be double released, which causes exception or assert.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970
Test Plan:
Test with multi-thread benchmark, with each thread try to
close the DB at the end.
Reviewed By: pdillinger
Differential Revision: D31242042
Pulled By: jay-zhuang
fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.
This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.
Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.
Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)
Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968
Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).
Reviewed By: siying
Differential Revision: D31242045
Pulled By: pdillinger
fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
Summary:
The code in `IngestExternalFiles()` that bumps the DB's sequence number
depending on what seqnos were assigned to the files has 3 bugs:
1) There is an assertion that the sequence number is increased in all the
affected column families, but this is unnecessary, it is fine if some files can
stick to a lower sequence number. It is very easy to hit the assertion: it is
sufficient to insert 2 files in 2 CFs, one which overlaps the CF and one that
doesn't (for example the CF is empty). The line added in the
`IngestFilesIntoMultipleColumnFamilies_Success` test makes the assertion fail.
2) SetLastSequence() is called with the sum of all the bumps across CFs, but we
should take the maximum instead, as all CFs start with the current seqno and bump
it independently.
3) The code above is accidentally under a `#ifndef NDEBUG`, so it doesn't run in
optimized builds, so some files may be assigned seqnos from the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9005
Test Plan:
Added line in `IngestFilesIntoMultipleColumnFamilies_Success` that
triggers the assertion, verified that the test (and all the others) pass after
the fix.
Reviewed By: ajkr
Differential Revision: D31597892
Pulled By: ot
fbshipit-source-id: c2d3237f90290df1178736ace8653a9623f5a770
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.
The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995
Test Plan:
- Verified it fixes the following failure:
```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```
- `make check -j48`
Reviewed By: ltamasi
Differential Revision: D31495388
Pulled By: ajkr
fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
Summary:
Add the file temperature to `IngestExternalFileArg` such that when SST files are ingested, user is able to assign the temperature to each SST file. If the temperature vector is empty or its size does not match the file name vector size, all ingested SST files will be assigned with `Temperature::unKnown`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8949
Test Plan: add the new test and make check
Reviewed By: siying
Differential Revision: D31127852
Pulled By: zhichao-cao
fbshipit-source-id: 141a81f0f7b473d88f4ab0cb2a21a114cbc6f83c
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8991
Test Plan: the new test hangs forever without this fix and passes with this fix.
Reviewed By: hx235
Differential Revision: D31456419
Pulled By: ajkr
fbshipit-source-id: a82c0e5560b6e6153089dccd8e46163c61b07bff
Summary:
Background: Cache warming up will cause potential read performance degradation due to reading blocks from storage to the block cache. Since in production, the workload and access pattern to a certain DB is stable, it is a potential solution to dump out the blocks belonging to a certain DB to persist storage (e.g., to a file) and bulk-load the blocks to Secondary cache before the DB is relaunched. For example, when migrating a DB form host A to host B, it will take a short period of time, the access pattern to blocks in the block cache will not change much. It is efficient to dump out the blocks of certain DB, migrate to the destination host and insert them to the Secondary cache before we relaunch the DB.
Design: we introduce the interface of CacheDumpWriter and CacheDumpRead for user to store the blocks dumped out from block cache. RocksDB will encode all the information and send the string to the writer. User can implement their own writer it they want. CacheDumper and CacheLoad are introduced to save the blocks and load the blocks respectively.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8912
Test Plan: add new tests to lru_cache_test and pass make check.
Reviewed By: pdillinger
Differential Revision: D31452871
Pulled By: zhichao-cao
fbshipit-source-id: 11ab4f5d03e383f476947116361d54188d36ec48
Summary:
The ldb list_live_files_metadata command does not print any information about blob files currently. We would like to add this functionality. Note that list_live_files_metadata has two different modes of operation: the one shown above, which shows the LSM tree structure, and another one, which can be enabled using the flag --sort_by_filename and simply lists the files in numerical order regardless of level. We would like to show blob files in both modes.
Changes:
1. Using GetAllColumnFamilyMetaData API instead of GetLiveFilesMetaData API for fetching live files data.
Testing:
1. Created a sample rocksdb instance using dbbench command (this creates both SST and blob files)
2. Checked if the blob files are listed or not by using ldb commands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8976
Reviewed By: ltamasi
Differential Revision: D31316061
Pulled By: pradeepambati
fbshipit-source-id: d15cdea192febf7a45f28deee2ba40615d3d84ab
Summary:
If WAL dir is different from the DB dir, we should still honor the SstFileManager deletion rate limit for SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8967
Test Plan: Add a new unit test in db_sst_test
Reviewed By: pdillinger
Differential Revision: D31220116
Pulled By: anand1976
fbshipit-source-id: bcde8a53a7d728e15e597fb5d07ee86c1b38bd28
Summary:
Context:
Exposing the level of the sst file (i.e, table) where it is created in `TablePropertiesCollectorFactory::Context` allows users of `TablePropertiesCollectorFactory` to customize some implementation details of `TablePropertiesCollectorFactory` and `TablePropertiesCollector` based on the level of creation. For example, `TablePropertiesCollector::NeedCompact()` can return different values based on level of creation.
- Declared an extra field `level_at_creation` in `TablePropertiesCollectorFactory::Context`
- Allowed `level_at_creation` to be passed in as an argument in `IntTblPropCollectorFactory::CreateIntTblPropCollector()` and `UserKeyTablePropertiesCollectorFactory::CreateIntTblPropCollector()`, the latter of which is an internal wrapper of user's passed-in `TablePropertiesCollectorFactory::CreateTablePropertiesCollector()` used in table-building process
- Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` passed into both `BlockBasedTableBuilder` and `PlainTableBuilder`
- `PlainTableBuilder` previously did not capture `level_at_creation` from `TableBuilderOptions` in `PlainTableFactory`. In order for it to call the method with this parameter, this PR also made `PlainTableBuilder` capture `level_at_creation` as a required parameter
- Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` its overridden functions in its derived classes, including `RegularKeysStartWithAFactory::CreateIntTblPropCollector()` in `table_properties_collector_test.cc`, `SstFileWriterPropertiesCollectorFactory::CreateIntTblPropCollector()` in `sst_file_writer_collectors.h`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8919
Test Plan:
- Passed the added assertion for `context.level_at_creation`
- Passed existing tests
- Run `Make` to make sure adding a required parameter to `PlainTableBuilder`'s constructor does not break anything
Reviewed By: anand1976
Differential Revision: D30951729
Pulled By: hx235
fbshipit-source-id: c4a0173b0d9344a4cf47e1b987d759c1c73cb474
Summary:
Add a paranoid check where in case FileSystem layer doesn't fill the buffer but returns succeed, checksum is unlikely to match even if buffer contains a previous block. The byte modified is not useful anyway, so it isn't expect to change any behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8955
Test Plan: See existing CI to pass.
Reviewed By: pdillinger
Differential Revision: D31183966
fbshipit-source-id: dcc4de429e18131873f783b90d3be55d7eb44a1f
Summary:
Made SliceTransform into a Customizable class.
Would be nice to write a test that stored and used a custom transform in an SST table.
There are a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest that run the same with or without a SliceTransform/PrefixFilter. Is this expected?
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641
Reviewed By: zhichao-cao
Differential Revision: D31142793
Pulled By: mrambacher
fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1
Summary:
`RandomAccessFileReader::MultiRead()` tries to merge requests in direct IO, assuming input IO requests are
sorted by offsets.
Add a test in direct IO mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8953
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D31183546
Pulled By: riversand963
fbshipit-source-id: 5d043ec68e2daa47a3149066150afd41ee3d73e6
Summary:
Context:
After more discussion, a fix in https://github.com/facebook/rocksdb/issues/8938 might turn out to be too restrictive for the case where `GetTotalPendingRequests` might be invoked on RateLimiter classes that does not support the recently added API `RateLimiter::GetTotalPendingRequests` (https://github.com/facebook/rocksdb/issues/8890) due to the `assert(false)` in https://github.com/facebook/rocksdb/issues/8938. Furthermore, sentinel value like `-1` proposed in https://github.com/facebook/rocksdb/issues/8938 is easy to be ignored and unchecked. Therefore we decided to adopt `Status::NotSupported()`, which is also a convention of adding new API to public header in RocksDB.
- Changed return value type of `RateLimiter::GetTotalPendingRequests` in related declaration/definition
- Passed in pointer argument to hold the output instead of returning it as before
- Adapted to the changes above in calling `RateLimiter::GetTotalPendingRequests` in test
- Minor improvement to `TEST_F(RateLimiterTest, GetTotalPendingRequests)`: added failure message for assertion and replaced repetitive statements with a loop
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8950
Reviewed By: ajkr, pdillinger
Differential Revision: D31128450
Pulled By: hx235
fbshipit-source-id: 282ac9c4f3dacaa0aec6d0a993161f77ad47a040
Summary:
Context/Summary:
https://github.com/facebook/rocksdb/pull/8890 added a public API `RateLimiter::GetTotalPendingRequest()` but mistakenly marked it as pure virtual, forcing RateLimiter's derived classes to implement this function and breaking backward compatibility.
This PR makes `RateLimiter::GetTotalPendingRequest()` as non-pure virtual method by providing a trivial implementation in rate_limiter.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8938
Test Plan: Passing existing tests
Reviewed By: pdillinger
Differential Revision: D31100661
Pulled By: hx235
fbshipit-source-id: 06eff1005156a6e5a881e393b2c5b2ad706897d8
Summary:
Updates a few remaining functions that should have been updated
from Status -> IOStatus, and adds to HISTORY for the overall change
including https://github.com/facebook/rocksdb/issues/8820.
This change is for inclusion in version 6.25.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8940
Test Plan: CI
Reviewed By: zhichao-cao
Differential Revision: D31085029
Pulled By: pdillinger
fbshipit-source-id: 91557c6a39ef1d90357d4f4dcd79af0645d87c7b
Summary:
Made SystemClock into a Customizable class, complete with CreateFromString.
Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv).
Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636
Reviewed By: zhichao-cao
Differential Revision: D30483360
Pulled By: mrambacher
fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1
Summary:
In case of IO uring bugs, we need to provide a way for users to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8931
Test Plan: Manually run db_bench with/without the option and verify the behavior
Reviewed By: pdillinger
Differential Revision: D31040252
Pulled By: anand1976
fbshipit-source-id: 56f2537d6ac8488c9e126296d8190ad9e0158f70
Summary:
Add support for fallback to local compaction, the user can
return `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to
run the compaction locally instead of waiting for the remote compaction
result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8709
Test Plan: unittest
Reviewed By: ajkr
Differential Revision: D30560163
Pulled By: jay-zhuang
fbshipit-source-id: 65d8905a4a1bc185a68daa120997f21d3198dbe1
Summary:
Because even 32-bit systems can have large files
This is a "change" that I don't want intermingled with an upcoming refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8926
Test Plan: CI
Reviewed By: zhichao-cao
Differential Revision: D31020974
Pulled By: pdillinger
fbshipit-source-id: ca9eb4510697df6f1f55e37b37730b88b1809a92
Summary:
This keeps the implementations/API backward compatible. Implementations of Statistics will need to override this method (and be registered with the ObjectRegistry) in order to be created via CreateFromString.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8918
Reviewed By: pdillinger
Differential Revision: D30958916
Pulled By: mrambacher
fbshipit-source-id: 75b99a84e9e11fda2a9e8eff9ee1ef69a17517b2
Summary:
1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files.
2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
3. Test OnFile*Finish operations notifications with Blob Files.
4. Log the blob file creation/deletion events through EventLogger in Log file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675
Test Plan: Add new unit tests in listener_test
Reviewed By: ltamasi
Differential Revision: D30412613
Pulled By: akankshamahajan15
fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07
Summary:
Add compaction priority information in RemoteCompaction, which
can be used to schedule high priority job first.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8707
Test Plan: unittest
Reviewed By: ajkr
Differential Revision: D30548401
Pulled By: jay-zhuang
fbshipit-source-id: b30446511fb31b4583c49edd8565d496cf013a34
Summary:
kFlushOnly currently means "always" except in the case of
remote compaction. This makes it flushes only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8750
Test Plan: test updated
Reviewed By: akankshamahajan15
Differential Revision: D30968034
Pulled By: pdillinger
fbshipit-source-id: 5dbd24dde18852a0e937a540995fba9bfbe89037
Summary:
Potential bugs in the IO uring implementation can cause bad data to be returned in the completion queue. Add some checks in the PosixRandomAccessFile::MultiRead completion handling code to catch such errors and fail the entire MultiRead. Also log some diagnostic messages and stack trace.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8894
Reviewed By: siying, pdillinger
Differential Revision: D30826982
Pulled By: anand1976
fbshipit-source-id: af91815ac760e095d6cc0466cf8bd5c10167fd15
Summary:
In the past, we unnecessarily requires all keys in the same write batch
to be from column families whose timestamps' formats are the same for
simplicity. Specifically, we cannot use the same write batch to write to
two column families, one of which enables timestamp while the other
disables it.
The limitation is due to the member `timestamp_size_` that used to exist
in each `WriteBatch` object. We pass a timestamp_size to the constructor
of `WriteBatch`. Therefore, users can simply use the old
`WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while
the internal implementation of `WriteBatch` will take care of memory
allocation for timestamps.
The above is not necessary.
One the one hand, users can set up a memory buffer to store user key and
then contiguously append the timestamp to the user key. Then the user
can pass this buffer to the `WriteBatch::Put(Slice&)` API.
On the other hand, users can set up a SliceParts object which is an
array of Slices and let the last Slice to point to the memory buffer
storing timestamp. Then the user can pass the SliceParts object to the
`WriteBatch::Put(SliceParts&)` API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D30654499
Pulled By: riversand963
fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2
Summary:
Context/Summary:
As users requested, a public API RateLimiter::GetTotalPendingRequests() is added to expose the total number of pending requests for bytes in the rate limiter, which is the size of the request queue of that priority (or of all priorities, if IO_TOTAL is interested) at the time when this API is called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8890
Test Plan:
- Passing added new unit tests
- Passing existing unit tests
Reviewed By: ajkr
Differential Revision: D30815500
Pulled By: hx235
fbshipit-source-id: 2dfa990f651c1c47378b6215c751ad76a5824300
Summary:
Context:
While all the non-trivial write operations in BackupEngine go through the RateLimiter, reads currently do not. In general, this is not a huge issue because (especially since some I/O efficiency fixes) reads in BackupEngine are mostly limited by corresponding writes, for both backup and restore. But in principle we should charge the RateLimiter for reads as well.
- Charged read operations in `BackupEngineImpl::CopyOrCreateFile`, `BackupEngineImpl::ReadFileAndComputeChecksum`, `BackupEngineImpl::BackupMeta::LoadFromFile` and `BackupEngineImpl::GetFileDbIdentities`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8722
Test Plan:
- Passed existing tests
- Passed added unit tests
Reviewed By: pdillinger
Differential Revision: D30610464
Pulled By: hx235
fbshipit-source-id: 9b08c9387159a5385c8d390d6666377a0d0117e5
Summary:
RocksDB exposes certain internal statistics via the DB property interface.
However, there are currently no properties related to BlobDB.
For starters, we would like to add the following BlobDB properties:
`rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
`rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
`rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions.
`rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
`rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734
Test Plan:
```
➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
[==========] Running 16 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 10 tests from DBBlobBasicTest
[ RUN ] DBBlobBasicTest.GetBlob
[ OK ] DBBlobBasicTest.GetBlob (12 ms)
[ RUN ] DBBlobBasicTest.MultiGetBlobs
[ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
[ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex
[ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
[ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
[ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
[ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
[ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
[ RUN ] DBBlobBasicTest.GenerateIOTracing
[ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
[ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
[ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
[ RUN ] DBBlobBasicTest.GetMergeBlobWithPut
[ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
[ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut
[ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
[ RUN ] DBBlobBasicTest.BlobDBProperties
[ OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
[----------] 10 tests from DBBlobBasicTest (124 ms total)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)
[----------] Global test environment tear-down
[==========] 16 tests from 2 test cases ran. (2190 ms total)
[ PASSED ] 16 tests.
```
Reviewed By: ltamasi
Differential Revision: D30690849
Pulled By: Zhiyi-Zhang
fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
Summary:
Failure to create the lock file (e.g. out of space) could
prevent future LockFile attempts in the same process on the same file
from succeeding.
Also added DEBUG code to fail assertion if PosixFileLock is destroyed
without using UnlockFile (which is a risk because FileLock is in the
public API with virtual destructor).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8747
Test Plan: test added
Reviewed By: ajkr
Differential Revision: D30732543
Pulled By: pdillinger
fbshipit-source-id: 4c30a959566d91f778d6fad3fbbd5f3941b097c1
Summary:
Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8819
Reviewed By: pdillinger
Differential Revision: D30779238
Pulled By: ajkr
fbshipit-source-id: 75ccafc355f90906df5cf80367f7245b985772d8
Summary:
- Clarified some comments on compatibility for adding new ticker stats
- Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741
Test Plan: new unit test
Reviewed By: zhichao-cao
Differential Revision: D30708578
Pulled By: ajkr
fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736
Test Plan: existing tests
Reviewed By: mrambacher
Differential Revision: D30700039
Pulled By: pdillinger
fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
Summary:
If RateLimiter burst bytes changes during concurrent Restore
operations
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8732
Test Plan: updated unit test fails with TSAN before change, passes after
Reviewed By: ajkr
Differential Revision: D30683879
Pulled By: pdillinger
fbshipit-source-id: d0ddb3587ade91ee2a4d926b475acf7781b03086
Summary:
In ```LRUCacheShard::Promote```, a reference is released outside the LRU mutex. Fix the race condition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8717
Reviewed By: zhichao-cao
Differential Revision: D30649206
Pulled By: anand1976
fbshipit-source-id: 09c0af05b2294a7fe2c02876a61b0bad6e3ada61
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.
This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.
Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
* Lots of info from default Env like time, process id, and hostname.
* std::random_device
* port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.
DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.
GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.
Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708
Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D30563780
Pulled By: pdillinger
fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
Summary:
DumpStats() iterates through the ColumnFamilySet. There is a potential
race condition because it does Ref the cfd, and the cfd could get
destroyed during the iteration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D30580199
Pulled By: anand1976
fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850
Summary:
Recent refactor of `ReactiveVersionSet::ReadAndApply()` uses
`ManifestTailer` whose `Iterate()` method can cause the db's
`last_sequence_` to go backward. Consequently, read requests can see
out-dated data. For example, latest changes to the primary will not be
seen on the secondary even after a `TryCatchUpWithPrimary()` if no new
write batches are read from the WALs and no new MANIFEST entries are
read from the MANIFEST.
Fix the bug so that `VersionEditHandler::CheckIterationResult` will
never decrease `last_sequence_`, `last_allocated_sequence_` and
`last_published_sequence_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8653
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D30272084
Pulled By: riversand963
fbshipit-source-id: c6a49c534b2509b93ef62d8936ed0acd5b860eaa
Summary:
Currently, we only provide job_id in RemoteCompaction APIs, the
main problem of `job_id` is it cannot uniquely identify a compaction job
between DB instances or between sessions.
Providing DB and session id to the user, which will make building cross
DB compaction service easier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680
Test Plan: unittest
Reviewed By: ajkr
Differential Revision: D30444859
Pulled By: jay-zhuang
fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.
So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)
I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.
C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.
BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679
Test Plan: new + updated tests, including crash test
Reviewed By: jay-zhuang
Differential Revision: D30445797
Pulled By: pdillinger
fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
Summary:
Pass BlobFileCompletionCallback in case of atomic flush and
compaction job which is currently nullptr(default parameter).
BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to
SstFileManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681
Test Plan: CircleCI jobs
Reviewed By: ltamasi
Differential Revision: D30445998
Pulled By: akankshamahajan15
fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021
Summary:
- Fix issue with OptionType::Vector when the nested item is a Customizable with no names
- Fix issue with OptionType::Vector to appropriately wrap the elements in a Vector;
- Fix an issue with nested Customizable object with a null immutable object still appearing in the mutable options;
- Fix/Add tests for null/empty customizable objects
- Move the RegisterTestObjects from customizable_test into testutil.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566
Reviewed By: zhichao-cao
Differential Revision: D30303724
Pulled By: mrambacher
fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.
New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".
`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657
Reviewed By: ajkr
Differential Revision: D30290216
Pulled By: autopear
fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
Summary:
In debug mode, we are seeing assertion failure as follows
```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```
It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.
In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.
Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.
Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```
It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D30145645
Pulled By: riversand963
fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81
Summary:
The patch adds statistics support to the integrated BlobDB implementation,
namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and
`BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms
`BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like
`BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`,
`BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.)
Note that the vast majority of the old BlobDB's tickers/histograms are not
really applicable to the new implementation, since they e.g. pertain to calling
dedicated BlobDB APIs (which the integrated BlobDB does not have) or are
tied to the legacy BlobDB's design of writing blob files synchronously when
a write API is called. Such statistics are marked "legacy BlobDB only" in
`statistics.h`.
Fixes https://github.com/facebook/rocksdb/issues/8645 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667
Test Plan: Ran `make check` and tested the new statistics using `db_bench`.
Reviewed By: riversand963
Differential Revision: D30356884
Pulled By: ltamasi
fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4
Summary:
Add a stat for secondary cache hits. The ```Cache::Lookup``` API had an unused ```stats``` parameter. This PR uses that to pass the pointer to a ```Statistics``` object that ```LRUCache``` uses to record the stat.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666
Test Plan: Update a unit test in lru_cache_test
Reviewed By: zhichao-cao
Differential Revision: D30353816
Pulled By: anand1976
fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77
Summary:
- Remove extra `;` in trace_record.h
- Remove some unnecessary `assert` in trace_record_handler.cc
- Initialize `env_` after` exec_handler_` in `ReplayerImpl` to let db be asserted in creating the handler before getting `db->GetEnv()`.
- Update history to include the new `TraceReader::Reset()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652
Reviewed By: ajkr
Differential Revision: D30276872
Pulled By: autopear
fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.
User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
Reviewed By: ajkr
Differential Revision: D30266329
Pulled By: autopear
fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
Summary:
Add comment for `options.allow_fallocate` that btrfs
preallocated space are not freed and a suggestion to disable
preallocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8646
Test Plan: No code change
Reviewed By: ajkr
Differential Revision: D30240050
Pulled By: jay-zhuang
fbshipit-source-id: 75b7190bc8276ce8d8ac2d0cb9064b386cbf4768
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310
Test Plan: Add several unit tests.
Reviewed By: ajkr
Differential Revision: D28493792
fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
Summary:
`CompareKeyContext::operator()` on the trunk has a bug: when comparing
column family IDs, `lhs` is used for both sides of the comparison. This
results in the `KeyContext`s getting sorted solely based on key, which
in turn means that keys with the same column family do not necessarily
form a single range in the sorted list. This violates an assumption of the
batched `MultiGet` logic, leading to the same column family
showing up multiple times in the list of `MultiGetColumnFamilyData`.
The end result is the code attempting to check out the thread-local
`SuperVersion` for the same CF multiple times, causing an
assertion violation in debug builds and memory corruption/crash in
release builds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D30169182
Pulled By: ltamasi
fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023
Summary:
Guarantees that if a restore is interrupted, DB::Open will fail. This works by
restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT.
Also makes restore respect BackupEngineOptions::sync (default true). When set,
the restore is guaranteed persisted by the time it returns OK. Also makes the above
atomicity guarantee work in case the interruption is power loss or OS crash (not just
process interruption or crash).
Fixes https://github.com/facebook/rocksdb/issues/8500
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568
Test Plan:
added to backup mini-stress unit test. Passes with
gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp)
Reviewed By: akankshamahajan15
Differential Revision: D29812605
Pulled By: pdillinger
fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e
Summary:
Previously we attempted to rename "LOG" to "LOG.old.*" without checking
its existence first. "LOG" had no reason to exist in a new DB.
Errors in renaming a non-existent "LOG" were swallowed via
`PermitUncheckedError()` so things worked. However the storage service's
error monitoring was detecting all these benign rename failures. So it
is better to fix it. Also with this PR we can now distinguish rename failure
for other reasons and return them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622
Test Plan: new unit test
Reviewed By: akankshamahajan15
Differential Revision: D30115189
Pulled By: ajkr
fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170
Summary:
Insert warm blocks (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561
Test Plan: Added unit test
Reviewed By: anand1976
Differential Revision: D29773411
Pulled By: akankshamahajan15
fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66
Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536
Reviewed By: mrambacher
Differential Revision: D29962120
Pulled By: zhichao-cao
fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f
Summary:
Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.
After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).
Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582
Reviewed By: pdillinger, autopear
Differential Revision: D29881122
Pulled By: mrambacher
fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
Reviewed By: zhichao-cao
Differential Revision: D29901488
Pulled By: mrambacher
fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
Summary:
If the primary's CURRENT file is missing or inaccessible, the secondary should not hang
trying repeatedly to switch to the next MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D27840627
Pulled By: riversand963
fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1
Summary:
Try avoid expensive updating options operation if
`SetDBOptions()` does not change any option value.
Skip updating is not guaranteed, for example, changing `bytes_per_sync`
to `0` may still trigger updating, as the value could be sanitized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518
Test Plan: added unittest
Reviewed By: riversand963
Differential Revision: D29672639
Pulled By: jay-zhuang
fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4
Summary:
This PR is for https://github.com/facebook/rocksdb/issues/8453
We need to update `s = biter.status();` when `biter.status().IsIncomplete()` is true. By doing this, can fix the problem in issue.
Besides, we still need to update `db_statistics` in `get_context.ReportCounters()` before return back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8485
Reviewed By: jay-zhuang
Differential Revision: D29604835
Pulled By: ajkr
fbshipit-source-id: c7f2f1cd058223ce1b507ec05d57cf264b9c9710
Summary:
In ```DBImpl::WriteImpl()```, we call ```PreprocessWrite()``` which, among other things, checks the BG error and returns it set. This return status is later on passed to ```WriteStatusCheck()```, which calls ```SetBGError()```. This results in a spurious call, and info logs, on every user write request. We should avoid passing the ```PreprocessWrite()``` return status to ```WriteStatusCheck()```, as the former would have called ```SetBGError()``` already if it encountered any new errors, such as error when creating a new WAL file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8511
Test Plan: Run existing tests
Reviewed By: zhichao-cao
Differential Revision: D29639917
Pulled By: anand1976
fbshipit-source-id: 19234163969e1645dbeb273712aaf5cd9ea2b182