Commit Graph

1072 Commits

Author SHA1 Message Date
Peter Dillinger
a7d4bea43a Implement XXH3 block checksum type (#9069)
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.

Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument

This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069

Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.

DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.

Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.

 ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)

As you can see, especially once warmed up, xxh3 is fastest.

 ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

Test

    for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done

Results (ops/sec)

    for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done

results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3

Reviewed By: mrambacher

Differential Revision: D31905249

Pulled By: pdillinger

fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
2021-10-28 22:15:17 -07:00
Andrew Kryczka
f24c39ab3d Prevent corruption with parallel manual compactions and change_level == true (#9077)
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.

Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered  (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.

The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.

Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).

The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.

Thanks to siying for finding the bug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077

Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.

Reviewed By: siying

Differential Revision: D31921908

Pulled By: ajkr

fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
2021-10-27 23:08:56 -07:00
Yanqin Jin
f72fd58565 Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.

When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.

After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D31639225

Pulled By: riversand963

fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-20 21:34:47 -07:00
sdong
633f069c29 Update Release Version to 6.26 (#9059)
Summary:
Before cutting release branch 6.26, update version.h and release notes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9059

Reviewed By: ajkr

Differential Revision: D31805126

fbshipit-source-id: ae85ccf06ec756fa21163161f53fd0b728e6e32e
2021-10-20 15:32:01 -07:00
Andrew Kryczka
4217d1bce7 Support GetMapProperty() with "rocksdb.dbstats" (#9057)
Summary:
This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057

Test Plan: new unit test

Reviewed By: zhichao-cao

Differential Revision: D31781495

Pulled By: ajkr

fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b
2021-10-20 13:17:00 -07:00
sdong
c66b4429ff Incremental Space Amp Compactions in Universal Style (#8655)
Summary:
This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty.

In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control.

Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655

Test Plan: Add unit tests to the functionality.

Reviewed By: ajkr

Differential Revision: D31787034

fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 10:04:13 -07:00
Zhichao Cao
6d93b87588 Add lowest_used_cache_tier to ImmutableDBOptions to enable or disable Secondary Cache (#9050)
Summary:
Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By set it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050

Test Plan: added new tests

Reviewed By: anand1976

Differential Revision: D31744769

Pulled By: zhichao-cao

fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6
2021-10-19 15:54:23 -07:00
Jay Zhuang
f20b07cebb Add "Java API Changes" session in HISTORY (#9055)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9055

Reviewed By: ajkr

Differential Revision: D31765398

Pulled By: jay-zhuang

fbshipit-source-id: 77ed67d69415c9fbbfc1132b15310b293e3939c6
2021-10-19 15:23:06 -07:00
Peter Dillinger
b234a3f569 Improve data block construction performance (#9040)
Summary:
... by bypassing tracking of last_key in BlockBuilder when
last_key is already known (for BlockBasedTableBuilder::data_block).

I tried extracting a base class of BlockBuilder without the last_key
tracking at all, but that became complicated by NewFlushBlockPolicy() in
the public API referencing BlockBuilder, which would need to be the base
class, and I don't want to replace nearly all the internal references to
BlockBuilder.

Possible follow-up:
* Investigate / consider using AddWithLastKey in more places

This improvement should stack with https://github.com/facebook/rocksdb/issues/9039

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9040

Test Plan:
TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000
Compiled with DEBUG_LEVEL=0
Test vs. control runs simulaneous for better accuracy, units = ops/sec

Run 1: 278929 vs. 267799 (+4.2%)
Run 2: 281836 vs. 267432 (+5.4%)
Run 3: 278279 vs. 270454 (+2.9%)

(This benchmark is chosen to have detectable signal-to-noise, not to
represent expected improvement percent on real workloads.)

Reviewed By: mrambacher

Differential Revision: D31706033

Pulled By: pdillinger

fbshipit-source-id: 8a50fe6fefdd67b6d7665ffa687bbdcf5ad0d5ec
2021-10-19 12:36:21 -07:00
Alan Paxton
8d615a2b1d New-style blob option bindings, Java option getter and improve/fix option parsing (#8999)
Summary:
Implementation of https://github.com/facebook/rocksdb/issues/8221, plus/including extension of Java options API to allow the get() of options from RocksDB. The extension allows more comprehensive testing of options at the Java side, by validating that the options are set at the C++ side.

Variations on methods:
MutableColumnFamilyOptions.MutableColumnFamilyOptionsBuilder getOptions()
MutableDBOptions.MutableDBOptionsBuilder getDBOptions()

retrieve the options via RocksDB C++ interfaces, and parse the resulting string into one of the Java-style option objects.

This necessitated generalising the parsing of option strings in Java, which now parses the full range of option strings returned by the C++ interface, rather than a useful subset. This necessitates the list-separator being changed to :(colon) from , (comma).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8999

Reviewed By: jay-zhuang

Differential Revision: D31655487

Pulled By: ltamasi

fbshipit-source-id: c38e98145c81c61dc38238b0df580db176ce4efd
2021-10-19 09:21:52 -07:00
Peter Dillinger
ad5325a736 Experimental support for SST unique IDs (#8990)
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).

Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990

Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D31582865

Pulled By: pdillinger

fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
2021-10-18 23:32:01 -07:00
Jay Zhuang
314de7e7de Make DB::Close() thread-safe (#8970)
Summary:
If `DB::Close()` is called in multi-thread env, the resource
could be double released, which causes exception or assert.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970

Test Plan:
Test with multi-thread benchmark, with each thread try to
close the DB at the end.

Reviewed By: pdillinger

Differential Revision: D31242042

Pulled By: jay-zhuang

fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb
2021-10-18 20:32:35 -07:00
Alan Paxton
86cf7266c3 keyMayExist() supports ByteBuffer (#9013)
Summary:
closes https://github.com/facebook/rocksdb/issues/7917

Implemented ByteBuffer API variants of Java keyMayExist() uniformly with and without column families, read options and return data values. Implemented 2 supporting C++ JNI methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9013

Reviewed By: mrambacher

Differential Revision: D31665989

Pulled By: jay-zhuang

fbshipit-source-id: 8adc1730217dba38d6fa7b31d788650a33e28af1
2021-10-18 17:20:07 -07:00
Peter Dillinger
3ffb3baa0b Add (Live)FileStorageInfo API (#8968)
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.

This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.

Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.

Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)

Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968

Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).

Reviewed By: siying

Differential Revision: D31242045

Pulled By: pdillinger

fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
2021-10-16 10:04:32 -07:00
Andrew Kryczka
ffc48b6cad Update HISTORY.md for #9009 (#9036)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9036

Reviewed By: zhichao-cao

Differential Revision: D31640901

Pulled By: ajkr

fbshipit-source-id: 0b1e6e36094a74bb7906af44e29ecbeaa258de58
2021-10-14 09:36:32 -07:00
Giuseppe Ottaviano
4bfd415e34 Fix sequence number bump logic in multi-CF SST ingestion (#9005)
Summary:
The code in `IngestExternalFiles()` that bumps the DB's sequence number
depending on what seqnos were assigned to the files has 3 bugs:

1) There is an assertion that the sequence number is increased in all the
affected column families, but this is unnecessary, it is fine if some files can
stick to a lower sequence number. It is very easy to hit the assertion: it is
sufficient to insert 2 files in 2 CFs, one which overlaps the CF and one that
doesn't (for example the CF is empty). The line added in the
`IngestFilesIntoMultipleColumnFamilies_Success` test makes the assertion fail.

2) SetLastSequence() is called with the sum of all the bumps across CFs, but we
should take the maximum instead, as all CFs start with the current seqno and bump
it independently.

3) The code above is accidentally under a `#ifndef NDEBUG`, so it doesn't run in
optimized builds, so some files may be assigned seqnos from the future.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9005

Test Plan:
Added line in `IngestFilesIntoMultipleColumnFamilies_Success` that
triggers the assertion, verified that the test (and all the others) pass after
the fix.

Reviewed By: ajkr

Differential Revision: D31597892

Pulled By: ot

fbshipit-source-id: c2d3237f90290df1178736ace8653a9623f5a770
2021-10-12 20:39:52 -07:00
Levi Tamasi
7cc52cd8f5 Update HISTORY for PR 8994 (#9017)
Summary:
Also, expand on/clarify a comment in `VersionStorageInfoTest`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9017

Reviewed By: riversand963

Differential Revision: D31566130

Pulled By: ltamasi

fbshipit-source-id: 1d30c7af084c4de7b2030bc6c768838d65746010
2021-10-12 10:19:56 -07:00
Andrew Kryczka
a282eff3d1 Protect existing files in FaultInjectionTest{Env,FS}::ReopenWritableFile() (#8995)
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.

The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995

Test Plan:
- Verified it fixes the following failure:

```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```

- `make check -j48`

Reviewed By: ltamasi

Differential Revision: D31495388

Pulled By: ajkr

fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
2021-10-11 16:23:18 -07:00
Hui Xiao
6c3bf83d6f Update HISTORY.md for #8428 (#9001)
Summary:
Context:
HISTORY.md was not properly updated along with the change in https://github.com/facebook/rocksdb/pull/8428, where we introduced a change of accounting compression dictionary buffering memory and an extra condition of triggering data unbuffering.
Updated HISTORY.md for https://github.com/facebook/rocksdb/pull/8428 in 6.25.0 HISTORY.md section.
Updated blog post https://rocksdb.org/blog/2021/05/31/dictionary-compression.html.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9001

Reviewed By: ajkr

Differential Revision: D31517836

Pulled By: hx235

fbshipit-source-id: 01f6b30de4e1ff6b315aa8221139d9b700c7c629
2021-10-08 17:00:30 -07:00
Zhichao Cao
bcd049cd2d Ingest external SST files with Temperature hints (#8949)
Summary:
Add the file temperature to `IngestExternalFileArg` such that when SST files are ingested, user is able to assign the temperature to each SST file. If the temperature vector is empty or its size does not match the file name vector size, all ingested SST files will be assigned with `Temperature::unKnown`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8949

Test Plan: add the new test and make check

Reviewed By: siying

Differential Revision: D31127852

Pulled By: zhichao-cao

fbshipit-source-id: 141a81f0f7b473d88f4ab0cb2a21a114cbc6f83c
2021-10-08 10:32:24 -07:00
Andrew Kryczka
fcaa7ff638 Cancel manual compactions waiting on automatic compactions to drain (#8991)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8991

Test Plan: the new test hangs forever without this fix and passes with this fix.

Reviewed By: hx235

Differential Revision: D31456419

Pulled By: ajkr

fbshipit-source-id: a82c0e5560b6e6153089dccd8e46163c61b07bff
2021-10-07 15:23:55 -07:00
Zhichao Cao
699f45049d Introduce a mechanism to dump out blocks from block cache and re-insert to secondary cache (#8912)
Summary:
Background: Cache warming up will cause potential read performance degradation due to reading blocks from storage to the block cache. Since in production, the workload and access pattern to a certain DB is stable, it is a potential solution to dump out the blocks belonging to a certain DB to persist storage (e.g., to a file) and bulk-load the blocks to Secondary cache before the DB is relaunched. For example, when migrating a DB form host A to host B, it will take a short period of time, the access pattern to blocks in the block cache will not change much. It is efficient to dump out the blocks of certain DB, migrate to the destination host and insert them to the Secondary cache before we relaunch the DB.

Design: we introduce the interface of CacheDumpWriter and CacheDumpRead for user to store the blocks dumped out from block cache. RocksDB will encode all the information and send the string to the writer. User can implement their own writer it they want. CacheDumper and CacheLoad are introduced to save the blocks and load the blocks respectively.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8912

Test Plan: add new tests to lru_cache_test and pass make check.

Reviewed By: pdillinger

Differential Revision: D31452871

Pulled By: zhichao-cao

fbshipit-source-id: 11ab4f5d03e383f476947116361d54188d36ec48
2021-10-07 11:42:31 -07:00
Pradeep Ambati
e5bfb91d09 List blob files when using command - list_live_files_metadata (#8976)
Summary:
The ldb list_live_files_metadata command does not print any information about blob files currently. We would like to add this functionality. Note that list_live_files_metadata has two different modes of operation: the one shown above, which shows the LSM tree structure, and another one, which can be enabled using the flag --sort_by_filename and simply lists the files in numerical order regardless of level. We would like to show blob files in both modes.

Changes:
1. Using GetAllColumnFamilyMetaData API instead of GetLiveFilesMetaData API for fetching live files data.

Testing:
1. Created a sample rocksdb instance using dbbench command (this creates both SST and blob files)
2. Checked if the blob files are listed or not by using ldb commands.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8976

Reviewed By: ltamasi

Differential Revision: D31316061

Pulled By: pradeepambati

fbshipit-source-id: d15cdea192febf7a45f28deee2ba40615d3d84ab
2021-09-30 15:13:11 -07:00
anand76
532ff334d9 Don't ignore deletion rate limit if WAL dir is different (#8967)
Summary:
If WAL dir is different from the DB dir, we should still honor the SstFileManager deletion rate limit for SST files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8967

Test Plan: Add a new unit test in db_sst_test

Reviewed By: pdillinger

Differential Revision: D31220116

Pulled By: anand1976

fbshipit-source-id: bcde8a53a7d728e15e597fb5d07ee86c1b38bd28
2021-09-30 13:26:31 -07:00
Jay Zhuang
6b34eb0ebc Add remote compaction read/write bytes statistics (#8939)
Summary:
Add basic read/write bytes statistics on the primary side:
`REMOTE_COMPACT_READ_BYTES`
`REMOTE_COMPACT_WRITE_BYTES`

Fixed existing statistics missing some IO for remote compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8939

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D31074672

Pulled By: jay-zhuang

fbshipit-source-id: c57afdba369990185008ffaec7e3fe7c62e8902f
2021-09-28 14:00:37 -07:00
Hui Xiao
d6bd1a0291 Support "level_at_creation" in TablePropertiesCollectorFactory::Context (#8919)
Summary:
Context:
Exposing the level of the sst file (i.e, table) where it is created in `TablePropertiesCollectorFactory::Context` allows users of `TablePropertiesCollectorFactory` to customize some implementation details of `TablePropertiesCollectorFactory` and `TablePropertiesCollector` based on the level of creation. For example, `TablePropertiesCollector::NeedCompact()` can return different values based on level of creation.
- Declared an extra field `level_at_creation` in `TablePropertiesCollectorFactory::Context`
- Allowed `level_at_creation` to be passed in as an argument in `IntTblPropCollectorFactory::CreateIntTblPropCollector()` and `UserKeyTablePropertiesCollectorFactory::CreateIntTblPropCollector()`, the latter of which is an internal wrapper of user's passed-in `TablePropertiesCollectorFactory::CreateTablePropertiesCollector()` used in table-building process
- Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` passed into both `BlockBasedTableBuilder` and `PlainTableBuilder`
  -  `PlainTableBuilder` previously did not capture `level_at_creation` from `TableBuilderOptions` in `PlainTableFactory`. In order for it to call the method with this parameter, this PR also made `PlainTableBuilder` capture `level_at_creation` as a required parameter
- Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` its overridden functions in its derived classes, including `RegularKeysStartWithAFactory::CreateIntTblPropCollector()` in `table_properties_collector_test.cc`, `SstFileWriterPropertiesCollectorFactory::CreateIntTblPropCollector()` in `sst_file_writer_collectors.h`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8919

Test Plan:
- Passed the added assertion for `context.level_at_creation`
- Passed existing tests
- Run `Make` to make sure adding a required parameter to `PlainTableBuilder`'s constructor does not break anything

Reviewed By: anand1976

Differential Revision: D30951729

Pulled By: hx235

fbshipit-source-id: c4a0173b0d9344a4cf47e1b987d759c1c73cb474
2021-09-28 12:35:24 -07:00
mrambacher
7fd68b7c39 Make WalFilter, SstPartitionerFactory, FileChecksumGenFactory, and TableProperties Customizable (#8638)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8638

Reviewed By: zhichao-cao

Differential Revision: D31024729

Pulled By: mrambacher

fbshipit-source-id: 954c04ccab0b8dee64050a27aadf78ed119106c0
2021-09-28 05:32:02 -07:00
sdong
b88109db19 Pollute buffer before calling Read() (#8955)
Summary:
Add a paranoid check where in case FileSystem layer doesn't fill the buffer but returns succeed, checksum is unlikely to match even if buffer contains a previous block. The byte modified is not useful anyway, so it isn't expect to change any behavior.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8955

Test Plan: See existing CI to pass.

Reviewed By: pdillinger

Differential Revision: D31183966

fbshipit-source-id: dcc4de429e18131873f783b90d3be55d7eb44a1f
2021-09-27 21:30:28 -07:00
Akanksha Mahajan
a2f29ce70a Update History.md for SingleDelete with user defined timestamp (#8964)
Summary:
Update History.md for SingleDelete with user defined timestamp

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8964

Reviewed By: zhichao-cao

Differential Revision: D31216214

Pulled By: akankshamahajan15

fbshipit-source-id: 0520132c75fe8f6823d154e41585b0df3086c04d
2021-09-27 14:58:30 -07:00
mrambacher
e0f697d2bd Make SliceTransform into a Customizable class (#8641)
Summary:
Made SliceTransform into a Customizable class.

Would be nice to write a test that stored and used a custom transform  in an SST table.

There are a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest that run the same with or without a SliceTransform/PrefixFilter.  Is this expected?

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641

Reviewed By: zhichao-cao

Differential Revision: D31142793

Pulled By: mrambacher

fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1
2021-09-27 07:43:47 -07:00
Yanqin Jin
b92cef2d1d Sort per-file blob read requests by offset (#8953)
Summary:
`RandomAccessFileReader::MultiRead()` tries to merge requests in direct IO, assuming input IO requests are
sorted by offsets.

Add a test in direct IO mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8953

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D31183546

Pulled By: riversand963

fbshipit-source-id: 5d043ec68e2daa47a3149066150afd41ee3d73e6
2021-09-24 22:14:30 -07:00
Hui Xiao
b25f2afeff Return Status::NotSupported() in RateLimiter::GetTotalPendingRequests default impl (#8950)
Summary:
Context:
After more discussion, a fix in https://github.com/facebook/rocksdb/issues/8938 might turn out to be too restrictive for the case where `GetTotalPendingRequests` might be invoked on RateLimiter classes that does not support the recently added API `RateLimiter::GetTotalPendingRequests` (https://github.com/facebook/rocksdb/issues/8890) due to the `assert(false)` in https://github.com/facebook/rocksdb/issues/8938. Furthermore, sentinel value like `-1` proposed in https://github.com/facebook/rocksdb/issues/8938 is easy to be ignored and unchecked. Therefore we decided to adopt `Status::NotSupported()`, which is also a convention of adding new API to public header in RocksDB.
- Changed return value type of  `RateLimiter::GetTotalPendingRequests` in related declaration/definition
- Passed in pointer argument to hold the output instead of returning it as before
- Adapted to the changes above in calling `RateLimiter::GetTotalPendingRequests` in test
- Minor improvement to `TEST_F(RateLimiterTest, GetTotalPendingRequests)`:  added failure message for assertion and replaced repetitive statements with a loop

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8950

Reviewed By: ajkr, pdillinger

Differential Revision: D31128450

Pulled By: hx235

fbshipit-source-id: 282ac9c4f3dacaa0aec6d0a993161f77ad47a040
2021-09-22 19:36:06 -07:00
sdong
c988e4720b Add HISTORY.md entry to a recent bug fix. (#8948)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8948

Reviewed By: anand1976

Differential Revision: D31127368

fbshipit-source-id: a374cb0baf88c3e15cd587a8f31e8a2d84432928
2021-09-22 16:23:08 -07:00
Hui Xiao
58444eadda Make RateLimiter::GetTotalPendingRequest() non pure virtual for backward compability (#8938)
Summary:
Context/Summary:
https://github.com/facebook/rocksdb/pull/8890 added a public API `RateLimiter::GetTotalPendingRequest()` but mistakenly marked it as pure virtual, forcing RateLimiter's derived classes to implement this function and breaking backward compatibility.

This PR makes `RateLimiter::GetTotalPendingRequest()` as non-pure virtual method by providing a trivial implementation in rate_limiter.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8938

Test Plan: Passing existing tests

Reviewed By: pdillinger

Differential Revision: D31100661

Pulled By: hx235

fbshipit-source-id: 06eff1005156a6e5a881e393b2c5b2ad706897d8
2021-09-21 21:29:26 -07:00
Peter Dillinger
5268cdc997 Finish BackupEngine migration to IOStatus (#8940)
Summary:
Updates a few remaining functions that should have been updated
from Status -> IOStatus, and adds to HISTORY for the overall change
including https://github.com/facebook/rocksdb/issues/8820.

This change is for inclusion in version 6.25.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8940

Test Plan: CI

Reviewed By: zhichao-cao

Differential Revision: D31085029

Pulled By: pdillinger

fbshipit-source-id: 91557c6a39ef1d90357d4f4dcd79af0645d87c7b
2021-09-21 11:13:17 -07:00
mrambacher
6924869867 Make SystemClock into a Customizable Class (#8636)
Summary:
Made SystemClock into a Customizable class, complete with CreateFromString.

Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv).

Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636

Reviewed By: zhichao-cao

Differential Revision: D30483360

Pulled By: mrambacher

fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1
2021-09-21 09:23:48 -07:00
Peter Dillinger
d497cdfbb2 Update version to 6.25.0 (#8935)
Summary:
for release

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8935

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D31056726

Pulled By: pdillinger

fbshipit-source-id: 6fd022c39c19c35f10a2367df45dd2deb43df510
2021-09-20 11:22:41 -07:00
anand76
99fe4c5005 Add a gflag for IO uring enable/disable (#8931)
Summary:
In case of IO uring bugs, we need to provide a way for users to turn it off.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8931

Test Plan: Manually run db_bench with/without the option and verify the behavior

Reviewed By: pdillinger

Differential Revision: D31040252

Pulled By: anand1976

fbshipit-source-id: 56f2537d6ac8488c9e126296d8190ad9e0158f70
2021-09-18 10:24:56 -07:00
Jay Zhuang
1c290c785d RemoteCompaction support Fallback to local compaction (#8709)
Summary:
Add support for fallback to local compaction, the user can
return `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to
run the compaction locally instead of waiting for the remote compaction
result.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8709

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30560163

Pulled By: jay-zhuang

fbshipit-source-id: 65d8905a4a1bc185a68daa120997f21d3198dbe1
2021-09-18 00:25:04 -07:00
Yanqin Jin
b512f4bc76 Batch blob read IO for MultiGet (#8699)
Summary:
In batched `MultiGet()`, RocksDB batches blob read IO and uses `RandomAccessFileReader::MultiRead()`
to read the blobs instead of issuing multiple `Read()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8699

Test Plan:
```
make check
```

Reviewed By: ltamasi

Differential Revision: D31030861

Pulled By: riversand963

fbshipit-source-id: a0df6060cbfd54cff9515a4eee08807b1dbcb0c8
2021-09-17 19:23:13 -07:00
Peter Dillinger
4149d044cd Change SstFileMetaData::size from size_t to uint64_t (#8926)
Summary:
Because even 32-bit systems can have large files

This is a "change" that I don't want intermingled with an upcoming refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8926

Test Plan: CI

Reviewed By: zhichao-cao

Differential Revision: D31020974

Pulled By: pdillinger

fbshipit-source-id: ca9eb4510697df6f1f55e37b37730b88b1809a92
2021-09-17 13:23:34 -07:00
mrambacher
272cc77751 Added a default Name method to Statistics (#8918)
Summary:
This keeps the implementations/API backward compatible.  Implementations of Statistics will need to override this method (and be registered with the ObjectRegistry) in order to be created via CreateFromString.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8918

Reviewed By: pdillinger

Differential Revision: D30958916

Pulled By: mrambacher

fbshipit-source-id: 75b99a84e9e11fda2a9e8eff9ee1ef69a17517b2
2021-09-17 07:25:43 -07:00
Akanksha Mahajan
d6aa8c49f8 Expose blob file information through the EventListener interface (#8675)
Summary:
1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files.
 2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
 3. Test OnFile*Finish operations notifications with Blob Files.
 4. Log the blob file creation/deletion events through EventLogger in Log file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675

Test Plan: Add new unit tests in listener_test

Reviewed By: ltamasi

Differential Revision: D30412613

Pulled By: akankshamahajan15

fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07
2021-09-16 17:23:36 -07:00
Jay Zhuang
b97c53b629 Add compaction priority information in RemoteCompaction (#8707)
Summary:
Add compaction priority information in RemoteCompaction, which
can be used to schedule high priority job first.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8707

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30548401

Pulled By: jay-zhuang

fbshipit-source-id: b30446511fb31b4583c49edd8565d496cf013a34
2021-09-16 15:09:35 -07:00
Peter Dillinger
2819c7840e Fix PrepopulateBlockCache::kFlushOnly (#8750)
Summary:
kFlushOnly currently means "always" except in the case of
remote compaction. This makes it flushes only.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8750

Test Plan: test updated

Reviewed By: akankshamahajan15

Differential Revision: D30968034

Pulled By: pdillinger

fbshipit-source-id: 5dbd24dde18852a0e937a540995fba9bfbe89037
2021-09-15 15:33:20 -07:00
anand76
7743f033b1 More robust checking of IO uring completion data (#8894)
Summary:
Potential bugs in the IO uring implementation can cause bad data to be returned in the completion queue. Add some checks in the PosixRandomAccessFile::MultiRead completion handling code to catch such errors and fail the entire MultiRead. Also log some diagnostic messages and stack trace.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8894

Reviewed By: siying, pdillinger

Differential Revision: D30826982

Pulled By: anand1976

fbshipit-source-id: af91815ac760e095d6cc0466cf8bd5c10167fd15
2021-09-15 12:44:43 -07:00
eharry
0b6be7eb68 Fix WAL log data corruption #8723 (#8746)
Summary:
Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together (https://github.com/facebook/rocksdb/issues/8723)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8746

Reviewed By: ajkr

Differential Revision: D30758468

Pulled By: riversand963

fbshipit-source-id: 07c20899d5f2447dc77861b4845efc68a59aa4e8
2021-09-13 20:15:59 -07:00
Yanqin Jin
2a2b3e03a5 Allow WriteBatch to have keys with different timestamp sizes (#8725)
Summary:
In the past, we unnecessarily requires all keys in the same write batch
to be from column families whose timestamps' formats are the same for
simplicity. Specifically, we cannot use the same write batch to write to
two column families, one of which enables timestamp while the other
disables it.

The limitation is due to the member `timestamp_size_` that used to exist
in each `WriteBatch` object. We pass a timestamp_size to the constructor
of `WriteBatch`. Therefore, users can simply use the old
`WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while
the internal implementation of `WriteBatch` will take care of memory
allocation for timestamps.

The above is not necessary.
One the one hand, users can set up a memory buffer to store user key and
then contiguously append the timestamp to the user key. Then the user
can pass this buffer to the `WriteBatch::Put(Slice&)` API.
On the other hand, users can set up a SliceParts object which is an
array of Slices and let the last Slice to point to the memory buffer
storing timestamp. Then the user can pass the SliceParts object to the
`WriteBatch::Put(SliceParts&)` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30654499

Pulled By: riversand963

fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2
2021-09-12 15:34:26 -07:00
Levi Tamasi
5f40b05c98 Update HISTORY.md for PR 8899 (#8905)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8905

Reviewed By: zhichao-cao

Differential Revision: D30873416

Pulled By: ltamasi

fbshipit-source-id: 6e55ec14a7fd2e562aa24cd0274e2436369923f5
2021-09-12 08:19:05 -07:00
Hui Xiao
12542488ef Add public API RateLimiter::GetTotalPendingRequests() (#8890)
Summary:
Context/Summary:
As users requested, a public API RateLimiter::GetTotalPendingRequests() is added to expose the total number of pending requests for bytes in the rate limiter, which is the size of the request queue of that priority (or of all priorities, if IO_TOTAL is interested) at the time when this API is called.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8890

Test Plan:
- Passing added new unit tests
- Passing existing unit tests

Reviewed By: ajkr

Differential Revision: D30815500

Pulled By: hx235

fbshipit-source-id: 2dfa990f651c1c47378b6215c751ad76a5824300
2021-09-10 08:37:04 -07:00