Commit Graph

10561 Commits

Author SHA1 Message Date
Yanqin Jin
9b53f14a35 Fixed a bug in CompactionIterator when write-preared transaction is used (#9060)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060

RocksDB bottommost level compaction may zero out an internal key's sequence if
the key's sequence is in the earliest_snapshot.
In write-prepared transaction, checking the visibility of a certain sequence in
a specific released snapshot may return a "snapshot released" result.
Therefore, it is possible, after a certain sequence of events, a PUT has its
sequence zeroed out, but a subsequent SingleDelete of the same key will still
be output with its original sequence. This violates the ascending order of
keys and leads to incorrect result.

The solution is to use an extra variable `last_key_seq_zeroed_` to track the
information about visibility in earliest snapshot. With this variable, we can
know for sure that a SingleDelete is in the earliest snapshot even if the said
snapshot is released during compaction before processing the SD.

Reviewed By: ltamasi

Differential Revision: D31813016

fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1
2021-11-03 15:55:00 -07:00
Jack Feng
56810142a5 Fixing child jobs of rocksdb for duplo deprecation (#9117)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9117

This fixes the loosely formatted json child job spec. I.e., trailing commas

Reviewed By: jay-zhuang

Differential Revision: D32119606

fbshipit-source-id: 0ef571ccbef0e638df18f17288611d3b5774f129
2021-11-03 14:56:26 -07:00
Jay Zhuang
29102641dd Skip directory fsync for filesystem btrfs (#8903)
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
   new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
   synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
   to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
   `IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
   forced, the same as above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903

Test Plan: run tests on btrfs and non btrfs

Reviewed By: ajkr

Differential Revision: D30885059

Pulled By: jay-zhuang

fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
2021-11-03 12:21:27 -07:00
Levi Tamasi
081722780b Refactor the detailed consistency checks and the SST saving logic in VersionBuilder (#9099)
Summary:
The patch refactors the parts of `VersionBuilder` that deal with SST file
comparisons. Specifically, it makes the following changes:
* Turns `NewestFirstBySeqNo` and `BySmallestKey` from free-standing
functions into function objects. Note: `BySmallestKey` has a pointer to the
`InternalKeyComparator`, while `NewestFirstBySeqNo` is completely
stateless.
* Eliminates the wrapper `FileComparator`, which was essentially an
unnecessary DIY virtual function call mechanism.
* Refactors `CheckConsistencyDetails` and `SaveSSTFilesTo` using helper
function templates that take comparator/checker function objects. Using
static polymorphism eliminates the need to make runtime decisions about
which comparator to use.
* Extends some error messages returned by the consistency checks and
makes them more uniform.
* Removes some incomplete/redundant consistency checks from `VersionBuilder`
and `FilePicker`.
* Improves const correctness in several places.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9099

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D32027503

Pulled By: ltamasi

fbshipit-source-id: 621326ae41f4f55f7ad6a91abbd6e666d5c7857c
2021-11-03 11:52:47 -07:00
Peter Dillinger
2b60621f16 Don't call OnTableFileCreated with OK for empty+deleted file (#9118)
Summary:
EventListener::OnTableFileCreated was previously called with OK
status and file_size==0 in cases of no SST file contents written
(because there was no content to add) and the empty file deleted before
calling the listener. This could lead to a stress test assertion failure
added in https://github.com/facebook/rocksdb/issues/9054.

This changes the status to Aborted, to align with the API doc:
"... if the file is successfully created. Now it will also be called on
failure case. User can check info.status to see if it succeeded or not."
For internal purposes, this case is considered "success" but for
listener purposes, no SST file is (successfully) created.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9118

Test Plan: test case added + existing db_stress

Reviewed By: ajkr, riversand963

Differential Revision: D32120232

Pulled By: pdillinger

fbshipit-source-id: a804e2e0a52598018d3b182da97804d402ffcdfa
2021-11-03 08:43:27 -07:00
Peter Dillinger
21f8a57f2a Fix TSAN report on MemPurge test (#9115)
Summary:
TSAN reported data race on count variables in MemPurgeBasic
test. This suggests the test could fail if mempurges were slow enough
that they don't complete before the count variables being checked, but
injecting a long sleep into MemPurge (outside DB mutex) confirms that
blocked writes ensure enough mempurges/flushes happen to make the test
pass. All the possible different values on testing should be OK to make
the test pass.

So this change makes the variables atomic so that up-to-date value is
always read and TSAN report suppressed. I have also used `.exchange(0)`
to make the checking less stateful by "popping off" all the accumulated
counts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9115

Test Plan: updated test, watch for any flakiness

Reviewed By: riversand963

Differential Revision: D32114432

Pulled By: pdillinger

fbshipit-source-id: c985609d39896a0d8f69ebc87b221e688609bdd8
2021-11-02 21:54:29 -07:00
Andrew Kryczka
67a7b74b7f Clarify setting CompressionOptions::max_dict_bytes > 0 will charge block cache (#9119)
Summary:
Add it to the API comment.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9119

Reviewed By: hx235

Differential Revision: D32124238

Pulled By: ajkr

fbshipit-source-id: d1f82037417d883f2000f2d62995a7708dda77c6
2021-11-02 21:43:50 -07:00
Peter Dillinger
82afa01815 Some API clarifications (#9080)
Summary:
* Clarify that RocksDB is not exception safe on many of our callback
and extension interfaces
* Clarify FSRandomAccessFile::MultiRead implementations must accept
non-sorted inputs (see https://github.com/facebook/rocksdb/issues/8953)
* Clarify ConcurrentTaskLimiter and SstFileManager are not (currently)
extensible interfaces
* Mark WriteBufferManager as `final`, so it is then clearly not a
callback interface, even though it smells like one
* Clarify TablePropertiesCollector Status returns are mostly ignored

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9080

Test Plan: comments only (except WriteBufferManager final)

Reviewed By: ajkr

Differential Revision: D31968782

Pulled By: pdillinger

fbshipit-source-id: 11b648ce3ce3c5e5bdc02d2eafc7ea4b864bd1d2
2021-11-02 20:30:07 -07:00
mrambacher
f72c834eab Make FileSystem a Customizable Class (#8649)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8649

Reviewed By: zhichao-cao

Differential Revision: D32036059

Pulled By: mrambacher

fbshipit-source-id: 4f1e7557ecac52eb849b83ae02b8d7d232112295
2021-11-02 09:07:11 -07:00
Levi Tamasi
cfc57f55b5 Mention PR 9100 in HISTORY.md (#9111)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9111

Reviewed By: riversand963

Differential Revision: D32076310

Pulled By: ltamasi

fbshipit-source-id: 81e30a02ded87c0f1a42985db42e80b62235ba11
2021-11-01 15:22:32 -07:00
Alan Paxton
ec9082d698 Regression tests for tickets fixed by previous change. (#9019)
Summary:
closes https://github.com/facebook/rocksdb/issues/5891
closes https://github.com/facebook/rocksdb/issues/2001

Java BytewiseComparator is now unsigned compliant, consistent with the default C++ comparator, which has always been thus. Consequently 2 tickets reporting the previous broken state can be closed.

 This test confirms that the following issues were in fact resolved
 by a change made between 6.2.2 and 6.22.1,
 to wit https://github.com/facebook/rocksdb/commit/7242dae7
which as part of its effect, changed the Java bytewise comparators.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9019

Reviewed By: pdillinger

Differential Revision: D31610910

Pulled By: mrambacher

fbshipit-source-id: 664230f1377a1aa270136edd63eea2c206b907e9
2021-11-01 15:06:47 -07:00
Hui Xiao
560fe70233 Add new API CacheReservationManager::GetDummyEntrySize() (#9072)
Summary:
Note: it might conflict with another CRM related PR https://github.com/facebook/rocksdb/pull/9071 and so will merge after that's merged.

Context:
As `CacheReservationManager` being used by more memory users, it is convenient to retrieve the dummy entry size for `CacheReservationManager` instead of hard-coding `256 * 1024` in writing tests. Plus it allows more flexibility to change our implementation on dummy entry size.

A follow-up PR is needed to replace those hard-coded dummy entry size value in `db_test2.cc`, `db_write_buffer_manager_test.cc`, `write_buffer_manager_test.cc`, `table_test.cc` and the ones introduced in https://github.com/facebook/rocksdb/pull/9072#issue-1034326069.
- Exposed the private static constexpr `kDummyEntrySize` through public static `CacheReservationManager::GetDummyEntrySize()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9072

Test Plan:
- Passing new tests
- Passing existing tests

Reviewed By: ajkr

Differential Revision: D32043684

Pulled By: hx235

fbshipit-source-id: ddefc6921c052adab6a2cda2394eb26da3076a50
2021-11-01 14:46:09 -07:00
sdong
a2b9be42b6 Try to start TTL earlier with kMinOverlappingRatio is used (#8749)
Summary:
Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles.

When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed.

In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749

Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.

Reviewed By: jay-zhuang

Differential Revision: D30735261

fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
2021-11-01 14:36:31 -07:00
hx235
a5ec5e3ea0 Minor improvement to #8428 (Account for dictionary-building buffer in global memory limit) (#9032)
Summary:
Summary/Context:
- Renamed `cache_rev_mng` to `compression_dict_buffer_cache_res_mgr`
   - It is to distinguish with other potential `cache_res_mgr` in `BlockBasedTableBuilder` and to use correct short-hand for the words "reservation", "manager"
- Added `table_options.block_cache == nullptr` in additional to `table_options.no_block_cache == true` to be conditions where we don't create a `CacheReservationManager`
   - Theoretically `table_options.no_block_cache == true` is equivalent to `table_options.block_cache == nullptr` by API. But since segment fault will be generated by passing `nullptr` into `CacheReservationManager`'s constructor, it does not hurt to directly verify  `table_options.block_cache != nullptr` before passing in
- Renamed `is_cache_full` to `exceeds_global_block_cache_limit`
   - It is to hide implementation detail of cache reservation and to emphasize on the concept/design intent of caping memory within global block cache limit

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9032

Test Plan: - Passing existing tests

Reviewed By: ajkr

Differential Revision: D32005807

Pulled By: hx235

fbshipit-source-id: 619fd17bb924199de3db5924d8ab7dae53b1efa2
2021-11-01 14:28:09 -07:00
leipeng
230c98f3ce fix histogram NUM_FILES_IN_SINGLE_COMPACTION (#9026)
Summary:
currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9026

Reviewed By: ajkr

Differential Revision: D31668241

Pulled By: jay-zhuang

fbshipit-source-id: c02f6c4a5df9fbf0b7510036594811152e8738af
2021-11-01 12:57:27 -07:00
Levi Tamasi
b1c27a52d2 Add a consistency check that prevents the overflow of garbage in blob files (#9100)
Summary:
The number or total size of garbage blobs in any given blob file can
never exceed the number or total size of all blobs in the file. (This
would be a similar error to e.g. attempting to remove from the LSM tree
an SST file that has already been removed.) The patch builds on
https://github.com/facebook/rocksdb/issues/9085 and adds a
consistency check to `VersionBuilder` that prevents the above from
happening.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9100

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D32048982

Pulled By: ltamasi

fbshipit-source-id: 6f7e0793bf534ad04c3359cc0f696b8e4e5ef81c
2021-11-01 12:32:14 -07:00
Alan Paxton
73e6b89fad Java wrapper for blob_gc_force_threshold as blobGarbageCollectionForceThreshold (#9109)
Summary:
Extra option added as a supplement to https://github.com/facebook/rocksdb/pull/8999

Closes https://github.com/facebook/rocksdb/issues/8221

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9109

Reviewed By: mrambacher

Differential Revision: D32065039

Pulled By: ltamasi

fbshipit-source-id: 6c484050a30fe0523850a8a3c95dc85b0a501362
2021-11-01 11:59:10 -07:00
leipeng
2b70224f82 remove bad extra RecordTick(stats_, WRITE_WITH_WAL) (#9064)
Summary:
This PR fix wrong ticker `WRITE_WITH_WAL`.

`RecordTick(WRITE_WITH_WAL)` will be called later in `WriteToWAL` and `ConcurrentWriteToWAL`.

Fixes:
1. Delete these two extra `RecordTick(WRITE_WITH_WAL)`
2. Fix corresponding test case

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9064

Reviewed By: ajkr

Differential Revision: D31944459

Pulled By: riversand963

fbshipit-source-id: f1aa8d2a4320456bc357bc5b0902032f7dcad086
2021-11-01 11:43:14 -07:00
Yanqin Jin
0e12b1d691 Update buckify scripts (#9104)
Summary:
49af999954
updates RocksDB buckifier script directly via fbcode. We need to make
sure that the following command run in RocksDB repo generate the same
TARGETS file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9104

Test Plan:
```
$python buckifier/buckify_rocksdb.py
```
Verify that TARGETS file does not have uncommitted changes.

Reviewed By: jay-zhuang

Differential Revision: D32055387

Pulled By: riversand963

fbshipit-source-id: 19cf1b8145095b6df625958458189680e543e3ba
2021-11-01 10:11:18 -07:00
leipeng
01bd86ad35 InternalStats::DumpCFMapStat: fix sum.w_amp (#9065)
Summary:
sum `w_amp` will be a very large number`(bytes_written + bytes_written_blob)` when there is no any flush and ingest.

This PR set sum `w_amp` to zero if there is no any flush and ingest, this is conform to per-level `w_amp` computation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9065

Reviewed By: ajkr

Differential Revision: D31943994

Pulled By: riversand963

fbshipit-source-id: acbef5e331debebfad09e0e0d8d0885ebbc00609
2021-10-31 23:11:43 -07:00
Yanqin Jin
d263505417 Avoid div-by-zero error in db_stress (#9086)
Summary:
If a column family has 0 levels, then existing `TestCompactFiles(...)` may hit
divide-by-zero. To fix, return early if the cf is empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9086

Test Plan: TBD

Reviewed By: ajkr

Differential Revision: D31986799

Pulled By: riversand963

fbshipit-source-id: 48f7dfb2b2b47cfc1315cb71ca80eb230d947f17
2021-10-31 22:16:03 -07:00
Yanqin Jin
8e59a1dc9a Attempt to deflake ListenerTest.MultiCF (#9084)
Summary:
EventListenerTest.MultiCF uses TestFlushListener which has members
flushed_dbs_ and flushed_column_family_names_ that are not protected by
locks. This implicitly indicates that we need to ensure the methods
accessing these data structures in a single threaded way. In other
tests, e.g. MultiDBMultiListeners, we use TEST_WaitForFlushMemtable() to
wait until all memtables of a given column family are flushed, hence no
pending flush threads will concurrently call OnFlushCompleted() and
cause data race for flushed_dbs_. To fix a test failure, we should do
the same for MultiCF.

Example data race stack traces reported by TSAN
```
Read of size 8 at 0x7b6000002840 by main thread:
    #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:655:40
    https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:380:7

Previous write of size 8 at 0x7b6000002840 by thread T2:
    #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_emplace_back_aux<rocksdb::DB* const&>(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/vector.tcc:442:26
    https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:923:4
    https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9084

Test Plan: ./listener_test --gtest_filter=EventListenerTest.MultiCF

Reviewed By: jay-zhuang

Differential Revision: D31952259

Pulled By: riversand963

fbshipit-source-id: 94a7f29e4e9466ead42418944eb2247fc32bd499
2021-10-31 22:12:15 -07:00
Yanqin Jin
8f4f302316 Attempt to deflake DBFlushTest.FireOnFlushCompletedAfterCommittedResult (#9083)
Summary:
DBFlushTest.FireOnFlushCompletedAfterCommittedResult uses test sync
points to coordinate interleaving of different threads. Before this PR,
the test writes some data to memtable, triggers a manual flush, and
triggers a second manual flush after a first bg flush thread starts
executing. Though unlikely, it is possible for the second bg flush
thread to run faster than the first bg flush thread and deques flush
queue first. In this case, the original test will fail.
The fix is to wait until the first bg flush thread deques the flush
queue before triggering second manual flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9083

Test Plan: ./db_flush_test --gtest_filter=DBFlushTest.FireOnFlushCompletedAfterCommittedResult

Reviewed By: jay-zhuang

Differential Revision: D31951239

Pulled By: riversand963

fbshipit-source-id: f32d7cdabe6ad6808fd18e54e663936dc0a9edb4
2021-10-31 22:08:48 -07:00
CodemodService Bot
49af999954 internal_repo_rocksdb/repo
Reviewed By: DrMarcII

Differential Revision: D32033741

fbshipit-source-id: af12d9d72f109a4a2837cb64e02fa0dbc9175711
2021-10-29 19:34:39 -07:00
Levi Tamasi
44d04582cb Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085)
Summary:
The current VersionBuilder code on mainline keeps track of blob file related
changes ("delta") induced by a series of `VersionEdit`s in the form of
`BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta`
contains the amount of additional garbage generated by compactions, as well
as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves,
since in that case the newly linked and unlinked SSTs cancel each other out.
However, this representation does not allow us to easily tell whether a certain
blob file is obsolete after applying a set of `VersionEdit`s or not. In order to
solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition
to the delta, also contains the materialized state after applying a set of version edits
(i.e. the total amount of garbage and the resulting set of linked SSTs). This will
enable us to add further consistency checks and to improve certain pieces of
functionality where knowing up front which blob files get obsoleted is beneficial.
(Note: this patch is just the refactoring part; I plan to create separate PRs for
the enhancements.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085

Test Plan: Ran `make check` and the stress tests in BlobDB mode.

Reviewed By: riversand963

Differential Revision: D31980867

Pulled By: ltamasi

fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
2021-10-29 17:47:02 -07:00
Yanqin Jin
fdf2a0d7eb Fix a compaction bug for write-prepared txn (#9061)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061

In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:

```
earliest_snap < earliest_write_conflict_snap
```

If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.

Reviewed By: ltamasi

Differential Revision: D31813017

fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
2021-10-29 15:23:17 -07:00
Jonathan Albrecht
f2d11b3fdc Temporarily disable s390x+cmake* Travis jobs (#9095)
Summary:
Temporarily disable s390x+cmake* jobs until a cmake-3.14.5-Linux-s390x.deb can be installed to https://rocksdb-deps.s3-us-west-2.amazonaws.com.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9095

Reviewed By: akankshamahajan15

Differential Revision: D32025417

Pulled By: riversand963

fbshipit-source-id: eefb9737937987c7d9273482a89e4d2266cd5375
2021-10-29 11:06:50 -07:00
Peter Dillinger
92e2399669 Fix EnvLibrados and add to CI (#9088)
Summary:
This feature was not part of any common or CI build, so no
surprise it broke. Now we can at least ensure compilation. I don't know
how to run the test successfully (missing config file) so it is bypassed
for now.

Fixes https://github.com/facebook/rocksdb/issues/9078

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9088

Test Plan: CI

Reviewed By: mrambacher

Differential Revision: D32009467

Pulled By: pdillinger

fbshipit-source-id: 3e0d1e5fde7f0ece703d48a81479e1cc7392c25c
2021-10-29 08:19:03 -07:00
Peter Dillinger
a7d4bea43a Implement XXH3 block checksum type (#9069)
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.

Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument

This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069

Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.

DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.

Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.

 ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)

As you can see, especially once warmed up, xxh3 is fastest.

 ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

Test

    for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done

Results (ops/sec)

    for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done

results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3

Reviewed By: mrambacher

Differential Revision: D31905249

Pulled By: pdillinger

fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
2021-10-28 22:15:17 -07:00
Andrew Kryczka
f24c39ab3d Prevent corruption with parallel manual compactions and change_level == true (#9077)
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.

Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered  (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.

The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.

Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).

The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.

Thanks to siying for finding the bug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077

Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.

Reviewed By: siying

Differential Revision: D31921908

Pulled By: ajkr

fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
2021-10-27 23:08:56 -07:00
Peter Dillinger
5bf9a7d5ee Clarify caching behavior for index and filter partitions (#9068)
Summary:
Somewhat confusingly, index and filter partition blocks are
never owned by table readers, even with
cache_index_and_filter_blocks=false. They still go into block cache
(possibly pinned by table reader) if there is a block cache. If no block
cache, they are only loaded transiently on demand.

This PR primarily clarifies the options APIs and some internal code
comments.

Also, this closes a hypothetical data corruption vulnerability where
some but not all index partitions are pinned. I haven't been able to
reproduce a case where it can happen (the failure seems to propagate
to abort table open) but it's worth patching nonetheless.

Fixes https://github.com/facebook/rocksdb/issues/8979

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9068

Test Plan:
existing tests :-/  I could cover the new code using sync
points, but then I'd have to very carefully relax my `assert(false)`

Reviewed By: ajkr

Differential Revision: D31898284

Pulled By: pdillinger

fbshipit-source-id: f2511a7d3a36bc04b627935d8e6cfea6422f98be
2021-10-27 17:23:04 -07:00
Calin Culianu
82846f41d3 Fix incorrect order of comments in win_thread.cc (#9033)
Summary:
The comments in the `#endif` section at the end of the file were in the
wrong order.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9033

Reviewed By: mrambacher

Differential Revision: D31935856

Pulled By: ajkr

fbshipit-source-id: 24aca039993d6e27022cfe8d6434e90f2934c87c
2021-10-27 13:25:01 -07:00
Peter Dillinger
4ec31dc8ac Make format-diff.sh locale-independent (#9079)
Summary:
Force POSIX locale for calls to 'git remote' that might have
locale-dependent formatting, as shown in https://github.com/facebook/rocksdb/issues/8731 comment

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9079

Test Plan:
manual (haven't tried on a machine with non-english default
locale)

Reviewed By: ltamasi

Differential Revision: D31943092

Pulled By: pdillinger

fbshipit-source-id: 7dbe5915824f39f73b412cc3d1a86a2521cf76c1
2021-10-27 12:26:36 -07:00
myasuka
dc00e4b120 Introduce allowStall option for write buffer manager constructor (#9076)
Summary:
https://github.com/facebook/rocksdb/pull/7898 enable write buffer manager to stall write when memory_usage exceeds buffer_size, this is really useful for container running case to limit the memory usage. However, this feature is not visiable for rocksJava yet.

This PR targets to introduce this feature for rocksJava.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9076

Reviewed By: akankshamahajan15

Differential Revision: D31931092

Pulled By: anand1976

fbshipit-source-id: 5531c16a87598663a02368c07b5e13a503164578
2021-10-26 12:09:54 -07:00
Jonathan Albrecht
e970248602 Add support for building on s390x platform (#8962)
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.

There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with?

1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image

Not sure if there is more required for travis. Happy to help in any way I can.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962

Reviewed By: mrambacher

Differential Revision: D31802198

Pulled By: pdillinger

fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
2021-10-22 10:13:15 -07:00
Yanqin Jin
f72fd58565 Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.

When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.

After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D31639225

Pulled By: riversand963

fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-20 21:34:47 -07:00
sdong
633f069c29 Update Release Version to 6.26 (#9059)
Summary:
Before cutting release branch 6.26, update version.h and release notes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9059

Reviewed By: ajkr

Differential Revision: D31805126

fbshipit-source-id: ae85ccf06ec756fa21163161f53fd0b728e6e32e
2021-10-20 15:32:01 -07:00
leipeng
0a73ada7b5 remove unused local obj and simpilify comple code (#9052)
Summary:
This PR does not change code sematics, it just changes for:

1. local obj `nonmem_w` and `lfile` are unused
2. null check for `delete ptr` is unnecessary
3. use `unique_ptr::reset` instead of `release` + `delete`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9052

Reviewed By: zhichao-cao

Differential Revision: D31801661

Pulled By: anand1976

fbshipit-source-id: 16a77d45da8c8833bf5bf3bce546bb3711b335df
2021-10-20 14:08:05 -07:00
leipeng
0c53b41856 db_impl_write.cc: use stats_ instead of immutable_db_options_.stats (#9053)
Summary:
This PR has no semantic changes, just to make code shorter.

`stats_` has value same with `immutable_db_options_.stats`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9053

Reviewed By: zhichao-cao

Differential Revision: D31801603

Pulled By: anand1976

fbshipit-source-id: cbd8fe478d3e90ae078ace49b4f2eb9bb028ccf6
2021-10-20 14:04:59 -07:00
Andrew Kryczka
4217d1bce7 Support GetMapProperty() with "rocksdb.dbstats" (#9057)
Summary:
This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057

Test Plan: new unit test

Reviewed By: zhichao-cao

Differential Revision: D31781495

Pulled By: ajkr

fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b
2021-10-20 13:17:00 -07:00
sdong
c66b4429ff Incremental Space Amp Compactions in Universal Style (#8655)
Summary:
This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty.

In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control.

Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655

Test Plan: Add unit tests to the functionality.

Reviewed By: ajkr

Differential Revision: D31787034

fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 10:04:13 -07:00
Zhichao Cao
6d93b87588 Add lowest_used_cache_tier to ImmutableDBOptions to enable or disable Secondary Cache (#9050)
Summary:
Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By set it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050

Test Plan: added new tests

Reviewed By: anand1976

Differential Revision: D31744769

Pulled By: zhichao-cao

fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6
2021-10-19 15:54:23 -07:00
Jay Zhuang
f20b07cebb Add "Java API Changes" session in HISTORY (#9055)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9055

Reviewed By: ajkr

Differential Revision: D31765398

Pulled By: jay-zhuang

fbshipit-source-id: 77ed67d69415c9fbbfc1132b15310b293e3939c6
2021-10-19 15:23:06 -07:00
sdong
f053851af6 Ignore non-overlapping levels when determinig grandparent files (#9051)
Summary:
Right now, when picking a compaction, grand parent files are from output_level + 1. This usually works, but if the level doesn't have any overlapping file, it will be more efficient to go further down. This is because the files are likely to be trivial moved further and might create a violation of max_compaction_bytes. This situation can naturally happen and might happen even more with TTL compactions. There is no harm to fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9051

Test Plan: Run existing tests and see it passes. Also briefly run crash test.

Reviewed By: ajkr

Differential Revision: D31748829

fbshipit-source-id: 52b99ab4284dc816d22f34406d528a3c98ff6719
2021-10-19 12:48:18 -07:00
Peter Dillinger
b234a3f569 Improve data block construction performance (#9040)
Summary:
... by bypassing tracking of last_key in BlockBuilder when
last_key is already known (for BlockBasedTableBuilder::data_block).

I tried extracting a base class of BlockBuilder without the last_key
tracking at all, but that became complicated by NewFlushBlockPolicy() in
the public API referencing BlockBuilder, which would need to be the base
class, and I don't want to replace nearly all the internal references to
BlockBuilder.

Possible follow-up:
* Investigate / consider using AddWithLastKey in more places

This improvement should stack with https://github.com/facebook/rocksdb/issues/9039

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9040

Test Plan:
TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000
Compiled with DEBUG_LEVEL=0
Test vs. control runs simulaneous for better accuracy, units = ops/sec

Run 1: 278929 vs. 267799 (+4.2%)
Run 2: 281836 vs. 267432 (+5.4%)
Run 3: 278279 vs. 270454 (+2.9%)

(This benchmark is chosen to have detectable signal-to-noise, not to
represent expected improvement percent on real workloads.)

Reviewed By: mrambacher

Differential Revision: D31706033

Pulled By: pdillinger

fbshipit-source-id: 8a50fe6fefdd67b6d7665ffa687bbdcf5ad0d5ec
2021-10-19 12:36:21 -07:00
Peter Dillinger
0534393fc8 Fix stress/crash test handling of SST unique IDs (#9054)
Summary:
Was not handling the case of OnTableFileCreated invoked for
table file NOT created.

Also improved error reporting and caught a missing status check.

Also strengthened the db_stress listener to require file_size > 0 when
status.ok(). We would be violating the API contract if status is OK and
we didn't create a valid SST file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9054

Test Plan: make blackbox_crash_test for a while

Reviewed By: zhichao-cao

Differential Revision: D31765200

Pulled By: pdillinger

fbshipit-source-id: 7c527f5531bc239a5efd7a7b018545d480f926e2
2021-10-19 11:52:07 -07:00
mrambacher
8fb3fe8d39 Allow unregistered options to be ignored in DBOptions from files (#9045)
Summary:
Adds changes to DBOptions (comparable to ColumnFamilyOptions) to allow some option values to be ignored on rehydration from the Options file.  This is necessary for some customizable classes that were not registered with the ObjectRegistry but are saved/restored from the Options file.

All tests pass.  Will run check_format_compatible.sh shortly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9045

Reviewed By: zhichao-cao

Differential Revision: D31761664

Pulled By: mrambacher

fbshipit-source-id: 300c2251639cce2b223481c3bb2a63877b1f3766
2021-10-19 10:43:04 -07:00
Alan Paxton
8d615a2b1d New-style blob option bindings, Java option getter and improve/fix option parsing (#8999)
Summary:
Implementation of https://github.com/facebook/rocksdb/issues/8221, plus/including extension of Java options API to allow the get() of options from RocksDB. The extension allows more comprehensive testing of options at the Java side, by validating that the options are set at the C++ side.

Variations on methods:
MutableColumnFamilyOptions.MutableColumnFamilyOptionsBuilder getOptions()
MutableDBOptions.MutableDBOptionsBuilder getDBOptions()

retrieve the options via RocksDB C++ interfaces, and parse the resulting string into one of the Java-style option objects.

This necessitated generalising the parsing of option strings in Java, which now parses the full range of option strings returned by the C++ interface, rather than a useful subset. This necessitates the list-separator being changed to :(colon) from , (comma).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8999

Reviewed By: jay-zhuang

Differential Revision: D31655487

Pulled By: ltamasi

fbshipit-source-id: c38e98145c81c61dc38238b0df580db176ce4efd
2021-10-19 09:21:52 -07:00
Peter Dillinger
ad5325a736 Experimental support for SST unique IDs (#8990)
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).

Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990

Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D31582865

Pulled By: pdillinger

fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
2021-10-18 23:32:01 -07:00
anand76
aa21896880 Add property_bag to FileOptions (#9030)
Summary:
Add a property_bag option in FileOptions for direct FileSystem users to pass custom properties to the provider in APIs such as NewRandomAccessFile, NewWritableFile etc. This field will be ignored/not populated by RocksDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9030

Reviewed By: zhichao-cao

Differential Revision: D31630643

Pulled By: anand1976

fbshipit-source-id: 1e1ddc5e2933ecada99a94eada5f309b674a03e8
2021-10-18 23:03:19 -07:00