Commit Graph

4707 Commits

Author SHA1 Message Date
Adam Simpkins
28f54e71f3 fix compile errors in db/kv_checksum.h (#9173)
Summary:
When defining a template class, the constructor should be specified
simply using the class name; it does not take template arguments.a

Apparently older versions of gcc and clang did not complain about this
syntax, but gcc 11.x and recent versions of clang both complain about
this file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9173

Test Plan:
When building with platform010 I got compile errors in this file both
in `mode/dev` (clang) and in `mode/opt-gcc`.  This diff fixes the
compile failures.

Reviewed By: ajkr

Differential Revision: D32455881

Pulled By: simpkins

fbshipit-source-id: 0682910d9e2cdade94ce1e77973d47ac04d9f7e2
2021-11-16 10:20:50 -08:00
Yanqin Jin
2035798834 Update TransactionUtil::CheckKeyForConflict to also use timestamps (#9162)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162

Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
conflict checking. If user-defined timestamp is enabled, it should perform
conflict checking based on timestamps too.

Update TransactionUtil::CheckKey-related methods to verify the timestamp of the
latest version of a key is smaller than the read timestamp. Note that
CheckKeysForConflict() is not updated since it's used only by optimistic
transaction, and we do not plan to update it in this upcoming batch of diffs.

Existing GetLatestSequenceForKey() returns the sequence of the latest
version of a specific user key. Since we support user-defined timestamp, we
need to update this method to also return the timestamp (if enabled) of the
latest version of the key. This will be needed for snapshot validation.

Reviewed By: ltamasi

Differential Revision: D31567960

fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
2021-11-15 12:52:18 -08:00
Akanksha Mahajan
17ce1ca48b Reuse internal auto readhead_size at each Level (expect L0) for Iterations (#9056)
Summary:
RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB.

This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size  at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block.

1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer),  readahead_size will decrease by 8KB.
2. It maintains readahead_size for L1 - Ln levels.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056

Test Plan:
Added new unit tests
Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled.

Reviewed By: anand1976

Differential Revision: D31773640

Pulled By: akankshamahajan15

fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
2021-11-10 16:20:04 -08:00
slk
937fbcbddc Track per-SST user-defined timestamp information in MANIFEST (#9092)
Summary:
Track per-SST user-defined timestamp information in MANIFEST https://github.com/facebook/rocksdb/issues/8957

Rockdb has supported user-defined timestamp feature. Application can specify a timestamp
when writing each k-v pair. When data flush from memory to disk file called SST files, file
creation activity will commit to MANIFEST. This commit is for tracking timestamp info in the
MANIFEST for each file. The changes involved are as follows:
1) Track max/min timestamp in FileMetaData, and fix invoved codes.
2) Add NewFileCustomTag::kMinTimestamp and NewFileCustomTag::kMinTimestamp in
    NewFileCustomTag ( in the kNewFile4 part ), and support invoved codes such as
    VersionEdit Encode and Decode etc.
3) Add unit test code for VersionEdit EncodeDecodeNewFile4, and fix invoved test codes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9092

Reviewed By: ajkr, akankshamahajan15

Differential Revision: D32252323

Pulled By: riversand963

fbshipit-source-id: d2642898d6e3ad1fef0eb866b98045408bd4e162
2021-11-10 10:49:04 -08:00
Yanqin Jin
a113cecfc9 Fix a bug in timestamp-related GC (#9116)
Summary:
For multiple versions (ts + seq) of the same user key, if they cross the boundary of `full_history_ts_low_`,
we should retain the version that is visible to the `full_history_ts_low_`. Namely, we keep the internal key
with the largest timestamp smaller than `full_history_ts_low`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9116

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D32261514

Pulled By: riversand963

fbshipit-source-id: e10f47c254c04c05261440051e4f50cb7d95474e
2021-11-09 13:08:55 -08:00
Zhichao Cao
efaef9b40a cleanup error_handler related code (#9098)
Summary:
Remove code not in use, add comments, remove redundant code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9098

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D32027219

Pulled By: zhichao-cao

fbshipit-source-id: 253aae926c87726268af6c027bf805dc9156c8a8
2021-11-08 15:49:17 -08:00
anand76
dddb791c18 Enable a few unit tests to use custom Env objects (#9087)
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087

Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.

Reviewed By: riversand963

Differential Revision: D32007222

Pulled By: anand1976

fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
2021-11-08 11:05:59 -08:00
Jay Zhuang
5aad38f262 Deflake DBBasicTestWithTimestampCompressionSettings.PutAndGetWithComp… (#9136)
Summary:
…action
```
db/db_with_timestamp_basic_test.cc:2643: Failure
db_->CompactFiles(compact_opt, handles_[cf], collector->GetFlushedFiles(), static_cast<int>(kNumTimestamps - i))
Invalid argument: A compaction must contain at least one file.
```
Able to be reproduced by run multiple test in parallel.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9136

Test Plan:
```
gtest-parallel ./db_with_timestamp_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampCompressionSettings.PutAndGetWithCompaction/12 -r 100 -w 100
```

Reviewed By: riversand963

Differential Revision: D32197734

Pulled By: jay-zhuang

fbshipit-source-id: aeb0d6e9b37312f577e203ca81bb7a0f14d4e7ce
2021-11-05 17:57:50 -07:00
Levi Tamasi
3fbddb1d27 Refactor and unify blob file saving and the logic that finds the oldest live blob file (#9122)
Summary:
The patch refactors and unifies the logic in `VersionBuilder::SaveBlobFilesTo`
and `VersionBuilder::GetMinOldestBlobFileNumber` by introducing a generic
helper that can "merge" the list of `BlobFileMetaData` in the base version with
the list of `MutableBlobFileMetaData` representing the updated state after
applying a sequence of `VersionEdit`s. This serves as groundwork for subsequent
changes that will enable us to determine whether a blob file is live after applying
a sequence of edits without calling `VersionBuilder::SaveTo`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9122

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D32151472

Pulled By: ltamasi

fbshipit-source-id: 11622b475866de823334b8bc21b0e99d913af97e
2021-11-05 16:50:52 -07:00
Yanqin Jin
5237b39d2e Fix assertion error during compaction with write-prepared txn enabled (#9105)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105

The user contract of SingleDelete is that: a SingleDelete can only be issued to
a key that exists and has NOT been updated. For example, application can insert
one key `key`, and uses a SingleDelete to delete it in the future. The `key`
cannot be updated or removed using Delete.
In reality, especially when write-prepared transaction is being used, things
can get tricky. For example, a prepared transaction already writes `key` to the
memtable after a successful Prepare(). Afterwards, should the transaction
rollback, it will insert a Delete into the memtable to cancel out the prior
Put. Consider the following sequence of operations.

```
// operation sequence 1
Begin txn
Put(key)
Prepare()
Flush()

Rollback txn
Flush()
```

There will be two SSTs resulting from above. One of the contains a PUT, while
the second one contains a Delete. It is also known that releasing a snapshot
can lead to an L0 containing only a SD for a particular key. Consider the
following operations following the above block.

```
// operation sequence 2
db->Put(key)
db->SingleDelete(key)
Flush()
```

The operation sequence 2 can result in an L0 with only the SD.

Should there be a snapshot for conflict checking created before operation
sequence 1, then an attempt to compact the db may hit the assertion failure
below, because ikey_.type is Delete (from a rollback).

```
else if (clear_and_output_next_key_) {
  assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex);
}
```

To fix the assertion failure, we can skip the SingleDelete if we detect an
earlier Delete in the same snapshot interval.

Reviewed By: ltamasi

Differential Revision: D32056848

fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748
2021-11-05 15:29:18 -07:00
Yanqin Jin
9b53f14a35 Fixed a bug in CompactionIterator when write-preared transaction is used (#9060)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060

RocksDB bottommost level compaction may zero out an internal key's sequence if
the key's sequence is in the earliest_snapshot.
In write-prepared transaction, checking the visibility of a certain sequence in
a specific released snapshot may return a "snapshot released" result.
Therefore, it is possible, after a certain sequence of events, a PUT has its
sequence zeroed out, but a subsequent SingleDelete of the same key will still
be output with its original sequence. This violates the ascending order of
keys and leads to incorrect result.

The solution is to use an extra variable `last_key_seq_zeroed_` to track the
information about visibility in earliest snapshot. With this variable, we can
know for sure that a SingleDelete is in the earliest snapshot even if the said
snapshot is released during compaction before processing the SD.

Reviewed By: ltamasi

Differential Revision: D31813016

fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1
2021-11-03 15:55:00 -07:00
Jay Zhuang
29102641dd Skip directory fsync for filesystem btrfs (#8903)
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
   new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
   synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
   to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
   `IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
   forced, the same as above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903

Test Plan: run tests on btrfs and non btrfs

Reviewed By: ajkr

Differential Revision: D30885059

Pulled By: jay-zhuang

fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
2021-11-03 12:21:27 -07:00
Levi Tamasi
081722780b Refactor the detailed consistency checks and the SST saving logic in VersionBuilder (#9099)
Summary:
The patch refactors the parts of `VersionBuilder` that deal with SST file
comparisons. Specifically, it makes the following changes:
* Turns `NewestFirstBySeqNo` and `BySmallestKey` from free-standing
functions into function objects. Note: `BySmallestKey` has a pointer to the
`InternalKeyComparator`, while `NewestFirstBySeqNo` is completely
stateless.
* Eliminates the wrapper `FileComparator`, which was essentially an
unnecessary DIY virtual function call mechanism.
* Refactors `CheckConsistencyDetails` and `SaveSSTFilesTo` using helper
function templates that take comparator/checker function objects. Using
static polymorphism eliminates the need to make runtime decisions about
which comparator to use.
* Extends some error messages returned by the consistency checks and
makes them more uniform.
* Removes some incomplete/redundant consistency checks from `VersionBuilder`
and `FilePicker`.
* Improves const correctness in several places.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9099

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D32027503

Pulled By: ltamasi

fbshipit-source-id: 621326ae41f4f55f7ad6a91abbd6e666d5c7857c
2021-11-03 11:52:47 -07:00
Peter Dillinger
2b60621f16 Don't call OnTableFileCreated with OK for empty+deleted file (#9118)
Summary:
EventListener::OnTableFileCreated was previously called with OK
status and file_size==0 in cases of no SST file contents written
(because there was no content to add) and the empty file deleted before
calling the listener. This could lead to a stress test assertion failure
added in https://github.com/facebook/rocksdb/issues/9054.

This changes the status to Aborted, to align with the API doc:
"... if the file is successfully created. Now it will also be called on
failure case. User can check info.status to see if it succeeded or not."
For internal purposes, this case is considered "success" but for
listener purposes, no SST file is (successfully) created.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9118

Test Plan: test case added + existing db_stress

Reviewed By: ajkr, riversand963

Differential Revision: D32120232

Pulled By: pdillinger

fbshipit-source-id: a804e2e0a52598018d3b182da97804d402ffcdfa
2021-11-03 08:43:27 -07:00
Peter Dillinger
21f8a57f2a Fix TSAN report on MemPurge test (#9115)
Summary:
TSAN reported data race on count variables in MemPurgeBasic
test. This suggests the test could fail if mempurges were slow enough
that they don't complete before the count variables being checked, but
injecting a long sleep into MemPurge (outside DB mutex) confirms that
blocked writes ensure enough mempurges/flushes happen to make the test
pass. All the possible different values on testing should be OK to make
the test pass.

So this change makes the variables atomic so that up-to-date value is
always read and TSAN report suppressed. I have also used `.exchange(0)`
to make the checking less stateful by "popping off" all the accumulated
counts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9115

Test Plan: updated test, watch for any flakiness

Reviewed By: riversand963

Differential Revision: D32114432

Pulled By: pdillinger

fbshipit-source-id: c985609d39896a0d8f69ebc87b221e688609bdd8
2021-11-02 21:54:29 -07:00
mrambacher
f72c834eab Make FileSystem a Customizable Class (#8649)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8649

Reviewed By: zhichao-cao

Differential Revision: D32036059

Pulled By: mrambacher

fbshipit-source-id: 4f1e7557ecac52eb849b83ae02b8d7d232112295
2021-11-02 09:07:11 -07:00
sdong
a2b9be42b6 Try to start TTL earlier with kMinOverlappingRatio is used (#8749)
Summary:
Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles.

When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed.

In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749

Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.

Reviewed By: jay-zhuang

Differential Revision: D30735261

fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
2021-11-01 14:36:31 -07:00
leipeng
230c98f3ce fix histogram NUM_FILES_IN_SINGLE_COMPACTION (#9026)
Summary:
currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9026

Reviewed By: ajkr

Differential Revision: D31668241

Pulled By: jay-zhuang

fbshipit-source-id: c02f6c4a5df9fbf0b7510036594811152e8738af
2021-11-01 12:57:27 -07:00
Levi Tamasi
b1c27a52d2 Add a consistency check that prevents the overflow of garbage in blob files (#9100)
Summary:
The number or total size of garbage blobs in any given blob file can
never exceed the number or total size of all blobs in the file. (This
would be a similar error to e.g. attempting to remove from the LSM tree
an SST file that has already been removed.) The patch builds on
https://github.com/facebook/rocksdb/issues/9085 and adds a
consistency check to `VersionBuilder` that prevents the above from
happening.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9100

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D32048982

Pulled By: ltamasi

fbshipit-source-id: 6f7e0793bf534ad04c3359cc0f696b8e4e5ef81c
2021-11-01 12:32:14 -07:00
leipeng
2b70224f82 remove bad extra RecordTick(stats_, WRITE_WITH_WAL) (#9064)
Summary:
This PR fix wrong ticker `WRITE_WITH_WAL`.

`RecordTick(WRITE_WITH_WAL)` will be called later in `WriteToWAL` and `ConcurrentWriteToWAL`.

Fixes:
1. Delete these two extra `RecordTick(WRITE_WITH_WAL)`
2. Fix corresponding test case

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9064

Reviewed By: ajkr

Differential Revision: D31944459

Pulled By: riversand963

fbshipit-source-id: f1aa8d2a4320456bc357bc5b0902032f7dcad086
2021-11-01 11:43:14 -07:00
leipeng
01bd86ad35 InternalStats::DumpCFMapStat: fix sum.w_amp (#9065)
Summary:
sum `w_amp` will be a very large number`(bytes_written + bytes_written_blob)` when there is no any flush and ingest.

This PR set sum `w_amp` to zero if there is no any flush and ingest, this is conform to per-level `w_amp` computation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9065

Reviewed By: ajkr

Differential Revision: D31943994

Pulled By: riversand963

fbshipit-source-id: acbef5e331debebfad09e0e0d8d0885ebbc00609
2021-10-31 23:11:43 -07:00
Yanqin Jin
8e59a1dc9a Attempt to deflake ListenerTest.MultiCF (#9084)
Summary:
EventListenerTest.MultiCF uses TestFlushListener which has members
flushed_dbs_ and flushed_column_family_names_ that are not protected by
locks. This implicitly indicates that we need to ensure the methods
accessing these data structures in a single threaded way. In other
tests, e.g. MultiDBMultiListeners, we use TEST_WaitForFlushMemtable() to
wait until all memtables of a given column family are flushed, hence no
pending flush threads will concurrently call OnFlushCompleted() and
cause data race for flushed_dbs_. To fix a test failure, we should do
the same for MultiCF.

Example data race stack traces reported by TSAN
```
Read of size 8 at 0x7b6000002840 by main thread:
    #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:655:40
    https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:380:7

Previous write of size 8 at 0x7b6000002840 by thread T2:
    #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_emplace_back_aux<rocksdb::DB* const&>(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/vector.tcc:442:26
    https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:923:4
    https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9084

Test Plan: ./listener_test --gtest_filter=EventListenerTest.MultiCF

Reviewed By: jay-zhuang

Differential Revision: D31952259

Pulled By: riversand963

fbshipit-source-id: 94a7f29e4e9466ead42418944eb2247fc32bd499
2021-10-31 22:12:15 -07:00
Yanqin Jin
8f4f302316 Attempt to deflake DBFlushTest.FireOnFlushCompletedAfterCommittedResult (#9083)
Summary:
DBFlushTest.FireOnFlushCompletedAfterCommittedResult uses test sync
points to coordinate interleaving of different threads. Before this PR,
the test writes some data to memtable, triggers a manual flush, and
triggers a second manual flush after a first bg flush thread starts
executing. Though unlikely, it is possible for the second bg flush
thread to run faster than the first bg flush thread and deques flush
queue first. In this case, the original test will fail.
The fix is to wait until the first bg flush thread deques the flush
queue before triggering second manual flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9083

Test Plan: ./db_flush_test --gtest_filter=DBFlushTest.FireOnFlushCompletedAfterCommittedResult

Reviewed By: jay-zhuang

Differential Revision: D31951239

Pulled By: riversand963

fbshipit-source-id: f32d7cdabe6ad6808fd18e54e663936dc0a9edb4
2021-10-31 22:08:48 -07:00
Levi Tamasi
44d04582cb Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085)
Summary:
The current VersionBuilder code on mainline keeps track of blob file related
changes ("delta") induced by a series of `VersionEdit`s in the form of
`BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta`
contains the amount of additional garbage generated by compactions, as well
as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves,
since in that case the newly linked and unlinked SSTs cancel each other out.
However, this representation does not allow us to easily tell whether a certain
blob file is obsolete after applying a set of `VersionEdit`s or not. In order to
solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition
to the delta, also contains the materialized state after applying a set of version edits
(i.e. the total amount of garbage and the resulting set of linked SSTs). This will
enable us to add further consistency checks and to improve certain pieces of
functionality where knowing up front which blob files get obsoleted is beneficial.
(Note: this patch is just the refactoring part; I plan to create separate PRs for
the enhancements.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085

Test Plan: Ran `make check` and the stress tests in BlobDB mode.

Reviewed By: riversand963

Differential Revision: D31980867

Pulled By: ltamasi

fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
2021-10-29 17:47:02 -07:00
Yanqin Jin
fdf2a0d7eb Fix a compaction bug for write-prepared txn (#9061)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061

In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:

```
earliest_snap < earliest_write_conflict_snap
```

If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.

Reviewed By: ltamasi

Differential Revision: D31813017

fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
2021-10-29 15:23:17 -07:00
Peter Dillinger
a7d4bea43a Implement XXH3 block checksum type (#9069)
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.

Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument

This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069

Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.

DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.

Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.

 ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)

As you can see, especially once warmed up, xxh3 is fastest.

 ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

Test

    for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done

Results (ops/sec)

    for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done

results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3

Reviewed By: mrambacher

Differential Revision: D31905249

Pulled By: pdillinger

fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
2021-10-28 22:15:17 -07:00
Andrew Kryczka
f24c39ab3d Prevent corruption with parallel manual compactions and change_level == true (#9077)
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.

Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered  (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.

The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.

Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).

The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.

Thanks to siying for finding the bug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077

Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.

Reviewed By: siying

Differential Revision: D31921908

Pulled By: ajkr

fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
2021-10-27 23:08:56 -07:00
Jonathan Albrecht
e970248602 Add support for building on s390x platform (#8962)
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.

There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with?

1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image

Not sure if there is more required for travis. Happy to help in any way I can.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962

Reviewed By: mrambacher

Differential Revision: D31802198

Pulled By: pdillinger

fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
2021-10-22 10:13:15 -07:00
Yanqin Jin
f72fd58565 Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.

When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.

After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D31639225

Pulled By: riversand963

fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-20 21:34:47 -07:00
leipeng
0a73ada7b5 remove unused local obj and simpilify comple code (#9052)
Summary:
This PR does not change code sematics, it just changes for:

1. local obj `nonmem_w` and `lfile` are unused
2. null check for `delete ptr` is unnecessary
3. use `unique_ptr::reset` instead of `release` + `delete`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9052

Reviewed By: zhichao-cao

Differential Revision: D31801661

Pulled By: anand1976

fbshipit-source-id: 16a77d45da8c8833bf5bf3bce546bb3711b335df
2021-10-20 14:08:05 -07:00
leipeng
0c53b41856 db_impl_write.cc: use stats_ instead of immutable_db_options_.stats (#9053)
Summary:
This PR has no semantic changes, just to make code shorter.

`stats_` has value same with `immutable_db_options_.stats`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9053

Reviewed By: zhichao-cao

Differential Revision: D31801603

Pulled By: anand1976

fbshipit-source-id: cbd8fe478d3e90ae078ace49b4f2eb9bb028ccf6
2021-10-20 14:04:59 -07:00
Andrew Kryczka
4217d1bce7 Support GetMapProperty() with "rocksdb.dbstats" (#9057)
Summary:
This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057

Test Plan: new unit test

Reviewed By: zhichao-cao

Differential Revision: D31781495

Pulled By: ajkr

fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b
2021-10-20 13:17:00 -07:00
sdong
c66b4429ff Incremental Space Amp Compactions in Universal Style (#8655)
Summary:
This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty.

In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control.

Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655

Test Plan: Add unit tests to the functionality.

Reviewed By: ajkr

Differential Revision: D31787034

fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 10:04:13 -07:00
sdong
f053851af6 Ignore non-overlapping levels when determinig grandparent files (#9051)
Summary:
Right now, when picking a compaction, grand parent files are from output_level + 1. This usually works, but if the level doesn't have any overlapping file, it will be more efficient to go further down. This is because the files are likely to be trivial moved further and might create a violation of max_compaction_bytes. This situation can naturally happen and might happen even more with TTL compactions. There is no harm to fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9051

Test Plan: Run existing tests and see it passes. Also briefly run crash test.

Reviewed By: ajkr

Differential Revision: D31748829

fbshipit-source-id: 52b99ab4284dc816d22f34406d528a3c98ff6719
2021-10-19 12:48:18 -07:00
Peter Dillinger
ad5325a736 Experimental support for SST unique IDs (#8990)
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).

Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990

Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D31582865

Pulled By: pdillinger

fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
2021-10-18 23:32:01 -07:00
Giuseppe Ottaviano
f0841d4faf Fix out-of-bounds access in MultiDBParallelOpenTest (#9046)
Summary:
`dbs` should not be cleared, as it is reused later when reopening the DBs, so we have an out-of-bounds access with `dbnames[dbnum]`. The values left in the vector don't need to be reset, as the db pointer is an out parameter for `DB::Open`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9046

Reviewed By: pdillinger

Differential Revision: D31738263

Pulled By: ot

fbshipit-source-id: c619e947b8d3dbc3d896f29971f093d3e3c794d3
2021-10-18 21:25:45 -07:00
Jay Zhuang
314de7e7de Make DB::Close() thread-safe (#8970)
Summary:
If `DB::Close()` is called in multi-thread env, the resource
could be double released, which causes exception or assert.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970

Test Plan:
Test with multi-thread benchmark, with each thread try to
close the DB at the end.

Reviewed By: pdillinger

Differential Revision: D31242042

Pulled By: jay-zhuang

fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb
2021-10-18 20:32:35 -07:00
Jay Zhuang
53a0ab2bea Deflaky ObsoleteFilesTest (#9049)
Summary:
WaitForFlushMemTable() may only wait for mem flush but not background flush
finishing. The the obsoleted file may not be purged yet.
fcaa7ff638/db/db_impl/db_impl_compaction_flush.cc (L2200-L2203)

Use WaitForCompact() instead to wait for background flush job.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9049

Test Plan: `gtest-parallel ./obsolete_files_test --gtest_filter=ObsoleteFilesTest.DeleteObsoleteOptionsFile -r 1000`

Reviewed By: zhichao-cao

Differential Revision: D31737343

Pulled By: jay-zhuang

fbshipit-source-id: 82276ebeae7c7c75a733d3e1fd1c130d45e4761f
2021-10-18 15:15:23 -07:00
Peter Dillinger
3ffb3baa0b Add (Live)FileStorageInfo API (#8968)
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.

This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.

Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.

Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)

Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968

Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).

Reviewed By: siying

Differential Revision: D31242045

Pulled By: pdillinger

fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
2021-10-16 10:04:32 -07:00
Giuseppe Ottaviano
4bfd415e34 Fix sequence number bump logic in multi-CF SST ingestion (#9005)
Summary:
The code in `IngestExternalFiles()` that bumps the DB's sequence number
depending on what seqnos were assigned to the files has 3 bugs:

1) There is an assertion that the sequence number is increased in all the
affected column families, but this is unnecessary, it is fine if some files can
stick to a lower sequence number. It is very easy to hit the assertion: it is
sufficient to insert 2 files in 2 CFs, one which overlaps the CF and one that
doesn't (for example the CF is empty). The line added in the
`IngestFilesIntoMultipleColumnFamilies_Success` test makes the assertion fail.

2) SetLastSequence() is called with the sum of all the bumps across CFs, but we
should take the maximum instead, as all CFs start with the current seqno and bump
it independently.

3) The code above is accidentally under a `#ifndef NDEBUG`, so it doesn't run in
optimized builds, so some files may be assigned seqnos from the future.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9005

Test Plan:
Added line in `IngestFilesIntoMultipleColumnFamilies_Success` that
triggers the assertion, verified that the test (and all the others) pass after
the fix.

Reviewed By: ajkr

Differential Revision: D31597892

Pulled By: ot

fbshipit-source-id: c2d3237f90290df1178736ace8653a9623f5a770
2021-10-12 20:39:52 -07:00
Levi Tamasi
7cc52cd8f5 Update HISTORY for PR 8994 (#9017)
Summary:
Also, expand on/clarify a comment in `VersionStorageInfoTest`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9017

Reviewed By: riversand963

Differential Revision: D31566130

Pulled By: ltamasi

fbshipit-source-id: 1d30c7af084c4de7b2030bc6c768838d65746010
2021-10-12 10:19:56 -07:00
Giuseppe Ottaviano
22d4dc5066 Fix race in WriteBufferManager (#9009)
Summary:
EndWriteStall has a data race: `queue_.empty()` is checked outside of the
mutex, so once we enter the critical section another thread may already have
cleared the list, and accessing the `front()` is undefined behavior (and causes
interesting crashes under high concurrency).

This PR fixes the bug, and also rewrites the logic to make it easier to reason
about it. It also fixes another subtle bug: if some writers are stalled and
`SetBufferSize(0)` is called, which disables the WBM, the writer are not
unblocked because of an early `enabled()` check in `EndWriteStall()`.

It doesn't significantly change the locking behavior, as before writers won't
lock unless entering a stall condition, and `FreeMem` almost always locks if
stalling is allowed, but that is inevitable with the current design. Liveness is
guaranteed by the fact that if some writes are blocked, eventually all writes
will be blocked due to `stall_active_`, and eventually all memory is freed.

While at it, do a couple of optimizations:

- In `WBMStallInterface::Signal()` signal the CV only after releasing the
  lock. Signaling under the lock is a common pitfall, as it causes the woken-up
  thread to immediately go back to sleep because the mutex is still locked by
  the awaker.

- Move all allocations and deallocations outside of the lock.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9009

Test Plan:
```
USE_CLANG=1 make -j64 all check
```

Reviewed By: akankshamahajan15

Differential Revision: D31550668

Pulled By: ot

fbshipit-source-id: 5125387c3dc7ecaaa2b8bbc736e58c4156698580
2021-10-12 00:16:21 -07:00
Levi Tamasi
3e1bf771a3 Make it possible to force the garbage collection of the oldest blob files (#8994)
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.

In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).

These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)

This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994

Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.

Reviewed By: riversand963

Differential Revision: D31489850

Pulled By: ltamasi

fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
2021-10-11 18:03:01 -07:00
Andrew Kryczka
a282eff3d1 Protect existing files in FaultInjectionTest{Env,FS}::ReopenWritableFile() (#8995)
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.

The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995

Test Plan:
- Verified it fixes the following failure:

```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```

- `make check -j48`

Reviewed By: ltamasi

Differential Revision: D31495388

Pulled By: ajkr

fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
2021-10-11 16:23:18 -07:00
Zhichao Cao
bcd049cd2d Ingest external SST files with Temperature hints (#8949)
Summary:
Add the file temperature to `IngestExternalFileArg` such that when SST files are ingested, user is able to assign the temperature to each SST file. If the temperature vector is empty or its size does not match the file name vector size, all ingested SST files will be assigned with `Temperature::unKnown`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8949

Test Plan: add the new test and make check

Reviewed By: siying

Differential Revision: D31127852

Pulled By: zhichao-cao

fbshipit-source-id: 141a81f0f7b473d88f4ab0cb2a21a114cbc6f83c
2021-10-08 10:32:24 -07:00
Andrew Kryczka
fcaa7ff638 Cancel manual compactions waiting on automatic compactions to drain (#8991)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8991

Test Plan: the new test hangs forever without this fix and passes with this fix.

Reviewed By: hx235

Differential Revision: D31456419

Pulled By: ajkr

fbshipit-source-id: a82c0e5560b6e6153089dccd8e46163c61b07bff
2021-10-07 15:23:55 -07:00
Kajetan Janiak
8717c26823 Warning about incompatible options with level_compaction_dynamic_level_bytes (#8329)
Summary:
This change introduces warnings instead of a silent override when trying to use level_compaction_dynamic_level_bytes with multiple cf_paths/db_paths.
I have completed the CLA.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8329

Reviewed By: hx235

Differential Revision: D31399713

Pulled By: ajkr

fbshipit-source-id: 29c6fe5258d1f739b4590ecd44aee44f55415595
2021-10-07 15:23:55 -07:00
Zhichao Cao
b632ed0c67 Add file temperature related counter and bytes stats to and io_stats (#8710)
Summary:
For tiered storage project, we need to know the block read count and read bytes of files with different temperature. Add FileIOByTemperature to IOStatsContext and collect the bytes read and read count from different temperature files through the RandomAccessFileReader.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8710

Test Plan: make check, add the testing cases

Reviewed By: siying

Differential Revision: D30582400

Pulled By: zhichao-cao

fbshipit-source-id: d83173de594374fc8404af5ce93a6a9be72c7141
2021-10-07 14:58:41 -07:00
Ramkumar Vadivelu
fe994bbd0b Misc doc fixes (#8983)
Summary:
- Update few stale GitHub wiki link references from rocksdb.org
- Update the API comments for ignore_range_deletions

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8983

Reviewed By: ajkr

Differential Revision: D31355965

Pulled By: ramvadiv

fbshipit-source-id: 245ac4a6913976dd82afa308bc4aae6bff3d788c
2021-10-07 11:22:17 -07:00
mrambacher
53e595d1f3 Cleanup multiple implementations of VectorIterator (#8901)
Summary:
There were three implementations of VectorIterator (util/vector_iterator, test_util/testutil.h and LoggingForwardVectorIterator).  Merged them into one class to increase code coverage/testing and reduce duplication.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8901

Reviewed By: pdillinger

Differential Revision: D31022673

Pulled By: mrambacher

fbshipit-source-id: 8e3acbd2dfd60b4df609d02cc72846de2389d531
2021-10-06 07:48:31 -07:00
Stefan Roesch
a776406de3 Add file operation callbacks to SequentialFileReader (#8982)
Summary:
This change adds File IO Notifications to the SequentialFileReader The SequentialFileReader is extended
with a listener parameter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8982

Test Plan:
A new test EventListenerTest::OnWALOperationTest has been added. The
test verifies that during restore the sequential file reader is called
and the notifications are fired.

Reviewed By: riversand963

Differential Revision: D31320844

Pulled By: shrfb

fbshipit-source-id: 040b24da7c010d7c14ebb5c6460fae9a19b8c168
2021-10-05 10:51:59 -07:00
mrambacher
787229837e Fix LITE mode builds on MacOs (#8981)
Summary:
On MacOS, there were errors building in LITE mode related to unused private member variables:

In file included from ./db/compaction/compaction_job.h:20:
./db/blob/blob_file_completion_callback.h:87:19: error: private field ‘sst_file_manager_’ is not used [-Werror,-Wunused-private-field]
  SstFileManager* sst_file_manager_;
                  ^
./db/blob/blob_file_completion_callback.h:88:22: error: private field ‘mutex_’ is not used [-Werror,-Wunused-private-field]
  InstrumentedMutex* mutex_;
                     ^
./db/blob/blob_file_completion_callback.h:89:17: error: private field ‘error_handler_’ is not used [-Werror,-Wunused-private-field]
  ErrorHandler* error_handler_;

This PR resolves those build issues by removing the values as members in LITE mode and fixing the constructor to ignore the input values in LITE mode (otherwise we get unused parameter warnings).

Tested by validating compiles without warnings.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8981

Reviewed By: akankshamahajan15

Differential Revision: D31320141

Pulled By: mrambacher

fbshipit-source-id: d67875ebbd39a9555e4f09b2d37159566dd8a085
2021-10-04 05:30:26 -07:00
Yanqin Jin
2cdaf5ca5b Add additional checks for three existing unit tests (#8973)
Summary:
With test sync points, we can assert on the equality of iterator value in three existing
unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8973

Test Plan:
```
gtest-parallel -r 1000 ./db_test2 --gtest_filter=DBTest2.IterRaceFlush2:DBTest2.IterRaceFlush1:DBTest2.IterRefreshRaceFlush
```

make check

Reviewed By: akankshamahajan15

Differential Revision: D31256340

Pulled By: riversand963

fbshipit-source-id: a9440767ab383e0ec61bd43ffa8fbec4ba562ea2
2021-10-01 17:22:37 -07:00
anand76
532ff334d9 Don't ignore deletion rate limit if WAL dir is different (#8967)
Summary:
If WAL dir is different from the DB dir, we should still honor the SstFileManager deletion rate limit for SST files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8967

Test Plan: Add a new unit test in db_sst_test

Reviewed By: pdillinger

Differential Revision: D31220116

Pulled By: anand1976

fbshipit-source-id: bcde8a53a7d728e15e597fb5d07ee86c1b38bd28
2021-09-30 13:26:31 -07:00
Yanqin Jin
2acffecca1 Add comments for MultiGetBlob() and checks for MultiRead() (#8972)
Summary:
Add comments for MultiGetBlob() that input argument `offsets` must be
sorted. In addition, add assertion for this condition in debug build.
Repeat the same for RandomAccessFileReader::MultiRead().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8972

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D31253205

Pulled By: riversand963

fbshipit-source-id: 98758229b8052f3aeb319d5584026b4de2d220a2
2021-09-29 14:27:19 -07:00
mrambacher
13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
Jay Zhuang
6b34eb0ebc Add remote compaction read/write bytes statistics (#8939)
Summary:
Add basic read/write bytes statistics on the primary side:
`REMOTE_COMPACT_READ_BYTES`
`REMOTE_COMPACT_WRITE_BYTES`

Fixed existing statistics missing some IO for remote compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8939

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D31074672

Pulled By: jay-zhuang

fbshipit-source-id: c57afdba369990185008ffaec7e3fe7c62e8902f
2021-09-28 14:00:37 -07:00
Hui Xiao
d6bd1a0291 Support "level_at_creation" in TablePropertiesCollectorFactory::Context (#8919)
Summary:
Context:
Exposing the level of the sst file (i.e, table) where it is created in `TablePropertiesCollectorFactory::Context` allows users of `TablePropertiesCollectorFactory` to customize some implementation details of `TablePropertiesCollectorFactory` and `TablePropertiesCollector` based on the level of creation. For example, `TablePropertiesCollector::NeedCompact()` can return different values based on level of creation.
- Declared an extra field `level_at_creation` in `TablePropertiesCollectorFactory::Context`
- Allowed `level_at_creation` to be passed in as an argument in `IntTblPropCollectorFactory::CreateIntTblPropCollector()` and `UserKeyTablePropertiesCollectorFactory::CreateIntTblPropCollector()`, the latter of which is an internal wrapper of user's passed-in `TablePropertiesCollectorFactory::CreateTablePropertiesCollector()` used in table-building process
- Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` passed into both `BlockBasedTableBuilder` and `PlainTableBuilder`
  -  `PlainTableBuilder` previously did not capture `level_at_creation` from `TableBuilderOptions` in `PlainTableFactory`. In order for it to call the method with this parameter, this PR also made `PlainTableBuilder` capture `level_at_creation` as a required parameter
- Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` its overridden functions in its derived classes, including `RegularKeysStartWithAFactory::CreateIntTblPropCollector()` in `table_properties_collector_test.cc`, `SstFileWriterPropertiesCollectorFactory::CreateIntTblPropCollector()` in `sst_file_writer_collectors.h`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8919

Test Plan:
- Passed the added assertion for `context.level_at_creation`
- Passed existing tests
- Run `Make` to make sure adding a required parameter to `PlainTableBuilder`'s constructor does not break anything

Reviewed By: anand1976

Differential Revision: D30951729

Pulled By: hx235

fbshipit-source-id: c4a0173b0d9344a4cf47e1b987d759c1c73cb474
2021-09-28 12:35:24 -07:00
mrambacher
7fd68b7c39 Make WalFilter, SstPartitionerFactory, FileChecksumGenFactory, and TableProperties Customizable (#8638)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8638

Reviewed By: zhichao-cao

Differential Revision: D31024729

Pulled By: mrambacher

fbshipit-source-id: 954c04ccab0b8dee64050a27aadf78ed119106c0
2021-09-28 05:32:02 -07:00
Akanksha Mahajan
78afb4d81e Support SingleDelete for user-defined timestamps (#8921)
Summary:
Added support for SingleDelete for user-defined timestamps. Users can now Get and Iterate over keys deleted with SingleDelete. It also includes changes in CompactionIterator which  preserves the same user key with different timestamps, unless the timestamp is below a certain threshold full_history_ts_low.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8921

Test Plan: Added new unit tests

Reviewed By: riversand963

Differential Revision: D31098191

Pulled By: akankshamahajan15

fbshipit-source-id: 78a59ef4b4884ae324fcd10f56e62a27d5ee2f49
2021-09-27 11:51:07 -07:00
Peter Dillinger
0774d640c0 Fix some lint warnings reported on 6.25 (#8945)
Summary:
Fix some lint warnings

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8945

Test Plan: existing tests, linters

Reviewed By: zhichao-cao

Differential Revision: D31103824

Pulled By: pdillinger

fbshipit-source-id: 4dd9b0c30fa50e588107ac6ed392b2dfb507a5d4
2021-09-27 11:43:20 -07:00
mrambacher
e0f697d2bd Make SliceTransform into a Customizable class (#8641)
Summary:
Made SliceTransform into a Customizable class.

Would be nice to write a test that stored and used a custom transform  in an SST table.

There are a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest that run the same with or without a SliceTransform/PrefixFilter.  Is this expected?

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641

Reviewed By: zhichao-cao

Differential Revision: D31142793

Pulled By: mrambacher

fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1
2021-09-27 07:43:47 -07:00
Yanqin Jin
b92cef2d1d Sort per-file blob read requests by offset (#8953)
Summary:
`RandomAccessFileReader::MultiRead()` tries to merge requests in direct IO, assuming input IO requests are
sorted by offsets.

Add a test in direct IO mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8953

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D31183546

Pulled By: riversand963

fbshipit-source-id: 5d043ec68e2daa47a3149066150afd41ee3d73e6
2021-09-24 22:14:30 -07:00
Hui Xiao
58444eadda Make RateLimiter::GetTotalPendingRequest() non pure virtual for backward compability (#8938)
Summary:
Context/Summary:
https://github.com/facebook/rocksdb/pull/8890 added a public API `RateLimiter::GetTotalPendingRequest()` but mistakenly marked it as pure virtual, forcing RateLimiter's derived classes to implement this function and breaking backward compatibility.

This PR makes `RateLimiter::GetTotalPendingRequest()` as non-pure virtual method by providing a trivial implementation in rate_limiter.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8938

Test Plan: Passing existing tests

Reviewed By: pdillinger

Differential Revision: D31100661

Pulled By: hx235

fbshipit-source-id: 06eff1005156a6e5a881e393b2c5b2ad706897d8
2021-09-21 21:29:26 -07:00
mrambacher
6924869867 Make SystemClock into a Customizable Class (#8636)
Summary:
Made SystemClock into a Customizable class, complete with CreateFromString.

Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv).

Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636

Reviewed By: zhichao-cao

Differential Revision: D30483360

Pulled By: mrambacher

fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1
2021-09-21 09:23:48 -07:00
Jay Zhuang
1c290c785d RemoteCompaction support Fallback to local compaction (#8709)
Summary:
Add support for fallback to local compaction, the user can
return `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to
run the compaction locally instead of waiting for the remote compaction
result.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8709

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30560163

Pulled By: jay-zhuang

fbshipit-source-id: 65d8905a4a1bc185a68daa120997f21d3198dbe1
2021-09-18 00:25:04 -07:00
Yanqin Jin
b512f4bc76 Batch blob read IO for MultiGet (#8699)
Summary:
In batched `MultiGet()`, RocksDB batches blob read IO and uses `RandomAccessFileReader::MultiRead()`
to read the blobs instead of issuing multiple `Read()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8699

Test Plan:
```
make check
```

Reviewed By: ltamasi

Differential Revision: D31030861

Pulled By: riversand963

fbshipit-source-id: a0df6060cbfd54cff9515a4eee08807b1dbcb0c8
2021-09-17 19:23:13 -07:00
Akanksha Mahajan
d6aa8c49f8 Expose blob file information through the EventListener interface (#8675)
Summary:
1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files.
 2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
 3. Test OnFile*Finish operations notifications with Blob Files.
 4. Log the blob file creation/deletion events through EventLogger in Log file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675

Test Plan: Add new unit tests in listener_test

Reviewed By: ltamasi

Differential Revision: D30412613

Pulled By: akankshamahajan15

fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07
2021-09-16 17:23:36 -07:00
Jay Zhuang
b97c53b629 Add compaction priority information in RemoteCompaction (#8707)
Summary:
Add compaction priority information in RemoteCompaction, which
can be used to schedule high priority job first.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8707

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30548401

Pulled By: jay-zhuang

fbshipit-source-id: b30446511fb31b4583c49edd8565d496cf013a34
2021-09-16 15:09:35 -07:00
Peter Dillinger
f4a1d10668 Fix flaky WALTrashCleanupOnOpen (#8917)
Summary:
Test did not consider that slower deletion rate only kicks in
after a file is deleted

Fixes https://github.com/facebook/rocksdb/issues/7546

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8917

Test Plan:
no longer reproduces using

    buck test mode/dev //internal_repo_rocksdb/repo:db_sst_test -- --exact 'internal_repo_rocksdb/repo:db_sst_test - DBWALTestWithParam/DBWALTestWithParam.WALTrashCleanupOnOpen/0' --jobs 40 --stress-runs 600 --record-results

Reviewed By: siying

Differential Revision: D30949127

Pulled By: pdillinger

fbshipit-source-id: 5d0607f8f548071b07410fe8f532b4618cd225e5
2021-09-15 21:31:20 -07:00
Peter Dillinger
2819c7840e Fix PrepopulateBlockCache::kFlushOnly (#8750)
Summary:
kFlushOnly currently means "always" except in the case of
remote compaction. This makes it flushes only.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8750

Test Plan: test updated

Reviewed By: akankshamahajan15

Differential Revision: D30968034

Pulled By: pdillinger

fbshipit-source-id: 5dbd24dde18852a0e937a540995fba9bfbe89037
2021-09-15 15:33:20 -07:00
sdong
12d798ac06 Always iniitalize ArenaWrappedDBIter::db_iter_ to nullptr (#8889)
Summary:
ArenaWrappedDBIter::db_iter_ should never be nullptr. However, when debugging a segfault, it's hard to distinguish it is not initialized (not possible) and other corruption. Add this nullptr to help distinguish the case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8889

Test Plan: Run existing unit tests.

Reviewed By: pdillinger

Differential Revision: D30814756

fbshipit-source-id: 4b1f36896a33dc203d4f1f424ded9554927d61ba
2021-09-14 14:33:15 -07:00
Andrew Kryczka
d648cb47b9 Adapt key-value checksum for timestamp-suffixed keys (#8914)
Summary:
After https://github.com/facebook/rocksdb/issues/8725, keys added to `WriteBatch` may be timestamp-suffixed, while `WriteBatch` has no awareness of the timestamp size. Therefore, `WriteBatch` can no longer calculate timestamp checksum separately from the rest of the key's checksum in all cases.

This PR changes the definition of key in KV checksum to include the timestamp suffix. That way we do not need to worry about where the timestamp begins within the key. I believe the only practical effect of this change is now `AssignTimestamp()` requires recomputing the whole key checksum (`UpdateK()`) rather than just the timestamp portion (`UpdateT()`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8914

Test Plan:
run stress command that used to fail

```
$ ./db_stress --batch_protection_bytes_per_key=8 -clear_column_family_one_in=0 -test_batches_snapshots=1
```

Reviewed By: riversand963

Differential Revision: D30925715

Pulled By: ajkr

fbshipit-source-id: c143f7ccb46c0efb390ad57ef415c250d754deff
2021-09-14 13:14:39 -07:00
eharry
0b6be7eb68 Fix WAL log data corruption #8723 (#8746)
Summary:
Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together (https://github.com/facebook/rocksdb/issues/8723)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8746

Reviewed By: ajkr

Differential Revision: D30758468

Pulled By: riversand963

fbshipit-source-id: 07c20899d5f2447dc77861b4845efc68a59aa4e8
2021-09-13 20:15:59 -07:00
Peter Dillinger
7bef598440 Bypass unused parameterization in ExternalSSTFileBasicTest.IngestExte… (#8910)
Summary:
Facebook infrastructure doesn't like continuously skipping
tests, so fixing this permanently disabled parameterization to BYPASS
instead of SKIP. (Internal ref: T100525285)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8910

Test Plan: manual

Reviewed By: anand1976

Differential Revision: D30905169

Pulled By: pdillinger

fbshipit-source-id: e23d63d2aa800e54676269fad3a093cd3f9f222d
2021-09-13 12:18:15 -07:00
Levi Tamasi
306b779957 Use GetBlobFileSize instead of GetTotalBlobBytes in DB properties (#8902)
Summary:
The patch adjusts the definition of BlobDB's DB properties a bit by
switching to `GetBlobFileSize` from `GetTotalBlobBytes`. The
difference is that the value returned by `GetBlobFileSize` includes
the blob file header and footer as well, and thus matches the on-disk
size of blob files. In addition, the patch removes the `Version` number
from the `blob_stats` property, and updates/extends the unit tests a little.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8902

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D30859542

Pulled By: ltamasi

fbshipit-source-id: e3426d2d567bd1bd8c8636abdafaafa0743c854c
2021-09-13 10:47:16 -07:00
mrambacher
dafa584fd1 Change the File System File Wrappers to std::unique_ptr (#8618)
Summary:
This allows the wrapper classes to own the wrapped object and eliminates confusion as to ownership.  Previously, many classes implemented their own ownership solutions.  Fixes https://github.com/facebook/rocksdb/issues/8606

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8618

Reviewed By: pdillinger

Differential Revision: D30136064

Pulled By: mrambacher

fbshipit-source-id: d0bf471df8818dbb1770a86335fe98f761cca193
2021-09-13 08:46:19 -07:00
Yanqin Jin
2a2b3e03a5 Allow WriteBatch to have keys with different timestamp sizes (#8725)
Summary:
In the past, we unnecessarily requires all keys in the same write batch
to be from column families whose timestamps' formats are the same for
simplicity. Specifically, we cannot use the same write batch to write to
two column families, one of which enables timestamp while the other
disables it.

The limitation is due to the member `timestamp_size_` that used to exist
in each `WriteBatch` object. We pass a timestamp_size to the constructor
of `WriteBatch`. Therefore, users can simply use the old
`WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while
the internal implementation of `WriteBatch` will take care of memory
allocation for timestamps.

The above is not necessary.
One the one hand, users can set up a memory buffer to store user key and
then contiguously append the timestamp to the user key. Then the user
can pass this buffer to the `WriteBatch::Put(Slice&)` API.
On the other hand, users can set up a SliceParts object which is an
array of Slices and let the last Slice to point to the memory buffer
storing timestamp. Then the user can pass the SliceParts object to the
`WriteBatch::Put(SliceParts&)` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30654499

Pulled By: riversand963

fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2
2021-09-12 15:34:26 -07:00
Peter Dillinger
bda8d93ba9 Fix and detect headers with missing dependencies (#8893)
Summary:
It's always annoying to find a header does not include its own
dependencies and only works when included after other includes. This
change adds `make check-headers` which validates that each header can
be included at the top of a file. Some headers are excluded e.g. because
of platform or external dependencies.

rocksdb_namespace.h had to be re-worked slightly to enable checking for
failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.)

Fixes mostly involve adding and cleaning up #includes, but for
FileTraceWriter, a constructor was out-of-lined to make a forward
declaration sufficient.

This check is not currently run with `make check` but is added to
CircleCI build-linux-unity since that one is already relatively fast.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893

Test Plan: existing tests and resolving issues detected by new check

Reviewed By: mrambacher

Differential Revision: D30823300

Pulled By: pdillinger

fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572
2021-09-10 10:00:26 -07:00
mrambacher
dc0dc90cf5 Make Statistics a Customizable Class (#8637)
Summary:
Make the Statistics object into a Customizable object.  Statistics can now be stored and created to/from the Options file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8637

Reviewed By: zhichao-cao

Differential Revision: D30530550

Pulled By: mrambacher

fbshipit-source-id: 5fc7d01d8431f37b2c205bbbd8342c9f697023bd
2021-09-10 09:47:39 -07:00
anand76
eea566864e Support custom Env in db_sst_test and external_sst_file_basic_test (#8888)
Summary:
Support custom Env in these tests. Some custom Envs do not support reopening a file for write, either normal mode or Random RW mode. Added some additional checks in external_sst_file_basic_test to accommodate those Envs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8888

Reviewed By: riversand963

Differential Revision: D30824481

Pulled By: anand1976

fbshipit-source-id: c3ac7a628e6df29e94f42e370e679934a4f77eac
2021-09-08 21:21:49 -07:00
Zhiyi Zhang
0cb0fc6fd3 Add DB properties for BlobDB (#8734)
Summary:
RocksDB exposes certain internal statistics via the DB property interface.
However, there are currently no properties related to BlobDB.

For starters, we would like to add the following BlobDB properties:
`rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
`rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
`rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions.
`rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
`rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734

Test Plan:
```
➜  rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
[==========] Running 16 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 10 tests from DBBlobBasicTest
[ RUN      ] DBBlobBasicTest.GetBlob
[       OK ] DBBlobBasicTest.GetBlob (12 ms)
[ RUN      ] DBBlobBasicTest.MultiGetBlobs
[       OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
[ RUN      ] DBBlobBasicTest.GetBlob_CorruptIndex
[       OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
[ RUN      ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
[       OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
[ RUN      ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
[       OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
[ RUN      ] DBBlobBasicTest.GenerateIOTracing
[       OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
[ RUN      ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
[       OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
[ RUN      ] DBBlobBasicTest.GetMergeBlobWithPut
[       OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
[ RUN      ] DBBlobBasicTest.MultiGetMergeBlobWithPut
[       OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
[ RUN      ] DBBlobBasicTest.BlobDBProperties
[       OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
[----------] 10 tests from DBBlobBasicTest (124 ms total)

[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)

[----------] Global test environment tear-down
[==========] 16 tests from 2 test cases ran. (2190 ms total)
[  PASSED  ] 16 tests.
```

Reviewed By: ltamasi

Differential Revision: D30690849

Pulled By: Zhiyi-Zhang

fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 12:22:04 -07:00
mrambacher
beed86473a Make MemTableRepFactory into a Customizable class (#8419)
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API

New tests were added to validate the functionality and all existing tests pass.  db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419

Reviewed By: zhichao-cao

Differential Revision: D29558961

Pulled By: mrambacher

fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
2021-09-08 07:46:44 -07:00
Andrew Kryczka
941543721d Bytes read stat for VerifyChecksum() and VerifyFileChecksums() APIs (#8741)
Summary:
- Clarified some comments on compatibility for adding new ticker stats
- Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741

Test Plan: new unit test

Reviewed By: zhichao-cao

Differential Revision: D30708578

Pulled By: ajkr

fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449
2021-09-07 13:28:29 -07:00
Peter Dillinger
0ef88538c6 Improve support for using regexes (#8740)
Summary:
* Consolidate use of std::regex for testing to testharness.cc, to
minimize Facebook linters constantly flagging uses in non-production
code.
* Improve syntax and error messages for asserting some string matches a
regex in tests.
* Add a public Regex wrapper class to encapsulate existing usage in
ObjectRegistry.
* Remove unnecessary include <regex>
* Put warnings that use of Regex in production code could cause bad
performance or stack overflow.

Intended follow-up work:
* Replace std::regex with another underlying implementation like RE2
* Improve ObjectRegistry interface in terms of possibly confusing literal
string matching vs. regex and in terms of reporting invalid regex.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740

Test Plan:
tests updated, basic unit test for public Regex, and some manual
testing of temporary changes to see example error messages:

utilities/backupable/backupable_db_test.cc:917: Failure
000010_1162373755_138626.blob (child.name)
does not match regex
[0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern)

db/db_basic_test.cc:74: Failure
R3SHSBA8C4U0CIMV2ZB0 (sid3)
does not match regex [0-9A-Z]{20}HAHAHA

Reviewed By: mrambacher

Differential Revision: D30706246

Pulled By: pdillinger

fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd
2021-09-07 13:05:23 -07:00
Peter Dillinger
4750421ece Replace most typedef with using= (#8751)
Summary:
Old typedef syntax is confusing

Most but not all changes with

    perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
    make format

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751

Test Plan: existing

Reviewed By: zhichao-cao

Differential Revision: D30745277

Pulled By: pdillinger

fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
2021-09-07 11:31:59 -07:00
Levi Tamasi
55ef8972fc Support custom env in db_blob_{basic,compaction,corruption,index}_test (#8817)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8817

Test Plan: Ran `make check` and built/tested using internal custom environment.

Reviewed By: riversand963

Differential Revision: D30768215

Pulled By: ltamasi

fbshipit-source-id: cce96211d4c097612d20247f2e997358f40cc3d3
2021-09-07 11:13:56 -07:00
Akanksha Mahajan
e8a7001159 Update branch as "main" in tools/advisor/README.md (#8744)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8744

Reviewed By: ltamasi

Differential Revision: D30716145

Pulled By: akankshamahajan15

fbshipit-source-id: c2fcaf9ddcae85a86c0f10496acab28cd795ff12
2021-09-01 20:26:28 -07:00
Peter Dillinger
c9cd5d25a8 Remove some unneeded code (#8736)
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736

Test Plan: existing tests

Reviewed By: mrambacher

Differential Revision: D30700039

Pulled By: pdillinger

fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
2021-09-01 14:28:58 -07:00
Peter Dillinger
13ded69484 Built-in support for generating unique IDs, bug fix (#8708)
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.

This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.

Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
  * Lots of info from default Env like time, process id, and hostname.
  * std::random_device
  * port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.

DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.

GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.

Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708

Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D30563780

Pulled By: pdillinger

fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
2021-08-30 15:20:41 -07:00
Zaorang Yang
2bc914094d Refactor with VersionBuilder (#8706)
Summary:
Introduce a new function to save sst files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8706

Reviewed By: jay-zhuang

Differential Revision: D30544242

Pulled By: riversand963

fbshipit-source-id: 554755852daff7ae1c7864b0029f51b27099ee09
2021-08-27 12:15:08 -07:00
James Yin
7ddc096d7d Fix typo in the comment of log_empty_ (#8711)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8711

Reviewed By: riversand963

Differential Revision: D30566761

Pulled By: jay-zhuang

fbshipit-source-id: dd4690f5e2af2d263ed75ea1b9ed24692fe81362
2021-08-27 12:10:29 -07:00
anand76
ebaa3c8a59 Fix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714)
Summary:
DumpStats() iterates through the ColumnFamilySet. There is a potential
race condition because it does Ref the cfd, and the cfd could get
destroyed during the iteration.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30580199

Pulled By: anand1976

fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850
2021-08-26 15:40:26 -07:00
Jay Zhuang
4afa24f8ae Deflake test CompactionJobTest.InputSerialization (#8712)
Summary:
It's invalid to have an empty file name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8712

Test Plan:
```
$ gtest-parallel ./compaction_job_test --gtest_filter=CompactionJobTest.InputSerialization -r 10000
```

Reviewed By: pdillinger

Differential Revision: D30566739

Pulled By: jay-zhuang

fbshipit-source-id: 41e73175e3c95c4b73b4fdcd33470788d4e29d37
2021-08-26 09:27:37 -07:00
Yanqin Jin
f235f4b0a3 Fix a bug of secondary instance sequence going backward (#8653)
Summary:
Recent refactor of `ReactiveVersionSet::ReadAndApply()` uses
`ManifestTailer` whose `Iterate()` method can cause the db's
`last_sequence_` to go backward. Consequently, read requests can see
out-dated data. For example, latest changes to the primary will not be
seen on the secondary even after a `TryCatchUpWithPrimary()` if no new
write batches are read from the WALs and no new MANIFEST entries are
read from the MANIFEST.

Fix the bug so that `VersionEditHandler::CheckIterationResult` will
never decrease `last_sequence_`, `last_allocated_sequence_` and
`last_published_sequence_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8653

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30272084

Pulled By: riversand963

fbshipit-source-id: c6a49c534b2509b93ef62d8936ed0acd5b860eaa
2021-08-24 18:18:36 -07:00
Peter Dillinger
318fe6941a Add port::GetProcessID() (#8693)
Summary:
Useful in some places for object uniqueness across processes.
Currently used for generating a host-wide identifier of Cache objects
but expected to be used soon in some unique id generation code.

`int64_t` is chosen for return type because POSIX uses signed integer type,
usually `int`, for `pid_t` and Windows uses `DWORD`, which is `uint32_t`.

Future work: avoid copy-pasted declarations in port_*.h, perhaps with
port_common.h always included from port.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8693

Test Plan: manual for now

Reviewed By: ajkr, anand1976

Differential Revision: D30492876

Pulled By: pdillinger

fbshipit-source-id: 39fc2788623cc9f4787866bdb67a4d183dde7eef
2021-08-24 17:46:14 -07:00
Yanqin Jin
229350ef48 Allow iterate refresh for secondary instance (#8700)
Summary:
Test plan
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8700

Reviewed By: zhichao-cao

Differential Revision: D30523907

Pulled By: riversand963

fbshipit-source-id: 68928ab4dafb64ce80ab7bc69d83727a4713ab91
2021-08-24 15:40:56 -07:00
Andrew Kryczka
c521f22a1e Deflake write-prepared and write-unprepared tests (#8696)
Summary:
The `JobContext::job_snapshot` referenced DB state but could
have been deleted by a BG thread after the signal/unlock allowing
shutdown to proceed. Then we would see an error like this (valgrind):

```
==354104== Thread 2:
==354104== Invalid read of size 8
==354104==    at 0x694C4D: rocksdb::ManagedSnapshot::~ManagedSnapshot() (snapshot_impl.cc:20)
==354104==    by 0x58F5BA: operator() (unique_ptr.h:81)
==354104==    by 0x58F5BA: operator() (unique_ptr.h:75)
==354104==    by 0x58F5BA: ~unique_ptr (unique_ptr.h:292)
==354104==    by 0x58F5BA: rocksdb::JobContext::~JobContext() (job_context.h:221)
==354104==    by 0x5F155E: rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) (db_impl_compaction_flush.cc:2696)
==354104==    by 0x5F1BC2: rocksdb::DBImpl::BGWorkCompaction(void*) (db_impl_compaction_flush.cc:2468)
==354104==    by 0x83707A: operator() (std_function.h:688)
==354104==    by 0x83707A: rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) (threadpool_imp.cc:266)
==354104==    by 0x8373ED: rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) (threadpool_imp.cc:307)
==354104==    by 0x492A800: execute_native_thread_routine (in /usr/local/fbcode/platform009/lib/libstdc++.so.6.0.28)
==354104==    by 0x4A5020B: start_thread (in /usr/local/fbcode/platform009/lib/libpthread-2.30.so)
==354104==    by 0x4CF281E: clone (in /usr/local/fbcode/platform009/lib/libc-2.30.so)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8696

Test Plan: unable to repro

Reviewed By: pdillinger

Differential Revision: D30505277

Pulled By: ajkr

fbshipit-source-id: 5a99f34137cd14d06b0f624add6d37a70a61135d
2021-08-23 23:09:17 -07:00
Jay Zhuang
249b1078c9 Add extra information to RemoteCompaction APIs (#8680)
Summary:
Currently, we only provide job_id in RemoteCompaction APIs, the
main problem of `job_id` is it cannot uniquely identify a compaction job
between DB instances or between sessions.
Providing DB and session id to the user, which will make building cross
DB compaction service easier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30444859

Pulled By: jay-zhuang

fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448
2021-08-23 16:27:38 -07:00
Peter Dillinger
04db764831 Embed original file number in SST table properties (#8686)
Summary:
I very recently realized that with https://github.com/facebook/rocksdb/issues/8669 we cannot later add
file numbers to external SST files (so that more can share db session
ids for better uniqueness properties), because of forward compatibility.
We would have a version of RocksDB that assumes session IDs are unique
on external SST files and therefore can't really break that invariant in
future files.

This change adds a table property for "orig_file_number" which is
populated by normal SST files and also external SST files generated by
SstFileWriter. SstFileWriter now keeps a db_session_id for life of the
object and increments its own file numbers for embedding in table
properties. (They are arguably "fake" file numbers because these numbers
and not embedded in the file name.)

While updating block_based_table_builder, I removed several unnecessary
fields from Rep, because following the pattern would have created
another unnecessary field.

This change also updates block_based_table_reader to use this new
property when available, which means that for newer SST files, we can
determine the stable/original <db_session_id,file_number> unique
identifier using just the file contents, not the file name. (It's a bit
complicated; detailed comments in block_based_table_reader.)

Also added DB host id to properties listing by sst_dump, which could be
useful in debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8686

Test Plan: majorly overhauled StableCacheKeys test for this change

Reviewed By: zhichao-cao

Differential Revision: D30457742

Pulled By: pdillinger

fbshipit-source-id: 2e5ae7dddeb94fb9d8eac8a928486aed8b8cd445
2021-08-20 20:40:48 -07:00