Commit Graph

343 Commits

Author SHA1 Message Date
Akanksha Mahajan
b175eceb09 Store FSWritableFilePtr object in WritableFileWriter (#7193)
Summary:
Replace FSWritableFile pointer with FSWritableFilePtr
    object in WritableFileWriter.
    This new object wraps FSWritableFile pointer.

    Objective: If tracing is enabled, FSWritableFile Ptr returns
    FSWritableFileTracingWrapper pointer that includes all necessary
    information in IORecord and calls underlying FileSystem and invokes
    IOTracer to dump that record in a binary file. If tracing is disabled
    then, underlying FileSystem pointer is returned directly.
    FSWritableFilePtr wrapper class is added to bypass the
    FSWritableFileWrapper when
    tracing is disabled.

    Test Plan: make check -j64

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193

Reviewed By: anand1976

Differential Revision: D23355915

Pulled By: akankshamahajan15

fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf
2020-09-08 10:56:08 -07:00
Akanksha Mahajan
1f9f630b27 Store FileSystemPtr object that contains FileSystem ptr (#7180)
Summary:
As part of the IOTracing project, this PR
    1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
    2. FileSystemPtr object is created using FileSystem pointer and IOTracer
    pointer.
    3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
    4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
    returns FileSystemTracingWrapper pointer for tracing purpose and when
    it is disabled underlying FileSystem pointer is returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180

Test Plan:
make check -j64
                COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D22987117

Pulled By: akankshamahajan15

fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
2020-08-12 17:31:23 -07:00
Levi Tamasi
a99fb67233 Remove redundant consistency check from VersionStorageInfo::AddFile (#7237)
Summary:
`VersionStorageInfo::AddFile` currently has a debug-mode consistency
check to make sure the newly added file does not overlap with the
previous one (for levels below L0). Considering that
`VersionBuilder::CheckConsistency` also performs similar checks (in
fact, those checks are more comprehensive and cover L0 as well), this
check is redundant. The patch removes it and also cleans up `AddFile` a
little.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7237

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23041937

Pulled By: ltamasi

fbshipit-source-id: e00665f3b83bfd17f86c54c238800f3d77d739bd
2020-08-11 09:23:17 -07:00
Cheng Chang
cd48ecaa1a Define WAL related classes to be used in VersionEdit and VersionSet (#7164)
Summary:
`WalAddition`, `WalDeletion` are defined in `wal_version.h` and used in `VersionEdit`.
`WalAddition` is used to represent events of creating a new WAL (no size, just log number), or closing a WAL (with size).
`WalDeletion` is used to represent events of deleting or archiving a WAL, it means the WAL is no longer alive (won't be replayed during recovery).

`WalSet` is the set of alive WALs kept in `VersionSet`.

1. Why use `WalDeletion` instead of relying on `MinLogNumber` to identify outdated WALs

On recovery, we can compute `MinLogNumber()` based on the log numbers kept in MANIFEST, any log with number < MinLogNumber can be ignored. So it seems that we don't need to persist `WalDeletion` to MANIFEST, since we can ignore the WALs based on MinLogNumber.

But the `MinLogNumber()` is actually a lower bound, it does not exactly mean that logs starting from MinLogNumber must exist. This is because in a corner case, when a column family is empty and never flushed, its log number is set to the largest log number, but not persisted in MANIFEST. So let's say there are 2 column families, when creating the DB, the first WAL has log number 1, so it's persisted to MANIFEST for both column families. Then CF 0 is empty and never flushed, CF 1 is updated and flushed, so a new WAL with log number 2 is created and persisted to MANIFEST for CF 1. But CF 0's log number in MANIFEST is still 1. So on recovery, MinLogNumber is 1, but since log 1 only contains data for CF 1, and CF 1 is flushed, log 1 might have already been deleted from disk.

We can make `MinLogNumber()` be the exactly minimum log number that must exist, by persisting the most recent log number for empty column families that are not flushed. But if there are N such column families, then every time a new WAL is created, we need to add N records to MANIFEST.

In current design, a record is persisted to MANIFEST only when WAL is created, closed, or deleted/archived, so the number of WAL related records are bounded to 3x number of WALs.

2. Why keep `WalSet` in `VersionSet` instead of applying the `VersionEdit`s to `VersionStorageInfo`

`VersionEdit`s are originally designed to track the addition and deletion of SST files. The SST files are related to column families, each column family has a list of `Version`s, and each `Version` keeps the set of active SST files in `VersionStorageInfo`.

But WALs are a concept of DB, they are not bounded to specific column families. So logically it does not make sense to store WALs in a column family's `Version`s.
Also, `Version`'s purpose is to keep reference to SST / blob files, so that they are not deleted until there is no version referencing them. But a WAL is deleted regardless of version references.
So we keep the WALs in `VersionSet`  for the purpose of writing out the DB state's snapshot when creating new MANIFESTs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7164

Test Plan:
make version_edit_test && ./version_edit_test
make wal_edit_test && ./wal_edit_test

Reviewed By: ltamasi

Differential Revision: D22677936

Pulled By: cheng-chang

fbshipit-source-id: 5a3b6890140e572ffd79eb37e6e4c3c32361a859
2020-08-05 16:34:38 -07:00
Andrew Kryczka
a4a4a2dabd dedup ReadOptions in iterator hierarchy (#7210)
Summary:
Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
and every `LevelIterator`. This redundancy consumes extra memory,
resulting in the `Arena` making more allocations, and iteration
observing worse cache performance.

This PR migrates callers of `NewInternalIterator()` and
`MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
outlive the returned iterator. When the iterator's lifetime will be managed by the
user, this lifetime guarantee is achieved by storing the `ReadOptions`
value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
`MakeInputIterator()` can hold a reference-to-const `ReadOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210

Test Plan:
- `make check` under ASAN and valgrind
- benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.

Reviewed By: anand1976

Differential Revision: D22861323

Pulled By: ajkr

fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
2020-08-03 15:23:04 -07:00
sdong
ca5a069a79 Suppress a TSAN warning (#7126)
Summary:
TSAN shows warning with clang with warning similar to this:

WARNING: ThreadSanitizer: data race (pid=10159)
  Atomic write of size 8 at 0x7b5000002890 by thread T33:
    #0 __tsan_atomic64_store <null> (db_test+0x4ca2b5)
    https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<unsigned long>::store(unsigned long, std::memory_order) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:374:2 (db_test+0x774fde)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::VersionSet::SetLastSequence(unsigned long) /home/circleci/project/./db/version_set.h:1057:20 (db_test+0x774fde)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*) /home/circleci/project/db/db_impl/db_impl_write.cc:449:18 (db_test+0x774fde)
......
  Previous read of size 8 at 0x7b5000002890 by thread T5 (mutexes: write M1044689462619020832):
    #0 rocksdb::DBImpl::ReleaseSnapshot(rocksdb::Snapshot const*) /home/circleci/project/db/db_impl/db_impl.cc (db_test+0x6f4ae7)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::(anonymous namespace)::MTThreadBody(void*) /home/circleci/project/db/db_test.cc:2514:13 (db_test+0x56ac59)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::(anonymous namespace)::StartThreadWrapper(void*) /home/circleci/project/env/env_posix.cc:443:3 (db_test+0x88c4cd)

It is not limited to ReleaseSnapshot() and rocksdb::DBImpl::MultiCFSnapshot().

While we are not 100% sure it doesn't indicate any correctness violation, we suppress them for now to keep TSAN clean with more tests so that we can cover more bugs with CI.

In the gcc runs we have been running, this warning rarely shows up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7126

Test Plan: See the mini-TSAN test to pass with reasonable run time.

Reviewed By: ajkr

Differential Revision: D22552375

fbshipit-source-id: ebdd3854cb3becec3403970326a1ca961db2ab00
2020-07-15 13:25:14 -07:00
wenh
4924a506b9 Reduce env_->GetChildren() calls in DBImpl::Recover() (#7044)
Summary:
There currently exist multiple `GetChildren()` calls in `DBImpl::Recover()`, which can be expensive in cases of distributed file systems.
This pull request try to call `DBImpl::Recover()` of each necessary directory only _once_ and reuse the results in the places of repeated calls in current code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7044

Test Plan:
Run `make check` and use the default test suite. The modified code should be semantically identical to the current code. As a proof of this solution, we may optionally deploy the system onto a (real or simulated) distributed system and expect reduced latency caused by manifest fetching.

(WIP)

Reviewed By: riversand963

Differential Revision: D22419925

Pulled By: roghnin

fbshipit-source-id: d3774fbfbc246c5527101bc16747eb5c90919886
2020-07-10 13:41:08 -07:00
Levi Tamasi
e367bc7f4b Clean up blob files based on the linked SST set (#7001)
Summary:
The earlier `VersionBuilder` code only cleaned up blob files that were
marked as entirely consisting of garbage using `VersionEdits` with
`BlobFileGarbage`. This covers the cases when table files go through
regular compaction, where we iterate through the KVs and thus have an
opportunity to calculate the amount of garbage (that is, most cases).
However, it does not help when table files are simply dropped (e.g. deletion
compactions or the `DeleteFile` API). To deal with such cases, the patch
adds logic that cleans up all blob files at the head of the list until the first
one with linked SSTs is found. (As an example, let's assume we have blob files
with numbers 1..10, and the first one with any linked SSTs is number 8.
This means that SSTs in the `Version` only rely on blob files with numbers >= 8,
and thus 1..7 are no longer needed.)

The code change itself is pretty small; however, changing the logic like this
necessitated changes to some tests that have been added recently (namely
to the ones that use blob files in isolation, i.e. without any table files referring
to them). Some of these cases were fixed by bypassing `VersionBuilder` altogether
in order to keep the tests simple (which actually makes them more proper unit tests
as well), while the `VersionBuilder` unit tests were fixed by adding dummy table
files to the test cases as needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7001

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D22119474

Pulled By: ltamasi

fbshipit-source-id: c6547141355667d4291d9661d6518eb741e7b54a
2020-06-30 15:31:21 -07:00
Yanqin Jin
d47c871190 Fix data race to VersionSet::io_status_ (#7034)
Summary:
After https://github.com/facebook/rocksdb/issues/6949 , VersionSet::io_status_ can be concurrently accessed by multiple
threads without lock, causing tsan test to fail. For example, a bg flush thread
resets io_status_ before calling LogAndApply(), while another thread already in
the process of LogAndApply() reads io_status_. This is a bug.

We do not have to reset io_status_ each time we call LogAndApply(). io_status_
is part of the state of VersionSet, and it indicates the outcome of preceding
MANIFEST/CURRENT files IO operations. Its value should be updated only when:

1. MANIFEST/CURRENT files IO fail for the first time.
2. MANIFEST/CURRENT files IO succeed as part of recovering from a prior
   failure without process restart, e.g. calling Resume().

Test Plan (devserver):
COMPILE_WITH_TSAN=1 make check
COMPILE_WITH_TSAN=1 make db_test2
./db_test2 --gtest_filter=DBTest2.CompactionStall
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7034

Reviewed By: zhichao-cao

Differential Revision: D22247137

Pulled By: riversand963

fbshipit-source-id: 77b83e05390f3ee3cd2d96d3fdd6fe4f225e3216
2020-06-27 08:57:31 -07:00
Yanqin Jin
e66199d848 First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.

Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949

Reviewed By: anand1976

Differential Revision: D22026020

Pulled By: riversand963

fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-24 19:07:08 -07:00
Yanqin Jin
569b87e8c7 Fail recovery when MANIFEST record checksum mismatch (#6996)
Summary:
https://github.com/facebook/rocksdb/issues/5411 refactored `VersionSet::Recover` but introduced a bug, explained as follows.
Before, once a checksum mismatch happens, `reporter` will set `s` to be non-ok. Therefore, Recover will stop processing the MANIFEST any further.
```
// Correct
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
while (reader.ReadRecord() && s.ok()) {
...
}
```
The bug is that, the local variable `s` in `ReadAndRecover` won't be updated by `reporter` while reading the MANIFEST. It is possible that the reader sees a checksum mismatch in a record, but `ReadRecord` retries internally read and finds the next valid record. The mismatched record will be ignored and no error is reported.
```
// Incorrect
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
s = ReadAndRecover(reader, ...);

// Inside ReadAndRecover
  Status s;  // Shadows the s in Recover.
  while (reader.ReadRecord() && s.ok()) {
   ...
  }
```
`LogReporter` can use a separate `log_read_status` to track the errors while reading the MANIFEST. RocksDB can process more MANIFEST entries only if `log_read_status.ok()`.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6996

Reviewed By: ajkr

Differential Revision: D22105746

Pulled By: riversand963

fbshipit-source-id: b22f717a423457a41ca152a242abbb64cf91fc38
2020-06-18 10:09:12 -07:00
Andrew Kryczka
02db03af8d make L0 index/filter pinned memory usage predictable (#6911)
Summary:
Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure
we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911

Test Plan: unit test

Reviewed By: siying

Differential Revision: D21835818

Pulled By: ajkr

fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4
2020-06-09 16:51:23 -07:00
Levi Tamasi
f5e649453c Add convenience method GetFileMetaDataByNumber (#6940)
Summary:
The patch adds a convenience method `GetFileMetaDataByNumber` that
builds on the `FileLocation` functionality introduced recently (see
https://github.com/facebook/rocksdb/pull/6862). This method makes it possible to
retrieve the `FileMetaData` directly as opposed to having to go through
`LevelFiles` and friends.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6940

Test Plan: `make check`

Reviewed By: cheng-chang

Differential Revision: D21905946

Pulled By: ltamasi

fbshipit-source-id: af99e19de21242b2b4a87594a535c6028d16ee72
2020-06-08 16:01:59 -07:00
Levi Tamasi
e3f953a863 Make it possible to look up files by number in VersionStorageInfo (#6862)
Summary:
Does what it says on the can: the patch adds a hash map to `VersionStorageInfo`
that maps file numbers to file locations, i.e. (level, position in level) pairs. This
will enable stricter consistency checks in `VersionBuilder`. The patch also fixes
all the unit tests that used duplicate file numbers in a version (which would trigger
an assertion with the new code).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6862

Test Plan:
`make check`
`make whitebox_crash_test`

Reviewed By: riversand963

Differential Revision: D21670446

Pulled By: ltamasi

fbshipit-source-id: 2eac249945cf33d8fb8597b26bfff5221e1a861a
2020-05-28 10:03:06 -07:00
Levi Tamasi
a00ddf1574 Expose the set of live blob files from Version/VersionSet (#6785)
Summary:
The patch adds logic that returns the set of live blob files from
`Version::AddLiveFiles` and `VersionSet::AddLiveFiles` (in addition to
live table files), and also cleans up the code a bit, for example, by
exposing only the numbers of table files as opposed to the earlier
`FileDescriptor`s that no clients used. Moreover, the patch extends
the `GetLiveFiles` API so that it also exposes blob files in the current version.
Similarly to https://github.com/facebook/rocksdb/pull/6755,
this is a building block for identifying and purging obsolete blob files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6785

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21336210

Pulled By: ltamasi

fbshipit-source-id: fc1aede8a49eacd03caafbc5f6f9ce43b6270821
2020-05-04 15:08:13 -07:00
Levi Tamasi
fe238e5438 Keep track of obsolete blob files in VersionSet (#6755)
Summary:
The patch adds logic to keep track of obsolete blob files. A blob file becomes
obsolete when the last `shared_ptr` that points to the corresponding
`SharedBlobFileMetaData` object goes away, which, in turn, happens when the
last `Version` that contains the blob file is destroyed. No longer needed blob
files are added to the obsolete list in `VersionSet` using a custom deleter to
avoid unnecessary coupling between `SharedBlobFileMetaData` and `VersionSet`.
Obsolete blob files are returned by `VersionSet::GetObsoleteFiles` and stored
in `JobContext`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6755

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21233155

Pulled By: ltamasi

fbshipit-source-id: 47757e06fdc0127f27ed57f51abd27893d9a7b7a
2020-04-30 11:25:51 -07:00
Mike Kolupaev
e45673dece Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:40:44 -07:00
Zhichao Cao
4246888101 Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 16:04:43 -07:00
Levi Tamasi
6f62322fe4 Add blob files to VersionStorageInfo/VersionBuilder (#6597)
Summary:
The patch adds a couple of classes to represent metadata about
blob files: `SharedBlobFileMetaData` contains the information elements
that are immutable (once the blob file is closed), e.g. blob file number,
total number and size of blob files, checksum method/value, while
`BlobFileMetaData` contains attributes that can vary across versions like
the amount of garbage in the file. There is a single `SharedBlobFileMetaData`
for each blob file, which is jointly owned by the `BlobFileMetaData` objects
that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s
and can also be shared if the (immutable _and_ mutable) state of the blob file
is the same in two versions.

In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends
`VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those
containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata
to a new `VersionStorageInfo`. Consistency checks are also extended to ensure
that table files point to blob files that are part of the `Version`, and that all blob files
that are part of any given `Version` have at least some _non_-garbage data in them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D20656803

Pulled By: ltamasi

fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
2020-03-26 18:51:53 -07:00
Mike Kolupaev
963af52f15 Fix iterator reading filter block despite read_tier == kBlockCacheTier (#6562)
Summary:
We're seeing iterators with `ReadOptions::read_tier == kBlockCacheTier` sometimes doing file reads. Stack trace:

```
rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*, bool) const
rocksdb::BlockFetcher::ReadBlockContents()
rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const
rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool) const
rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)
rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const
rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const
rocksdb::FullFilterBlockReader::RangeMayExist(rocksdb::Slice const*, rocksdb::Slice const&, rocksdb::SliceTransform const*, rocksdb::Comparator const*, rocksdb::Slice const*, bool*, bool, rocksdb::BlockCacheLookupContext*)
rocksdb::BlockBasedTable::PrefixMayMatch(rocksdb::Slice const&, rocksdb::ReadOptions const&, rocksdb::SliceTransform const*, bool, rocksdb::BlockCacheLookupContext*) const
rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::SeekImpl(rocksdb::Slice const*)
rocksdb::ForwardIterator::SeekInternal(rocksdb::Slice const&, bool)
rocksdb::DBIter::Seek(rocksdb::Slice const&)
```

`BlockBasedTableIterator::CheckPrefixMayMatch` was missing a check for `kBlockCacheTier`. This PR adds it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6562

Test Plan: deployed it to a logdevice test cluster and looked at logdevice's IO tracing.

Reviewed By: siying

Differential Revision: D20529368

Pulled By: al13n321

fbshipit-source-id: 65bf33964b1951464415c900336635fb20919611
2020-03-26 15:21:26 -07:00
Yanqin Jin
fb09ef05dc Attempt to recover from db with missing table files (#6334)
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.

Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334

Reviewed By: anand1976

Differential Revision: D19778960

Pulled By: riversand963

fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
2020-03-20 19:30:48 -07:00
Zhichao Cao
8d73137ae8 Replace Directory with FSDirectory in DB (#6468)
Summary:
In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761  introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468

Test Plan: pass make asan_check

Differential Revision: D20195261

Pulled By: zhichao-cao

fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1
2020-03-02 16:16:26 -08:00
Huisheng Liu
904a60ff63 return timestamp from get (#6409)
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.

ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
    base line (commit 72ee067b9):
        101.712 micros/op 314602 ops/sec;   36.0 MB/s (5658999 of 5658999 found)
    This PR:
        100.288 micros/op 319071 ops/sec;   36.5 MB/s (5674999 of 5674999 found)

./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409

Differential Revision: D20200086

Pulled By: riversand963

fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
2020-03-02 16:01:00 -08:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Zhichao Cao
4369f2c7bb Checksum for each SST file and stores in MANIFEST (#6216)
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.

Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string).  A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216

Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.

Differential Revision: D19171461

Pulled By: zhichao-cao

fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
2020-02-10 15:52:52 -08:00
Yanqin Jin
1aaa145877 Fix a data race for cfd->log_number_ (#6249)
Summary:
A thread calling LogAndApply may release db mutex when calling
WriteCurrentStateToManifest() which reads cfd->log_number_. Another thread can
call SwitchMemtable() and writes to cfd->log_number_.
Solution is to cache the cfd->log_number_ before releasing mutex in
LogAndApply.

Test Plan (on devserver):
```
$COMPILE_WITH_TSAN=1 make db_stress
$./db_stress --acquire_snapshot_one_in=10000 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=zstd --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_pipelined_write=0  --flush_one_in=1000000 --format_version=5 --get_live_files_and_wal_files_one_in=1000000 --index_block_restart_interval=5 --index_type=0 --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=0 --nooverwritepercent=1 --open_files=500000 --ops_per_thread=100000000 --partition_filters=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --set_options_one_in=10000 --snapshot_hold_ops=100000 --subcompactions=2 --sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35
```
Then repeat the following multiple times, e.g. 100 after compiling with tsan.
```
$./db_test2 --gtest_filter=DBTest2.SwitchMemtableRaceWithNewManifest
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6249

Differential Revision: D19235077

Pulled By: riversand963

fbshipit-source-id: 79467b52f48739ce7c27e440caa2447a40653173
2020-01-06 20:09:51 -08:00
anand76
afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00
sdong
aa6f7d0995 Support periodic compaction in universal compaction (#5970)
Summary:
Previously, periodic compaction is not supported in universal compaction. Add the support using following approach: if any file is marked as qualified for periodid compaction, trigger a full compaction. If a full compaction is prevented by files being compacted, try to compact the higher levels than files currently being compacted. If in this way we can only compact the last sorted run and none of the file to be compacted qualifies for periodic compaction, skip the compact. This is to prevent the same single level compaction from being executed again and again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5970

Test Plan: Add several test cases.

Differential Revision: D18147097

fbshipit-source-id: 8ecc308154d9aca96fb192c51fbceba3947550c1
2019-10-31 11:31:37 -07:00
Vijay Nadimpalli
ec880436c1 API to get file_creation_time of the oldest file in the DB (#5948)
Summary:
Adding a new API to db.h that allows users to get file_creation_time of the oldest file in the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5948

Test Plan: Added unit test.

Differential Revision: D18056151

Pulled By: vjnadimpalli

fbshipit-source-id: 448ec9d34cb6772e1e5a62db399ace00dcbfbb5d
2019-10-25 11:53:57 -07:00
Shylock Hg
9eb3e1f77d Use delete to disable automatic generated methods. (#5009)
Summary:
Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009

Differential Revision: D17288733

fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
2019-09-11 18:09:00 -07:00
Vijay Nadimpalli
979fbdc696 Persistent globally unique DB ID in manifest (#5725)
Summary:
Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong.
The goal of this PR is solve this problem by moving the  DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725

Test Plan: Added unit tests.

Differential Revision: D16963840

Pulled By: vjnadimpalli

fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
2019-09-03 08:52:24 -07:00
Pratik Dhandharia
a281822331 Lower the risk for users to run options.force_consistency_checks = true (#5744)
Summary:
Open-source users recently reported two occurrences of LSM-tree corruption (https://github.com/facebook/rocksdb/issues/5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be.

Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5744

Differential Revision: D17096940

Pulled By: pdhandharia

fbshipit-source-id: b6780039044e265f26ed2ad03c51f4abbe8b603c
2019-08-29 14:07:37 -07:00
Eli Pozniansky
c2404d9928 Optimizing ApproximateSize to create index iterator just once (#5693)
Summary:
VersionSet::ApproximateSize doesn't need to create two separate index iterators and do binary search for each in BlockBasedTable. So BlockBasedTable::ApproximateSize was added that creates the iterator once and uses it to calculate the data size between start and end keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5693

Differential Revision: D16774056

Pulled By: elipoz

fbshipit-source-id: 53ce262e1a057788243bf30cd9b8aa6581df1a18
2019-08-16 14:18:28 -07:00
sdong
bd2c753dd0 Add command "list_file_range_deletes" in ldb (#5615)
Summary:
Add a command in ldb so that users can print out tombstones in SST files.
In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't return from the program, but return the status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615

Test Plan: Add a new unit test

Differential Revision: D16550326

fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
2019-08-15 17:01:03 -07:00
Vijay Nadimpalli
d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
Eli Pozniansky
4834dab578 Improve CPU Efficiency of ApproximateSize (part 2) (#5609)
Summary:
In some cases, we don't have to get really accurate number. Something like 10% off is fine, we can create a new option for that use case. In this case, we can calculate size for full files first, and avoid estimation inside SST files if full files got us a huge number. For example, if we already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609

Differential Revision: D16433481

Pulled By: elipoz

fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b
2019-07-31 08:50:00 -07:00
Eli Pozniansky
6b7fcc0d5f Improve CPU Efficiency of ApproximateSize (part 1) (#5613)
Summary:
1. Avoid creating the iterator in order to call BlockBasedTable::ApproximateOffsetOf(). Instead, directly call into it.
2. Optimize BlockBasedTable::ApproximateOffsetOf() keeps the index block iterator in stack.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5613

Differential Revision: D16442660

Pulled By: elipoz

fbshipit-source-id: 9320be3e918c139b10e758cbbb684706d172e516
2019-07-23 15:34:33 -07:00
haoyuhuang
705b8eecb4 Add more callers for table reader. (#5454)
Summary:
This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block.
1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.

This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.

Differential Revision: D15819451

Pulled By: HaoyuHuang

fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
2019-06-20 14:31:48 -07:00
Yanqin Jin
f287f8dc93 Fix a bug caused by secondary not skipping the beginning of new MANIFEST (#5472)
Summary:
While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows.
The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files **again** to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them.

Test plan (on dev server)
```
$make clean && make -j32 all
$./db_secondary_test
```
All existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472

Differential Revision: D15866771

Pulled By: riversand963

fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed
2019-06-18 11:21:37 -07:00
haoyuhuang
bb4178066d Integrate block cache tracer into db_impl (#5433)
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433

Differential Revision: D15728016

Pulled By: HaoyuHuang

fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
2019-06-13 15:43:10 -07:00
haoyuhuang
5efa0d6b0d Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421)
Summary:
BlockCacheLookupContext only contains the caller for now.
We will trace block accesses at five places:
1. BlockBasedTable::GetFilter.
2. BlockBasedTable::GetUncompressedDict.
3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)

We create the context at:
1. BlockBasedTable::Get. (kUserGet)
2. BlockBasedTable::MultiGet. (kUserMGet)
3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
4. BlockBasedTable::Open. (kPrefetch)
5. Index/Filter::CacheDependencies. (kPrefetch)
6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).

I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Throughput of this PR: 231334 ops/s.
Throughput of the master branch: 238428 ops/s.

Experiment setup:
RocksDB:    version 6.2
Date:       Mon Jun 10 10:42:51 2019
CPU:        24 * Intel Core Processor (Skylake)
CPUCache:   16384 KB
Keys:       20 bytes each
Values:     100 bytes each (100 bytes after compression)
Entries:    1000000
Prefix:    20 bytes
Keys per prefix:    0
RawSize:    114.4 MB (estimated)
FileSize:   114.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1

Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000

Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120

TODOs:
1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
2. Integrate tracer to trace block cache accesses.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421

Differential Revision: D15704258

Pulled By: HaoyuHuang

fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
2019-06-10 15:33:27 -07:00
haoyuhuang
227b5d52df Make RocksDB secondary instance respect atomic groups in version edits. (#5411)
Summary:
With this commit, RocksDB secondary instance respects atomic groups in version edits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5411

Differential Revision: D15617512

Pulled By: HaoyuHuang

fbshipit-source-id: 913f4ede391d772dcaf5649e3cd2099fa292d120
2019-06-04 10:56:19 -07:00
Vijay Nadimpalli
49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Yanqin Jin
f1302ebab8 Add class-level comments to version-related classes (#5348)
Summary:
As title.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5348

Differential Revision: D15564595

Pulled By: riversand963

fbshipit-source-id: dd45aa86a70e0343c2e9ef702fad165163f548e6
2019-05-30 16:18:33 -07:00
haoyuhuang
518cd1a62a Use GetCurrentManifestPath to locate current MANIFEST file (#5331)
Summary:
In version_set.cc, there is a function GetCurrentManifestPath. The goal of this task is to refactor ListColumnFamilies function so that ListColumnFamilies calls GetCurrentManifestPath to search for MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5331

Differential Revision: D15444524

Pulled By: HaoyuHuang

fbshipit-source-id: 1dcbd030bc0f2e835695741f450bba150f2f2903
2019-05-22 09:21:56 -07:00
JiYou
5b7e09bd6f VersionSet: optmize GetOverlappingInputsRangeBinarySearch (#4987)
Summary:
`GetOverlappingInputsRangeBinarySearch` firstly use binary search
to find a index in the given range `[begin, end]`. But after find
the index, then use linear search to find the `start_index` and
`end_index`. So the search process degraded to linear time.

Here optmize the search process with below changes:

- use `std::lower_bound` and `std::upper_bound` to get
  `lg(n)` search complexity.
- use uniformed lambda for search process.
- simplify process for `within_interval` true or false.
- remove function `ExtendFileRangeWithinInterval`
  and `ExtendFileRangeOverlappingInterval`.

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4987

Differential Revision: D14984192

Pulled By: riversand963

fbshipit-source-id: fae4b8e59a21b7e350718d60cdc94dd55ac81e89
2019-04-17 18:15:20 -07:00
anand76
fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Sagar Vemuri
d3d20dcdca Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.

This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted.  And also, of course, it helps to cleanup data older than certain threshold.

- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).

This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166

Differential Revision: D14884441

Pulled By: sagar0

fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
2019-04-10 19:31:18 -07:00
Yanqin Jin
9358178edc Support for single-primary, multi-secondary instances (#4899)
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.

This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.

2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.

3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.

4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899

Differential Revision: D14510945

Pulled By: riversand963

fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
2019-03-26 16:45:31 -07:00
Siying Dong
48e7effa79 Avoid to go through every CF for every ReleaseSnapshot() (#5090)
Summary:
With https://github.com/facebook/rocksdb/pull/3009 we go through every CF
to check whether a bottommost compaction is needed to be triggered. This is done
within DB mutex. What we do within DB mutex may heavily influece the write throughput
we can achieve, so we always want to minimize work there.

Here we try to avoid this for-loop by first check a global threshold. In most of
the time, the CF loop can be avoided.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5090

Differential Revision: D14582684

Pulled By: siying

fbshipit-source-id: 968f6d9bb6affe1a5ebc4910b418300b076f166f
2019-03-25 19:18:04 -07:00