Commit Graph

1072 Commits

Author SHA1 Message Date
Hui Xiao
6785135bc2 Update HISTORY.md for new rate limiter io priorities (#8896)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8896

Reviewed By: ajkr

Differential Revision: D30846120

Pulled By: hx235

fbshipit-source-id: 9224ebce5437d63b0fb8af9171c6041a9ea5d90f
2021-09-09 13:26:20 -07:00
hx235
45175ca2e1 Charge read to rate limiter in BackupEngine (#8722)
Summary:
Context:
While all the non-trivial write operations in BackupEngine go through the RateLimiter, reads currently do not. In general, this is not a huge issue because (especially since some I/O efficiency fixes) reads in BackupEngine are mostly limited by corresponding writes, for both backup and restore. But in principle we should charge the RateLimiter for reads as well.
- Charged read operations in `BackupEngineImpl::CopyOrCreateFile`, `BackupEngineImpl::ReadFileAndComputeChecksum`, `BackupEngineImpl::BackupMeta::LoadFromFile` and `BackupEngineImpl::GetFileDbIdentities`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8722

Test Plan:
- Passed existing tests
- Passed added unit tests

Reviewed By: pdillinger

Differential Revision: D30610464

Pulled By: hx235

fbshipit-source-id: 9b08c9387159a5385c8d390d6666377a0d0117e5
2021-09-08 16:24:40 -07:00
Zhiyi Zhang
0cb0fc6fd3 Add DB properties for BlobDB (#8734)
Summary:
RocksDB exposes certain internal statistics via the DB property interface.
However, there are currently no properties related to BlobDB.

For starters, we would like to add the following BlobDB properties:
`rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
`rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
`rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions.
`rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
`rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734

Test Plan:
```
➜  rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
[==========] Running 16 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 10 tests from DBBlobBasicTest
[ RUN      ] DBBlobBasicTest.GetBlob
[       OK ] DBBlobBasicTest.GetBlob (12 ms)
[ RUN      ] DBBlobBasicTest.MultiGetBlobs
[       OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
[ RUN      ] DBBlobBasicTest.GetBlob_CorruptIndex
[       OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
[ RUN      ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
[       OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
[ RUN      ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
[       OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
[ RUN      ] DBBlobBasicTest.GenerateIOTracing
[       OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
[ RUN      ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
[       OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
[ RUN      ] DBBlobBasicTest.GetMergeBlobWithPut
[       OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
[ RUN      ] DBBlobBasicTest.MultiGetMergeBlobWithPut
[       OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
[ RUN      ] DBBlobBasicTest.BlobDBProperties
[       OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
[----------] 10 tests from DBBlobBasicTest (124 ms total)

[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)

[----------] Global test environment tear-down
[==========] 16 tests from 2 test cases ran. (2190 ms total)
[  PASSED  ] 16 tests.
```

Reviewed By: ltamasi

Differential Revision: D30690849

Pulled By: Zhiyi-Zhang

fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 12:22:04 -07:00
Peter Dillinger
e40b04e9fa Fix POSIX LockFile after failure to create file (#8747)
Summary:
Failure to create the lock file (e.g. out of space) could
prevent future LockFile attempts in the same process on the same file
from succeeding.

Also added DEBUG code to fail assertion if PosixFileLock is destroyed
without using UnlockFile (which is a risk because FileLock is in the
public API with virtual destructor).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8747

Test Plan: test added

Reviewed By: ajkr

Differential Revision: D30732543

Pulled By: pdillinger

fbshipit-source-id: 4c30a959566d91f778d6fad3fbbd5f3941b097c1
2021-09-07 22:41:36 -07:00
Andrew Kryczka
9308ff366c Bytes read/written stats for CreateNewBackup*() (#8819)
Summary:
Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8819

Reviewed By: pdillinger

Differential Revision: D30779238

Pulled By: ajkr

fbshipit-source-id: 75ccafc355f90906df5cf80367f7245b985772d8
2021-09-07 18:25:16 -07:00
Andrew Kryczka
941543721d Bytes read stat for VerifyChecksum() and VerifyFileChecksums() APIs (#8741)
Summary:
- Clarified some comments on compatibility for adding new ticker stats
- Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741

Test Plan: new unit test

Reviewed By: zhichao-cao

Differential Revision: D30708578

Pulled By: ajkr

fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449
2021-09-07 13:28:29 -07:00
Peter Dillinger
c9cd5d25a8 Remove some unneeded code (#8736)
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736

Test Plan: existing tests

Reviewed By: mrambacher

Differential Revision: D30700039

Pulled By: pdillinger

fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
2021-09-01 14:28:58 -07:00
Peter Dillinger
32752551b9 Fix a buffer size race condition in BackupEngine (#8732)
Summary:
If RateLimiter burst bytes changes during concurrent Restore
operations

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8732

Test Plan: updated unit test fails with TSAN before change, passes after

Reviewed By: ajkr

Differential Revision: D30683879

Pulled By: pdillinger

fbshipit-source-id: d0ddb3587ade91ee2a4d926b475acf7781b03086
2021-09-01 14:28:58 -07:00
anand76
ec9f52ece6 Fix a race in LRUCacheShard::Promote (#8717)
Summary:
In ```LRUCacheShard::Promote```, a reference is released outside the LRU mutex. Fix the race condition.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8717

Reviewed By: zhichao-cao

Differential Revision: D30649206

Pulled By: anand1976

fbshipit-source-id: 09c0af05b2294a7fe2c02876a61b0bad6e3ada61
2021-08-30 19:10:55 -07:00
Peter Dillinger
13ded69484 Built-in support for generating unique IDs, bug fix (#8708)
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.

This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.

Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
  * Lots of info from default Env like time, process id, and hostname.
  * std::random_device
  * port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.

DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.

GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.

Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708

Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D30563780

Pulled By: pdillinger

fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
2021-08-30 15:20:41 -07:00
Merlin Mao
6c2bd28a61 Update comments, fix typos. (#8721)
Summary:
- Removed the default empty constructors of `TraceWriter` and `TraceReader`.
- Removed unused `ReadFooter()` from `ReplayerImpl`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8721

Test Plan: None

Reviewed By: zhichao-cao

Differential Revision: D30609743

Pulled By: autopear

fbshipit-source-id: 7e2626b015bd57ebb408a2836b4b4217cea10002
2021-08-27 13:16:32 -07:00
anand76
ebaa3c8a59 Fix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714)
Summary:
DumpStats() iterates through the ColumnFamilySet. There is a potential
race condition because it does Ref the cfd, and the cfd could get
destroyed during the iteration.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30580199

Pulled By: anand1976

fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850
2021-08-26 15:40:26 -07:00
Yanqin Jin
f235f4b0a3 Fix a bug of secondary instance sequence going backward (#8653)
Summary:
Recent refactor of `ReactiveVersionSet::ReadAndApply()` uses
`ManifestTailer` whose `Iterate()` method can cause the db's
`last_sequence_` to go backward. Consequently, read requests can see
out-dated data. For example, latest changes to the primary will not be
seen on the secondary even after a `TryCatchUpWithPrimary()` if no new
write batches are read from the WALs and no new MANIFEST entries are
read from the MANIFEST.

Fix the bug so that `VersionEditHandler::CheckIterationResult` will
never decrease `last_sequence_`, `last_allocated_sequence_` and
`last_published_sequence_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8653

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30272084

Pulled By: riversand963

fbshipit-source-id: c6a49c534b2509b93ef62d8936ed0acd5b860eaa
2021-08-24 18:18:36 -07:00
Yanqin Jin
229350ef48 Allow iterate refresh for secondary instance (#8700)
Summary:
Test plan
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8700

Reviewed By: zhichao-cao

Differential Revision: D30523907

Pulled By: riversand963

fbshipit-source-id: 68928ab4dafb64ce80ab7bc69d83727a4713ab91
2021-08-24 15:40:56 -07:00
Jay Zhuang
249b1078c9 Add extra information to RemoteCompaction APIs (#8680)
Summary:
Currently, we only provide job_id in RemoteCompaction APIs, the
main problem of `job_id` is it cannot uniquely identify a compaction job
between DB instances or between sessions.
Providing DB and session id to the user, which will make building cross
DB compaction service easier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30444859

Pulled By: jay-zhuang

fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448
2021-08-23 16:27:38 -07:00
Peter Dillinger
0637c8d36c Fix typo in 6.24.0 HISTORY.md (#8694)
Summary:
fix typo

Also, clarified change of C API signatures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8694

Test Plan: visual

Reviewed By: ltamasi

Differential Revision: D30492882

Pulled By: pdillinger

fbshipit-source-id: ac6dc3dcefa01c91fd87fc7f50279ea5e13fa41d
2021-08-23 13:30:34 -07:00
Levi Tamasi
8c9e689790 Update version.h and HISTORY.md for the 6.24 release (#8688)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8688

Reviewed By: ajkr, riversand963

Differential Revision: D30467746

Pulled By: ltamasi

fbshipit-source-id: 0fce0d42fe2fe3cb56d7a89607154b3b957f09b6
2021-08-20 22:28:16 -07:00
Peter Dillinger
2a383f21f4 Add Bloom/Ribbon hybrid API support (#8679)
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.

So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)

I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.

C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.

BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679

Test Plan: new + updated tests, including crash test

Reviewed By: jay-zhuang

Differential Revision: D30445797

Pulled By: pdillinger

fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
2021-08-20 18:00:16 -07:00
anand76
f35042ca40 Add a PerfContext counter for secondary cache hits (#8685)
Summary:
Add a PerfContext counter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8685

Reviewed By: zhichao-cao

Differential Revision: D30453957

Pulled By: anand1976

fbshipit-source-id: 42888a3ced240e1c44446d52d3b04adfb01f5665
2021-08-20 15:17:30 -07:00
Akanksha Mahajan
5efec84c60 Fix blob callback in compaction and atomic flush (#8681)
Summary:
Pass BlobFileCompletionCallback  in case of atomic flush and
compaction job which is currently nullptr(default parameter).
BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to
SstFileManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681

Test Plan: CircleCI jobs

Reviewed By: ltamasi

Differential Revision: D30445998

Pulled By: akankshamahajan15

fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021
2021-08-20 11:41:14 -07:00
mrambacher
9eb002fcf0 Fix some minor issues in the Customizable infrastructure (#8566)
Summary:
- Fix issue with OptionType::Vector when the nested item is a Customizable with no names
- Fix issue with OptionType::Vector to appropriately wrap the elements in a Vector;
- Fix an issue with nested Customizable object with a null immutable object still appearing in the mutable options;
- Fix/Add tests for null/empty customizable objects
- Move the RegisterTestObjects from customizable_test into testutil.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566

Reviewed By: zhichao-cao

Differential Revision: D30303724

Pulled By: mrambacher

fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5
2021-08-19 10:10:47 -07:00
Merlin Mao
d10801e983 Allow Replayer to report the results of TraceRecords. (#8657)
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.

New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".

`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657

Reviewed By: ajkr

Differential Revision: D30290216

Pulled By: autopear

fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
2021-08-18 17:06:14 -07:00
Yanqin Jin
2b367fa8cc Fix bug caused by releasing snapshot(s) during compaction (#8608)
Summary:
In debug mode, we are seeing assertion failure as follows

```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```

It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.

In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.

Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.

Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```

It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30145645

Pulled By: riversand963

fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81
2021-08-17 22:14:20 -07:00
Levi Tamasi
6878cedcc3 Add statistics support to integrated BlobDB (#8667)
Summary:
The patch adds statistics support to the integrated BlobDB implementation,
namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and
`BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms
`BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like
`BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`,
`BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.)
Note that the vast majority of the old BlobDB's tickers/histograms are not
really applicable to the new implementation, since they e.g. pertain to calling
dedicated BlobDB APIs (which the integrated BlobDB does not have) or are
tied to the legacy BlobDB's design of writing blob files synchronously when
a write API is called. Such statistics are marked "legacy BlobDB only" in
`statistics.h`.

Fixes https://github.com/facebook/rocksdb/issues/8645 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667

Test Plan: Ran `make check` and tested the new statistics using `db_bench`.

Reviewed By: riversand963

Differential Revision: D30356884

Pulled By: ltamasi

fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4
2021-08-17 17:22:31 -07:00
anand76
add68bd28a Add a stat to count secondary cache hits (#8666)
Summary:
Add a stat for secondary cache hits. The ```Cache::Lookup``` API had an unused ```stats``` parameter. This PR uses that to pass the pointer to a ```Statistics``` object that ```LRUCache``` uses to record the stat.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666

Test Plan: Update a unit test in lru_cache_test

Reviewed By: zhichao-cao

Differential Revision: D30353816

Pulled By: anand1976

fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77
2021-08-16 21:01:14 -07:00
Jay Zhuang
c55460c734 Add property LiveSstFilesSizeAtTemperature for tiered storage (#8644)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8644

Reviewed By: siying, zhichao-cao

Differential Revision: D30236535

Pulled By: jay-zhuang

fbshipit-source-id: 1758d1c46d83a5087560fb63d53a016bf999da81
2021-08-15 14:17:45 -07:00
Merlin Mao
74a652a45f Code cleanup for trace replayer (#8652)
Summary:
- Remove extra `;` in trace_record.h
- Remove some unnecessary `assert` in trace_record_handler.cc
- Initialize `env_` after` exec_handler_` in `ReplayerImpl` to let db be asserted in creating the handler before getting `db->GetEnv()`.
- Update history to include the new `TraceReader::Reset()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652

Reviewed By: ajkr

Differential Revision: D30276872

Pulled By: autopear

fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37
2021-08-12 09:22:43 -07:00
Merlin Mao
f58d276764 Make TraceRecord and Replayer public (#8611)
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.

User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.

Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611

Reviewed By: ajkr

Differential Revision: D30266329

Pulled By: autopear

fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
2021-08-11 19:32:46 -07:00
Jay Zhuang
87e2358736 Add suggestion for btrfs user to disable preallocation (#8646)
Summary:
Add comment for `options.allow_fallocate` that btrfs
preallocated space are not freed and a suggestion to disable
preallocation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8646

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D30240050

Pulled By: jay-zhuang

fbshipit-source-id: 75b7190bc8276ce8d8ac2d0cb9064b386cbf4768
2021-08-11 14:53:37 -07:00
sdong
e7c24168d8 Move old files to warm tier in FIFO compactions (#8310)
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310

Test Plan: Add several unit tests.

Reviewed By: ajkr

Differential Revision: D28493792

fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
2021-08-09 12:51:14 -07:00
Levi Tamasi
87882736ef Fix the sorting of KeyContexts for batched MultiGet (#8633)
Summary:
`CompareKeyContext::operator()` on the trunk has a bug: when comparing
column family IDs, `lhs` is used for both sides of the comparison. This
results in the `KeyContext`s getting sorted solely based on key, which
in turn means that keys with the same column family do not necessarily
form a single range in the sorted list. This violates an assumption of the
batched `MultiGet` logic, leading to the same column family
showing up multiple times in the list of `MultiGetColumnFamilyData`.
The end result is the code attempting to check out the thread-local
`SuperVersion` for the same CF multiple times, causing an
assertion violation in debug builds and memory corruption/crash in
release builds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D30169182

Pulled By: ltamasi

fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023
2021-08-06 16:27:42 -07:00
Peter Dillinger
a7fd1d0881 Make backup restore atomic, with sync option (#8568)
Summary:
Guarantees that if a restore is interrupted, DB::Open will fail. This works by
restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT.

Also makes restore respect BackupEngineOptions::sync (default true). When set,
the restore is guaranteed persisted by the time it returns OK. Also makes the above
atomicity guarantee work in case the interruption is power loss or OS crash (not just
process interruption or crash).

Fixes https://github.com/facebook/rocksdb/issues/8500

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568

Test Plan:
added to backup mini-stress unit test. Passes with
gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp)

Reviewed By: akankshamahajan15

Differential Revision: D29812605

Pulled By: pdillinger

fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e
2021-08-06 09:50:21 -07:00
Akanksha Mahajan
fd2079938d Dynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620)
Summary:
Dynamically configure BlockBasedTableOptions.prepopulate_block_cache using DB::SetOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8620

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D30091319

Pulled By: akankshamahajan15

fbshipit-source-id: fb586d1848a8dd525bba7b2f9eeac34f2fc6d82c
2021-08-05 19:44:51 -07:00
Yanqin Jin
b01a428d9b Update HISTORY for PR8585 (#8623)
Summary:
Update HISTORY.md for PR https://github.com/facebook/rocksdb/issues/8585 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8623

Reviewed By: ltamasi

Differential Revision: D30121910

Pulled By: riversand963

fbshipit-source-id: 525af43fad908a498f22ed4f934ec5cbf60e6d25
2021-08-04 18:45:52 -07:00
Andrew Kryczka
a685a701ca Do not attempt to rename non-existent info log (#8622)
Summary:
Previously we attempted to rename "LOG" to "LOG.old.*" without checking
its existence first. "LOG" had no reason to exist in a new DB.

Errors in renaming a non-existent "LOG" were swallowed via
`PermitUncheckedError()` so things worked. However the storage service's
error monitoring was detecting all these benign rename failures. So it
is better to fix it. Also with this PR we can now distinguish rename failure
for other reasons and return them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622

Test Plan: new unit test

Reviewed By: akankshamahajan15

Differential Revision: D30115189

Pulled By: ajkr

fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170
2021-08-04 17:25:00 -07:00
Levi Tamasi
08af0ae3f0 Mention PR 8605 in HISTORY.md (#8619)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8619

Reviewed By: riversand963

Differential Revision: D30081937

Pulled By: ltamasi

fbshipit-source-id: 57505957ae2c22d4b194aa28cb3fd261b3b39919
2021-08-03 16:15:12 -07:00
Akanksha Mahajan
8b2f60b668 Cache warming blocks during flush (#8561)
Summary:
Insert warm blocks  (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561

Test Plan: Added unit test

Reviewed By: anand1976

Differential Revision: D29773411

Pulled By: akankshamahajan15

fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66
2021-08-03 12:44:15 -07:00
Mikhail Golubev
8f52972cf9 Allow to use a string as a delimiter in StringAppendOperator (#8536)
Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536

Reviewed By: mrambacher

Differential Revision: D29962120

Pulled By: zhichao-cao

fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f
2021-08-02 16:50:41 -07:00
mrambacher
ab7f7c9e49 Allow WAL dir to change with db dir (#8582)
Summary:
Prior to this change, the "wal_dir"  DBOption would always be set (defaults to dbname) when the DBOptions were sanitized.  Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.

After this change, the "wal_dir" option is only set under specific circumstances.  Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname.  Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).

Tests were added to the core and ldb to test that a database could be created and renamed without issue.  Additional tests for various permutations of wal_dir were also added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582

Reviewed By: pdillinger, autopear

Differential Revision: D29881122

Pulled By: mrambacher

fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
2021-07-30 12:16:44 -07:00
Jay Zhuang
a4b8ac9a73 Fix HISTORY.md for #8518 (#8594)
Summary:
PR https://github.com/facebook/rocksdb/issues/8518 merge the change to wrong section.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8594

Reviewed By: riversand963

Differential Revision: D29974565

Pulled By: jay-zhuang

fbshipit-source-id: 51c930d93fbdb406fe31ff73c96548a6f88b9965
2021-07-28 16:10:46 -07:00
mrambacher
3aee4fbd41 Make EventListener into a Customizable Class (#8473)
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473

Reviewed By: zhichao-cao

Differential Revision: D29901488

Pulled By: mrambacher

fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
2021-07-27 07:47:02 -07:00
Yanqin Jin
2e5388178f Return error if trying to open secondary on missing or inaccessible primary (#8200)
Summary:
If the primary's CURRENT file is missing or inaccessible, the secondary should not hang
trying repeatedly to switch to the next MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D27840627

Pulled By: riversand963

fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1
2021-07-22 15:48:58 -07:00
Jay Zhuang
42eaa45c1b Avoid updating option if there's no value updated (#8518)
Summary:
Try avoid expensive updating options operation if
`SetDBOptions()` does not change any option value.
Skip updating is not guaranteed, for example, changing `bytes_per_sync`
to `0` may still trigger updating, as the value could be sanitized.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518

Test Plan: added unittest

Reviewed By: riversand963

Differential Revision: D29672639

Pulled By: jay-zhuang

fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4
2021-07-21 13:45:59 -07:00
Jay Zhuang
c04a86a0e9 Update HISTORY.md and version.h 6.23 release (#8552)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8552

Reviewed By: ajkr

Differential Revision: D29746828

Pulled By: jay-zhuang

fbshipit-source-id: 17d564895ae9cb675d455e73626b9a6717db6279
2021-07-16 17:52:14 -07:00
hongrubb
870033291a Fix Get() return status when block cache is disabled (#8485)
Summary:
This PR is for https://github.com/facebook/rocksdb/issues/8453

We need to update `s = biter.status();`  when `biter.status().IsIncomplete()` is true. By doing this, can fix the problem in issue.
Besides, we still need to update `db_statistics`  in `get_context.ReportCounters()` before return back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8485

Reviewed By: jay-zhuang

Differential Revision: D29604835

Pulled By: ajkr

fbshipit-source-id: c7f2f1cd058223ce1b507ec05d57cf264b9c9710
2021-07-13 18:13:24 -07:00
anand76
d1b70b05a6 Avoid passing existing BG error to WriteStatusCheck (#8511)
Summary:
In ```DBImpl::WriteImpl()```, we call ```PreprocessWrite()``` which, among other things, checks the BG error and returns it set. This return status is later on passed to ```WriteStatusCheck()```, which calls ```SetBGError()```. This results in a spurious call, and info logs, on every user write request. We should avoid passing the ```PreprocessWrite()``` return status to ```WriteStatusCheck()```, as the former would have called ```SetBGError()``` already if it encountered any new errors, such as error when creating a new WAL file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8511

Test Plan: Run existing tests

Reviewed By: zhichao-cao

Differential Revision: D29639917

Pulled By: anand1976

fbshipit-source-id: 19234163969e1645dbeb273712aaf5cd9ea2b182
2021-07-11 22:37:52 -07:00
mrambacher
570248aeff Make SecondaryCache Customizable (#8480)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8480

Reviewed By: zhichao-cao

Differential Revision: D29528740

Pulled By: mrambacher

fbshipit-source-id: fd0f70d15f66611c8498257a9973f7e98ca13839
2021-07-06 09:18:08 -07:00
Akanksha Mahajan
c76778e2bd Call OnCompactionCompleted API in case of DisableManualCompaction (#8469)
Summary:
Call OnCompactionCompleted API in case of
DisableManualCompaction() with updated Status::Incomplete

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8469

Reviewed By: ajkr

Differential Revision: D29475517

Pulled By: akankshamahajan15

fbshipit-source-id: a1726c5e6ee18c0b5097ea04f5e6975fbe108055
2021-07-01 19:18:55 -07:00
Jay Zhuang
93a7389442 Add statistics support on CompactionService remote side (#8368)
Summary:
Add statistics option on CompactionService remote side.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8368

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D28944427

Pulled By: jay-zhuang

fbshipit-source-id: 2a19217f4a69b6e511af87eed12391860ef00c5e
2021-06-29 11:48:14 -07:00
mrambacher
89f66d4484 Add customizable_util.h to the public API (#8301)
Summary:
Useful for allowing new classes to create and manage Customizable objects without using internal APIs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8301

Reviewed By: zhichao-cao

Differential Revision: D29428303

Pulled By: mrambacher

fbshipit-source-id: 3d33d5197cc8379fe35b54d3d169f91f0dfe7a47
2021-06-29 09:08:57 -07:00
mrambacher
be219089ad Add BlobMetaData retrieval methods (#8273)
Summary:
Added BlobMetaData to ColumnFamilyMetaData and LiveBlobMetaData and DB API GetLiveBlobMetaData to retrieve it.

First pass at struct.  More tests and maybe fields to come...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8273

Reviewed By: ltamasi

Differential Revision: D29102400

Pulled By: mrambacher

fbshipit-source-id: 8a2383a4446328be6b91dced9841fdd3dfc80b73
2021-06-28 08:13:29 -07:00
Akanksha Mahajan
be8199cdb9 Run Merge with Integrated BlobDB in stress, crash and db_bench (#8461)
Summary:
Run Merge with Intergrated BlobDB in stress tests, crash tests and db_bench.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8461

Test Plan:
1. python3 -u tools/db_crashtest.py --simple whitebox
---use_merge=1 --enable_blob_files=1
           2.  ./db_bench --benchmarks="readwhilemerging"
--merge_operator=uint64add --enable_blob_files=true

Reviewed By: ltamasi

Differential Revision: D29394824

Pulled By: akankshamahajan15

fbshipit-source-id: 0a8e492b13129673e088fb8af3402ab678bb473a
2021-06-25 10:45:52 -07:00
Andrew Kryczka
3d844dff1d add missing fields to GetLiveFilesMetaData() (#8460)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8460

Reviewed By: jay-zhuang

Differential Revision: D29381865

Pulled By: ajkr

fbshipit-source-id: 47ba54c25f3cc039d72ea32e1df20875795683b3
2021-06-24 21:05:03 -07:00
Levi Tamasi
66b62a123a Update HISTORY.md for PR 8450 (#8458)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8458

Reviewed By: zhichao-cao

Differential Revision: D29378728

Pulled By: ltamasi

fbshipit-source-id: d5a40b1414500f53823763be5c2bfce8db04daf8
2021-06-24 17:26:32 -07:00
Baptiste Lemaire
3f20925dc4 Add list live files metadata (#8446)
Summary:
Add an argument to ldb to dump live file names, column families, and levels, `list_live_files_metadata`. The output shows all active SST file names, sorted first by column family and then by level. For each level the SST files are sorted alphabetically.

Typically, the output looks like this:
```
./ldb --db=/tmp/test_db list_live_files_metadata
Live SST Files:
===== Column Family: default =====
---------- level 0 ----------
/tmp/test_db/000069.sst
---------- level 1 ----------
/tmp/test_db/000064.sst
/tmp/test_db/000065.sst
/tmp/test_db/000066.sst
/tmp/test_db/000071.sst
---------- level 2 ----------
/tmp/test_db/000038.sst
/tmp/test_db/000039.sst
/tmp/test_db/000052.sst
/tmp/test_db/000067.sst
/tmp/test_db/000070.sst
------------------------------
```

Second, a flag was added `--sort_by_filename`, to change the layout of the output. When this flag is added to the command, the output shows all active SST files sorted by name, in front of which the LSM level and the column family are mentioned. With the same example, the following command would return:
```
./ldb --db=/tmp/test_db list_live_files_metadata --sort_by_filename
Live SST Files:
/tmp/test_db/000038.sst : level 2, column family 'default'
/tmp/test_db/000039.sst : level 2, column family 'default'
/tmp/test_db/000052.sst : level 2, column family 'default'
/tmp/test_db/000064.sst : level 1, column family 'default'
/tmp/test_db/000065.sst : level 1, column family 'default'
/tmp/test_db/000066.sst : level 1, column family 'default'
/tmp/test_db/000067.sst : level 2, column family 'default'
/tmp/test_db/000069.sst : level 0, column family 'default'
/tmp/test_db/000070.sst : level 2, column family 'default'
/tmp/test_db/000071.sst : level 1, column family 'default'
------------------------------
```

Thus, the user can either request to show the files by levels, or sorted by filenames.
This PR includes a simple Python unit test that makes sure the file name and level printed out by this new feature matches the one found with an existing feature, `dump_live_file`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8446

Reviewed By: akankshamahajan15

Differential Revision: D29320080

Pulled By: bjlemaire

fbshipit-source-id: 01fb7b5637c59010d74c80730a28d815994e7009
2021-06-22 19:07:46 -07:00
Levi Tamasi
005c046734 Update HISTORY.md for PR 8437 (#8442)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8442

Reviewed By: akankshamahajan15

Differential Revision: D29306432

Pulled By: ltamasi

fbshipit-source-id: 0fe55736368ac024b2f51213b65f4b06d71d60e6
2021-06-22 13:37:22 -07:00
Jay Zhuang
f89423a57a Revert "Revert "Snapshot release triggered compaction without multiple tombstones (#8357)" (#8410)" (#8438)
Summary:
This reverts commit 25be1ed66a.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8438

Test Plan: Run the impacted mysql test 40 times

Reviewed By: ajkr

Differential Revision: D29286247

Pulled By: jay-zhuang

fbshipit-source-id: d3bd056971a19a8b012d5d0295fa045c012b3c04
2021-06-22 11:10:03 -07:00
Andrew Kryczka
b511b042f4 Fixup HISTORY.md for 6.22 release (#8441)
Summary:
`git diff origin/6.21.fb origin/6.22.fb -- HISTORY.md` looked odd.
This PR fixes it up by moving items from 6.21.0 to 6.22.0 that were
never in any 6.21 release. Also mentioned the background stat collection
fix under 6.22 (previously it was mentioned under 6.21 patch releases
only).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8441

Reviewed By: jay-zhuang

Differential Revision: D29304812

Pulled By: ajkr

fbshipit-source-id: 2a928a9518a1d6615321d5c2d1e22b17cbb59093
2021-06-22 10:30:10 -07:00
Jay Zhuang
54d73d6429 Fix DeleteFilesInRange may cause inconsistent compaction error (#8434)
Summary:
`DeleteFilesInRange()` marks deleting files to `being_compacted`
before deleting, which may cause ongoing compactions report corruption
exception or ASSERT for debug build.

Adding the missing `ComputeCompactionScore()` when `being_compacted` is set.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8434

Test Plan: Unittest

Reviewed By: ajkr

Differential Revision: D29276127

Pulled By: jay-zhuang

fbshipit-source-id: f5b223e3c1fc6d821e100e3f3442bc70c1d50cf7
2021-06-22 09:17:37 -07:00
Andrew Kryczka
7ed69fbe72 Update HISTORY.md and version.h 6.22 release (#8427)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8427

Reviewed By: zhichao-cao

Differential Revision: D29246916

Pulled By: ajkr

fbshipit-source-id: ccd44ca1a6dd5101dc37f19b8e1fe6c0e3883e0a
2021-06-19 12:12:46 -07:00
Baptiste Lemaire
e817bc9628 Added memtable garbage statistics (#8411)
Summary:
**Summary**:
2 new statistics counters are added to RocksDB: `MEMTABLE_PAYLOAD_BYTES_AT_FLUSH` and `MEMTABLE_GARBAGE_BYTES_AT_FLUSH`. The former tracks how many raw bytes of useful data are present on the memtable at flush time, whereas the latter is tracks how many of these raw bytes are considered garbage, meaning that they ended up not being imported on the SSTables resulting from the flush operations.

**Unit test**: run `make db_flush_test -j$(nproc); ./db_flush_test` to run the unit test.
This executable includes 3 tests, that test support and correct stat calculations for workloads with inserts, deletes, and DeleteRanges. The parameters are set such that the workloads are performed on a single memtable, and a single SSTable is created as a result of the flush operation. The flush operation is manually called in the test file. The tests verify that the values of these 2 statistics counters introduced in this PR  can be exactly predicted, showing that we have a full understanding of the underlying operations.

**Performance testing**:
`./db_bench -statistics -benchmarks=fillrandom -num=10000000` repeated 10 times.
Timing done using "date" function in a bash script.
_Results_:
Original Rocksdb fork: mean 66.6 sec, std 1.18 sec.
This feature branch: mean 67.4 sec, std 1.35 sec.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8411

Reviewed By: akankshamahajan15

Differential Revision: D29150629

Pulled By: bjlemaire

fbshipit-source-id: 7b3c2e86d50c6aa34fa50fd134282eacb543a5b1
2021-06-18 04:57:27 -07:00
Akanksha Mahajan
5ba1b6e549 Cache warming data blocks during flush (#8242)
Summary:
This PR prepopulates warm/hot data blocks which are already in memory
into block cache at the time of flush. On a flush, the data block that is
in memory (in memtables) get flushed to the device. If using Direct IO,
additional IO is incurred to read this data back into memory again, which
is avoided by enabling newly added option.

 Right now, this is enabled only for flush for data blocks. We plan to
expand this option to cover compactions in the future and for other types
 of blocks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242

Test Plan: Add new unit test

Reviewed By: anand1976

Differential Revision: D28521703

Pulled By: akankshamahajan15

fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
2021-06-17 21:56:47 -07:00
聂佩轩
d53f7ff69a Add DeteleRange support for DBWithTTL (#8384)
Summary:
This commit is for enabling `DBWithTTL` to use `DeteleRange` which it cannot before.
As (int32_t)Timestamp is suffixed to values in `DBWithTTL`, there is no reason that it
cannot use the common used api. I added `DeleteRangeCF` in `DBWithTTLImpl::Write`
so that we can use `DeteleRange` normally. When we run code like
`dbWithTtl->DeleteRange(start, end)`, it executes`WriteBatchInternal::DeleteRange`
internally. Intended to fix https://github.com/facebook/rocksdb/issues/7218

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8384

Test Plan: added corresponded testing logic to existing unit test

Reviewed By: jay-zhuang

Differential Revision: D29176734

fbshipit-source-id: 6874ed979fc08e1d138149d03653e43a75f0e0e6
2021-06-17 16:00:50 -07:00
Peter Dillinger
865a25101d Mark Ribbon filter and optimize_filters_for_memory as production (#8408)
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.

Some small refactoring in db_stress.

Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408

Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:

bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384

ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384

Reviewed By: mrambacher

Differential Revision: D29140805

Pulled By: pdillinger

fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
2021-06-17 12:29:16 -07:00
Andrew Kryczka
25be1ed66a Revert "Snapshot release triggered compaction without multiple tombstones (#8357)" (#8410)
Summary:
This reverts commit 9167ece586.

It was found to reliably trip a compaction picking conflict assertion in a MyRocks unit test. We don't understand why yet so reverting in the meantime.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8410

Test Plan: `make check -j48`

Reviewed By: jay-zhuang

Differential Revision: D29150300

Pulled By: ajkr

fbshipit-source-id: 2de8664f355d6da015e84e5fec2e3f90f49741c8
2021-06-15 18:15:15 -07:00
Levi Tamasi
146263887f Disable subcompactions for user-defined timestamps (#8393)
Summary:
The subcompaction boundary picking logic does not currently guarantee
that all user keys that differ only by timestamp get processed by the same
subcompaction. This can cause issues with the `CompactionIterator` state
machine: for instance, one subcompaction that processes a subset of such KVs
might drop a tombstone based on the KVs it sees, while in reality the
tombstone might not have been eligible to be optimized out.
(See also https://github.com/facebook/rocksdb/issues/6645, which adjusted the way compaction inputs are picked for the
same reason.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8393

Test Plan: Ran `make check` and the crash test script with timestamps enabled.

Reviewed By: jay-zhuang

Differential Revision: D29071635

Pulled By: ltamasi

fbshipit-source-id: f6c72442122b4e581871e096fabe3876a9e8a5a6
2021-06-12 12:09:25 -07:00
Peter Dillinger
b3dbeadc34 Fix double-dumping CF stats to log (#8380)
Summary:
DBImpl::DumpStats is supposed to do this:
Dump DB stats to LOG
For each CF, dump CFStatsNoFileHistogram to LOG
For each CF, dump CFFileHistogram to LOG

Instead, due to a longstanding bug from 2017 (https://github.com/facebook/rocksdb/issues/2126), it would dump
CFStats, which includes both CFStatsNoFileHistogram and CFFileHistogram,
in both loops, resulting in near-duplicate output.

This fixes the bug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8380

Test Plan: Manual inspection of LOG after db_bench

Reviewed By: jay-zhuang

Differential Revision: D29017535

Pulled By: pdillinger

fbshipit-source-id: 3010604c4a629a80347f129cd746ce9b0d0cbda6
2021-06-11 17:06:09 -07:00
matthewvon
5a2b4ed671 BugFix: fs_posix.cc GetFreeSpace uses wrong value non-root users (#8370)
Summary:
fs_posix.cc GetFreeSpace() calculates free space based upon a call to statvfs().  However, there are two extremely different values in statvfs's returned structure:  f_bfree which is free space for root and f_bavail which is free space for non-root users.  The existing code uses f_bfree.  Many disks have 5 to 10% of the total disk space reserved for root only.  Therefore GetFreeSpace() does not realize that non-root users may not have storage available.

This PR detects whether the effective posix user is root or not, then selects the appropriate available space value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8370

Reviewed By: mrambacher

Differential Revision: D29032710

Pulled By: jay-zhuang

fbshipit-source-id: 57feba34ed035615a479956d28f98d85735281c0
2021-06-10 11:11:54 -07:00
David Devecsery
80a59a03a7 Cancel compact range (#8351)
Summary:
Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure.

Tested via two tests added to db_tests2.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351

Reviewed By: ajkr

Differential Revision: D28808894

Pulled By: ddevec

fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd
2021-06-07 11:41:31 -07:00
Andrew Kryczka
9167ece586 Snapshot release triggered compaction without multiple tombstones (#8357)
Summary:
This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase.

This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop.

The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are:

- files could have `largest_seqno > 0` forever making it impossible to tell they cannot contain any covering keys
- it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use case actually doesn't use Delete but instead calls Put with empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357

Test Plan: - make check

Reviewed By: jay-zhuang

Differential Revision: D28855340

Pulled By: ajkr

fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798
2021-06-04 00:21:40 -07:00
anand76
799cf37cb1 Update HISTORY and version to 6.21 (#8363)
Summary:
Update HISTORY and version to 6.21 on master.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8363

Reviewed By: jay-zhuang

Differential Revision: D28888818

Pulled By: anand1976

fbshipit-source-id: 9e5fac3b99ecc9f3b7d9f21474a39fa50decb117
2021-06-03 19:32:14 -07:00
Peter Dillinger
956ce9bde2 Some API clarification for manual compaction and listeners (#8330)
Summary:
Avoid people hitting bugs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8330

Test Plan: comments only

Reviewed By: siying

Differential Revision: D28683157

Pulled By: pdillinger

fbshipit-source-id: 2b34d3efb5e2fa34bea93d54c940cbd425212d25
2021-05-26 08:14:38 -07:00
Peter Dillinger
3469d60fcc Add table properties for number of entries added to filters (#8323)
Summary:
With Ribbon filter work and possible variance in actual bits
per key (or prefix; general term "entry") to achieve certain FP rates,
I've received a request to be able to track actual bits per key in
generated filters. This change adds a num_filter_entries table
property, which can be combined with filter_size to get bits per key
(entry).

This can vary from num_entries in at least these ways:
* Different versions of same key are only counted once in filters.
* With prefix filters, several user keys map to the same filter entry.
* A single filter can include both prefixes and user keys.

Note that FilterBlockBuilder::NumAdded() didn't do anything useful
except distinguish empty from non-empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323

Test Plan: basic unit test included, others updated

Reviewed By: jay-zhuang

Differential Revision: D28596210

Pulled By: pdillinger

fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376
2021-05-21 17:11:32 -07:00
Jay Zhuang
6c86543590 Fix manual compaction max_compaction_bytes under-calculated issue (#8269)
Summary:
Fix a bug that for manual compaction, `max_compaction_bytes` is only
limit the SST files from input level, but not overlapped files on output
level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8269

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D28231044

Pulled By: jay-zhuang

fbshipit-source-id: 9d7d03004f30cc4b1b9819830141436907554b7c
2021-05-21 14:03:44 -07:00
sdong
bd3d080ef8 Try to build with liburing by default. (#8322)
Summary:
By default, try to build with liburing. For make, if ROCKSDB_USE_IO_URING is not set, treat as 1, which means RocksDB will try to build with liburing. For cmake, add WITH_LIBURING to control it, with default on.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8322

Test Plan: Build using cmake and make.

Reviewed By: anand1976

Differential Revision: D28586498

fbshipit-source-id: cfd39159ab697f4b93a9293a59c07f839b1e7ed5
2021-05-21 10:21:53 -07:00
sdong
2f1984dd45 Compare memtable insert and flush count (#8288)
Summary:
When a memtable is flushed, it will validate number of entries it reads, and compare the number with how many entries inserted into memtable. This serves as one sanity c\
heck against memory corruption. This change will also allow more counters to be added in the future for better validation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288

Test Plan: Pass all existing tests

Reviewed By: ajkr

Differential Revision: D28369194

fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
2021-05-20 16:07:28 -07:00
Jay Zhuang
3786181a90 Add remote compaction public API (#8300)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8300

Reviewed By: ajkr

Differential Revision: D28464726

Pulled By: jay-zhuang

fbshipit-source-id: 49e9f4fb791808a6cbf39a7b1a331373f645fc5e
2021-05-19 21:41:31 -07:00
Peter Dillinger
311a544c2a Use deleters to label cache entries and collect stats (#8297)
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.

The new information can be seen in info LOG, for example:

    Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
    Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)

And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):

    $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
    rocksdb.block-cache-entry-stats.bytes.data-block: 0
    rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.bytes.index-block: 178992
    rocksdb.block-cache-entry-stats.bytes.misc: 0
    rocksdb.block-cache-entry-stats.bytes.other-block: 0
    rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
    rocksdb.block-cache-entry-stats.capacity: 8388608
    rocksdb.block-cache-entry-stats.count.data-block: 0
    rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.count.index-block: 215
    rocksdb.block-cache-entry-stats.count.misc: 1
    rocksdb.block-cache-entry-stats.count.other-block: 0
    rocksdb.block-cache-entry-stats.count.write-buffer: 0
    rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
    rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
    rocksdb.block-cache-entry-stats.percent.misc: 0.000000
    rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
    rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
    rocksdb.block-cache-entry-stats.secs_since_last_collection: 0

Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.

This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.

An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".

Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.

Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.

This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297

Test Plan: manual testing with db_bench, and a couple of basic unit tests

Reviewed By: ltamasi

Differential Revision: D28488721

Pulled By: pdillinger

fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 16:51:13 -07:00
anand76
9d61a0856d Sync ingested files only if reopen is supported by the FS (#8296)
Summary:
Some file systems (especially distributed FS) do not support reopening a file for writing. The ExternalSstFileIngestionJob calls ReopenWritableFile in order to sync the ingested file, which typically makes sense only on a local file system with a page cache (i.e Posix). So this change tries to sync the ingested file only if ReopenWritableFile doesn't return Status::NotSupported().

Tests:
Add a new unit test in external_sst_file_basic_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8296

Reviewed By: jay-zhuang

Differential Revision: D28420865

Pulled By: anand1976

fbshipit-source-id: 380e7f5ff95324997f7a59864a9ac96ebbd0100c
2021-05-18 19:33:55 -07:00
sdong
60e5af83c1 Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe() (#8311)
Summary:
Right now return codes by io_uring_submit_and_wait() and io_uring_wait_cqe() are not handled. It is not the good practice. Although these two functions are not supposed to return non-0 values in normal exeuction, people suspect that they might return non-0 value when an interruption happens, and the code might cause hanging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8311

Test Plan: Make sure at least normal test cases still pass.

Reviewed By: anand1976

Differential Revision: D28500828

fbshipit-source-id: 8a76cea9cafbd041102e0b6a8eef9d0bfed7c211
2021-05-18 16:09:14 -07:00
Peter Dillinger
78a309bf86 New Cache API for gathering statistics (#8225)
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:

* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)

I have enhanced cache_bench to validate this approach:

* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.

To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225

Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.

The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.

Baseline typical output:

```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662

Operation latency (ns):
Count: 80000000 Average: 11223.9494  StdDev: 29.61
Min: 0  Median: 7759.3973  Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[       0,       1 ]       68   0.000%   0.000%
(    2900,    4400 ]       89   0.000%   0.000%
(    4400,    6600 ] 33630240  42.038%  42.038% ########
(    6600,    9900 ] 18129842  22.662%  64.700% #####
(    9900,   14000 ]  7877533   9.847%  74.547% ##
(   14000,   22000 ] 15193238  18.992%  93.539% ####
(   22000,   33000 ]  3037061   3.796%  97.335% #
(   33000,   50000 ]  1626316   2.033%  99.368%
(   50000,   75000 ]   421532   0.527%  99.895%
(   75000,  110000 ]    56910   0.071%  99.966%
(  110000,  170000 ]    16134   0.020%  99.986%
(  170000,  250000 ]     5166   0.006%  99.993%
(  250000,  380000 ]     3017   0.004%  99.996%
(  380000,  570000 ]     1337   0.002%  99.998%
(  570000,  860000 ]      805   0.001%  99.999%
(  860000, 1200000 ]      319   0.000% 100.000%
( 1200000, 1900000 ]      231   0.000% 100.000%
( 1900000, 2900000 ]      100   0.000% 100.000%
( 2900000, 4300000 ]       39   0.000% 100.000%
( 4300000, 6500000 ]       16   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
```

New, gather_stats=false. Median thread ops/sec of 5 runs:

```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458

Operation latency (ns):
Count: 80000000 Average: 11298.1027  StdDev: 42.18
Min: 0  Median: 7722.0822  Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[       0,       1 ]      109   0.000%   0.000%
(    2900,    4400 ]      793   0.001%   0.001%
(    4400,    6600 ] 34054563  42.568%  42.569% #########
(    6600,    9900 ] 17482646  21.853%  64.423% ####
(    9900,   14000 ]  7908180   9.885%  74.308% ##
(   14000,   22000 ] 15032072  18.790%  93.098% ####
(   22000,   33000 ]  3237834   4.047%  97.145% #
(   33000,   50000 ]  1736882   2.171%  99.316%
(   50000,   75000 ]   446851   0.559%  99.875%
(   75000,  110000 ]    68251   0.085%  99.960%
(  110000,  170000 ]    18592   0.023%  99.983%
(  170000,  250000 ]     7200   0.009%  99.992%
(  250000,  380000 ]     3334   0.004%  99.997%
(  380000,  570000 ]     1393   0.002%  99.998%
(  570000,  860000 ]      700   0.001%  99.999%
(  860000, 1200000 ]      293   0.000% 100.000%
( 1200000, 1900000 ]      196   0.000% 100.000%
( 1900000, 2900000 ]       69   0.000% 100.000%
( 2900000, 4300000 ]       32   0.000% 100.000%
( 4300000, 6500000 ]       10   0.000% 100.000%
```

New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:

```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551

Operation latency (ns):
Count: 80000000 Average: 11311.2629  StdDev: 45.28
Min: 0  Median: 7686.5458  Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[       0,       1 ]       71   0.000%   0.000%
(    2900,    4400 ]      291   0.000%   0.000%
(    4400,    6600 ] 34492060  43.115%  43.116% #########
(    6600,    9900 ] 16727328  20.909%  64.025% ####
(    9900,   14000 ]  7845828   9.807%  73.832% ##
(   14000,   22000 ] 15510654  19.388%  93.220% ####
(   22000,   33000 ]  3216533   4.021%  97.241% #
(   33000,   50000 ]  1680859   2.101%  99.342%
(   50000,   75000 ]   439059   0.549%  99.891%
(   75000,  110000 ]    60540   0.076%  99.967%
(  110000,  170000 ]    14649   0.018%  99.985%
(  170000,  250000 ]     5242   0.007%  99.991%
(  250000,  380000 ]     3260   0.004%  99.995%
(  380000,  570000 ]     1599   0.002%  99.997%
(  570000,  860000 ]     1043   0.001%  99.999%
(  860000, 1200000 ]      471   0.001%  99.999%
( 1200000, 1900000 ]      275   0.000% 100.000%
( 1900000, 2900000 ]      143   0.000% 100.000%
( 2900000, 4300000 ]       60   0.000% 100.000%
( 4300000, 6500000 ]       27   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
( 9800000, 14000000 ]        1   0.000% 100.000%

Gather stats latency (us):
Count: 46 Average: 980387.5870  StdDev: 60911.18
Min: 879155  Median: 1033777.7778  Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
(  860000, 1200000 ]       45  97.826%  97.826% ####################
( 1200000, 1900000 ]        1   2.174% 100.000%

Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```

Reviewed By: mrambacher

Differential Revision: D28295742

Pulled By: pdillinger

fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
2021-05-11 16:17:10 -07:00
mrambacher
9f2d255aed Add ObjectRegistry to ConfigOptions (#8166)
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used

The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166

Reviewed By: zhichao-cao

Differential Revision: D27657952

Pulled By: mrambacher

fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
2021-05-11 06:47:22 -07:00
mrambacher
ff463742b5 Add Merge Operator support to WriteBatchWithIndex (#8135)
Summary:
The WBWI has two differing modes of operation dependent on the value
of the constructor parameter `overwrite_key`.
Currently, regardless of the parameter, neither mode performs as
expected when using Merge. This PR remedies this by correctly invoking
the appropriate Merge Operator before returning results from the WBWI.

Examples of issues that exist which are solved by this PR:

## Example 1 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.

## Example 2 with o`verwrite_key=true`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 3 with `overwrite_key=false`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 4 with `overwrite_key=true`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v1')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 5 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 6 with `overwrite_key=true`
Currently, from an empty database, `('k1' -> 'v1')`, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135

Reviewed By: pdillinger

Differential Revision: D27657938

Pulled By: mrambacher

fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
2021-05-10 12:50:25 -07:00
sdong
f89a53655d Change date format in HISTORY.md (#8278)
Summary:
Per previous discussion, change date format in HISTORY.md to follow ISO 8601.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8278

Reviewed By: jay-zhuang

Differential Revision: D28294022

fbshipit-source-id: 563f29c56143519b4a871df82a17dd0a168a578c
2021-05-07 16:16:30 -07:00
Andrew Kryczka
a639c02f8e Allow applying CompactionFilter outside of compaction (#8243)
Summary:
From HISTORY.md release note:

- Allow `CompactionFilter`s to apply in more table file creation scenarios such as flush and recovery. For compatibility, `CompactionFilter`s by default apply during compaction. Users can customize this behavior by overriding `CompactionFilterFactory::ShouldFilterTableFileCreation()`.
- Removed unused structure `CompactionFilterContext`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8243

Test Plan: added unit tests

Reviewed By: pdillinger

Differential Revision: D28088089

Pulled By: ajkr

fbshipit-source-id: 0799be7908e3b39fea09fc3f1ab00e13ad817fae
2021-05-07 16:01:40 -07:00
Peter Dillinger
c26b75baa5 Deprecate obsolete "backupable db" from public APIs (#8274)
Summary:
An early design of BackupEngine used stackable DB, so I guess a
DB had to opt-in to being backupable. Unfortunately the naming of that
obsolete design still infects our public API and implementation.

This change fixes the public API, with a deprecated
backward-compatibility header. `BackupableDBOptions` is renamed to
`BackupEngineOptions` (copy-replace in the public header) and
backup_engine.h replaces backupable_db.h (present for backward
compatibility). The only other change in backupable_db.h ->
backup_engine.h is cleaning up headers.

Later changes will fix the internal implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274

Test Plan:
The internal implementation of BackupEngine uses the name
BackupEngineOptions, while the unit tests use the old name
BackupableDBOptions. This gives me confidence that both still work.

Reviewed By: mrambacher

Differential Revision: D28259471

Pulled By: pdillinger

fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613
2021-05-07 13:53:15 -07:00
sdong
a4919d6b62 Cap automatic arena block size to 1 MB (#7907)
Summary:
Larger arena block size does provide the benefit of reducing allocation overhead, however it may cause other troubles. For example, allocator is more likely not to allocate them to physical memory and trigger page fault. Weighing the risk, we cap the arena block size to 1MB. Users can always use a larger value if they want.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7907

Test Plan: Run all existing tests

Reviewed By: pdillinger

Differential Revision: D26135269

fbshipit-source-id: b7f55afd03e6ee1d8715f90fa11b6c33944e9ea8
2021-05-07 13:15:34 -07:00
Andrew Kryczka
0f42e50fec Fix GetLiveFiles() returning OPTIONS-000000 (#8268)
Summary:
See release note in HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268

Test Plan: unit test repro

Reviewed By: siying

Differential Revision: D28227901

Pulled By: ajkr

fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339
2021-05-05 12:54:46 -07:00
Peter Dillinger
3b981eaa1d Fix use-after-free threading bug in ClockCache (#8261)
Summary:
In testing for https://github.com/facebook/rocksdb/issues/8225 I found cache_bench would crash with
-use_clock_cache, as well as db_bench -use_clock_cache, but not
single-threaded. Smaller cache size hits failure much faster. ASAN
reported the failuer as calling malloc_usable_size on the `key` pointer
of a ClockCache handle after it was reportedly freed. On detailed
inspection I found this bad sequence of operations for a cache entry:

state=InCache=1,refs=1
[thread 1] Start ClockCacheShard::Unref (from Release, no mutex)
[thread 1] Decrement ref count
state=InCache=1,refs=0
[thread 1] Suspend before CalcTotalCharge (no mutex)

[thread 2] Start UnsetInCache (from Insert, mutex held)
[thread 2] clear InCache bit
state=InCache=0,refs=0
[thread 2] Calls RecycleHandle (based on pre-updated state)
[thread 2] Returns to Insert which calls Cleanup which deletes `key`

[thread 1] Resume ClockCacheShard::Unref
[thread 1] Read `key` in CalcTotalCharge

To fix this, I've added a field to the handle to store the metadata
charge so that we can efficiently remember everything we need from
the handle in Unref. We must not read from the handle again if we
decrement the count to zero with InCache=1, which means we don't own
the entry and someone else could eject/overwrite it immediately.

Note before this change, on amd64 sizeof(Handle) == 56 even though there
are only 48 bytes of data. Grouping together the uint32_t fields would
cut it down to 48, but I've added another uint32_t, which takes it
back up to 56. Not a big deal.

Also fixed DisownData to cooperate with ASAN as in LRUCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8261

Test Plan:
Manual + adding use_clock_cache to db_crashtest.py

Base performance
./cache_bench -use_clock_cache
Complete in 17.060 s; QPS = 2458513
New performance
./cache_bench -use_clock_cache
Complete in 17.052 s; QPS = 2459695

Any difference is easily buried in small noise.

Crash test shows still more bug(s) in ClockCache, so I'm expecting to
disable ClockCache from production code in a follow-up PR (if we
can't find and fix the bug(s))

Reviewed By: mrambacher

Differential Revision: D28207358

Pulled By: pdillinger

fbshipit-source-id: aa7a9322afc6f18f30e462c75dbbe4a1206eb294
2021-05-04 22:18:00 -07:00
Peter Dillinger
d2ca04e3ed Add more LSM info to FilterBuildingContext (#8246)
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.

To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.

I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)

At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)

Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246

Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost

Reviewed By: mrambacher

Differential Revision: D28099346

Pulled By: pdillinger

fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
2021-04-30 13:50:13 -07:00
Peter Dillinger
85becd94c1 Refactor: use TableBuilderOptions to reduce parameter lists (#8240)
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.

Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.

Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).

Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240

Test Plan: ASAN make check

Reviewed By: mrambacher

Differential Revision: D28075891

Pulled By: pdillinger

fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
2021-04-29 07:00:50 -07:00
Akanksha Mahajan
a0e0feca62 Improve BlockPrefetcher to prefetch only for sequential scans (#7394)
Summary:
BlockPrefetcher is used by iterators to prefetch data if they
anticipate more data to be used in future and this is valid for forward sequential
scans. But BlockPrefetcher tracks only num_file_reads_ and not if reads
are sequential. This presents problem for MultiGet with large number of
keys when it reseeks index iterator and data block. FilePrefetchBuffer
can end up doing large readahead for reseeks as readahead size
increases exponentially once readahead is enabled. Same issue is with
BlockBasedTableIterator.

Add previous length and offset read as well in BlockPrefetcher (creates
FilePrefetchBuffer) and FilePrefetchBuffer (does prefetching of data) to
determine if reads are sequential and then  prefetch.

Update the last block read after cache hit to take reads from cache also
in account.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7394

Test Plan: Add new unit test case

Reviewed By: anand1976

Differential Revision: D23737617

Pulled By: akankshamahajan15

fbshipit-source-id: 8e6917c25ed87b285ee495d1b68dc623d71205a3
2021-04-28 12:53:46 -07:00
Zhichao Cao
09a9ec3ac0 Fix the false positive alert of CF consistency check in WAL recovery (#8207)
Summary:
In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue.

Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here.

Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207

Test Plan: added unit test, make crash_test

Reviewed By: riversand963

Differential Revision: D27898839

Pulled By: zhichao-cao

fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
2021-04-22 10:28:37 -07:00
Akanksha Mahajan
596e9008e4 Stall writes in WriteBufferManager when memory_usage exceeds buffer_size (#7898)
Summary:
When WriteBufferManager is shared across DBs and column families
to maintain memory usage under a limit, OOMs have been observed when flush cannot
finish but writes continuously insert to memtables.
In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and DBs tries to write,
this change will stall incoming writers until flush is completed and memory_usage
drops.

Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_
(memory_usage() >= buffer_size_) WriterBufferManager::ShouldStall() returns true.

DBImpl first block incoming/future writers by calling write_thread_.BeginWriteStall()
(which adds dummy stall object to the writer's queue).
Then DB is blocked on a state State::Blocked (current write doesn't go
through). WBStallInterface object maintained by every DB instance is added to the queue of
WriteBufferManager.

If multiple DBs tries to write during this stall, they will also be
blocked when check WriteBufferManager::ShouldStall() returns true.

End Stall condition: When flush is finished and memory usage goes down, stall will end only if memory
waiting to be flushed is less than buffer_size/2. This lower limit will give time for flush
to complete and avoid continous stalling if memory usage remains close to buffer_size.

WriterBufferManager::EndWriteStall() is called,
which removes all instances from its queue and signal them to continue.
Their state is changed to State::Running and they are unblocked. DBImpl
then signal all incoming writers of that DB to continue by calling
write_thread_.EndWriteStall() (which removes dummy stall object from the
queue).

DB instance creates WBMStallInterface which is an interface to block and
signal DBs during stall.
When DB needs to be blocked or signalled by WriteBufferManager,
state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898

Test Plan: Added a new test db/db_write_buffer_manager_test.cc

Reviewed By: anand1976

Differential Revision: D26093227

Pulled By: akankshamahajan15

fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae
2021-04-21 13:54:02 -07:00
Peter Dillinger
95f6add746 Revert Ribbon starting level support from #8198 (#8212)
Summary:
This partially reverts commit 10196d7edc.

The problem with this change is because of important filter use cases:
FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so
would only use Ribbon filters if specifically including level 0 for the
Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate
unknown level, and this would be treated the same as level 0 unless
fixed.

We are keeping the part about committing to permanent schema, which is
only changes to API comments and HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212

Test Plan: CI

Reviewed By: jay-zhuang

Differential Revision: D27896468

Pulled By: pdillinger

fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e
2021-04-20 19:46:40 -07:00
Andrew Kryczka
905dd17b35 Fix seqno in ingested file boundary key metadata (#8209)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/6245.

Adapted from https://github.com/facebook/rocksdb/issues/8201 and https://github.com/facebook/rocksdb/issues/8205.

Previously we were writing the ingested file's smallest/largest internal keys
with sequence number zero, or `kMaxSequenceNumber` in case of range
tombstone. The former (sequence number zero) is incorrect and can lead
to files being incorrectly ordered. The fix in this PR is to overwrite
boundary keys that have sequence number zero with the ingested file's assigned
sequence number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8209

Test Plan: repro unit test

Reviewed By: riversand963

Differential Revision: D27885678

Pulled By: ajkr

fbshipit-source-id: 4a9f2c6efdfff81c3a9923e915ea88b250ee7b6a
2021-04-20 14:00:21 -07:00
Levi Tamasi
1b99947e99 Mention PR 8206 in HISTORY.md (#8210)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8210

Reviewed By: akankshamahajan15

Differential Revision: D27887612

Pulled By: ltamasi

fbshipit-source-id: 0db8d0b6047334dc47fe30a98804449043454386
2021-04-20 12:07:40 -07:00
Yanqin Jin
a376c22066 Handle rename() failure in non-local FS (#8192)
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.

This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.

As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
  MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
  code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
  POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
  new MANIFEST.) Therefore, we keep the new MANIFEST.
    - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
    - If process reopens the db immediately after the failure, then the CURRENT file can point
      to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
      succeed and ignore the other.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D27804648

Pulled By: riversand963

fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
2021-04-19 18:11:13 -07:00
Akanksha Mahajan
531a5f88a1 Update release version to 6.20 (#8199)
Summary:
Update release version to 6.20

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8199

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D27838750

Pulled By: akankshamahajan15

fbshipit-source-id: f02f722fc6bdd37d626d47a0e932bbecea3507a8
2021-04-16 20:15:36 -07:00
Peter Dillinger
10196d7edc Ribbon long-term support, starting level support (#8198)
Summary:
Since the Ribbon filter schema seems good (compatible back to
6.15.0), this change commits to long term support of the SST schema,
even though we expect the API for enabling Ribbon to change (still
called NewExperimentalRibbonFilterPolicy).

This also adds support for "hybrid" configuration in which some levels
use Bloom (higher levels, lower numbered) for speed and the rest use
Ribbon (lower levels, higher numbered) for memory space efficiency.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198

Test Plan: unit test added, crash test support

Reviewed By: jay-zhuang

Differential Revision: D27831232

Pulled By: pdillinger

fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab
2021-04-16 15:43:08 -07:00