Commit Graph

4480 Commits

Author SHA1 Message Date
Yanqin Jin
9f7c02dad5 Move compacted_db_impl.[c|h] to db/db_impl (#8082)
Summary:
As title. All core db implementations should stay in db_impl.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8082

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D27211442

Pulled By: riversand963

fbshipit-source-id: e0953fde75064740e899aaff7989ff033b7f5232
2021-03-23 13:49:26 -07:00
storagezhang
c8b0842bcd Remove unused variable (#8067)
Summary:
Remove unused variable `Slice blob_to_write` in `db/blob/blob_file_cache_test.cc`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8067

Reviewed By: zhichao-cao

Differential Revision: D27107693

Pulled By: riversand963

fbshipit-source-id: 9bfd4d296a6a1714ad5c1fa5bb231a0c52dbd56d
2021-03-19 12:13:59 -07:00
storagezhang
d9be6556aa Include C++ standard library headers instead of C compatibility headers (#8068)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8068

Reviewed By: zhichao-cao

Differential Revision: D27147685

Pulled By: riversand963

fbshipit-source-id: 5428b1c0142ecae17c977fba31a6d49b52983d1c
2021-03-19 12:09:47 -07:00
storagezhang
c706324208 Add default in switch (#8065)
Summary:
The switch may not cover all branches in `db/c.cc`:

```c++
void rocksdb_options_set_access_hint_on_compaction_start(
    rocksdb_options_t* opt, int v) {
  switch(v) {
    case 0:
      opt->rep.access_hint_on_compaction_start =
          ROCKSDB_NAMESPACE::Options::NONE;
      break;
    case 1:
      opt->rep.access_hint_on_compaction_start =
          ROCKSDB_NAMESPACE::Options::NORMAL;
      break;
    case 2:
      opt->rep.access_hint_on_compaction_start =
          ROCKSDB_NAMESPACE::Options::SEQUENTIAL;
      break;
    case 3:
      opt->rep.access_hint_on_compaction_start =
          ROCKSDB_NAMESPACE::Options::WILLNEED;
      break;
  }
}
```
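
A standalone illustration (not the actual patch) of covering the missing branch: translate an int into an enum and handle out-of-range input in a `default` case. What the real patch does for unexpected values is assumed here.

```c++
// Illustrative only; the real patch may choose a different fallback.
#include <cassert>

enum class AccessHint { kNone, kNormal, kSequential, kWillNeed };

AccessHint ToAccessHint(int v) {
  switch (v) {
    case 0: return AccessHint::kNone;
    case 1: return AccessHint::kNormal;
    case 2: return AccessHint::kSequential;
    case 3: return AccessHint::kWillNeed;
    default:
      assert(false);               // unexpected value from the C API caller
      return AccessHint::kNormal;  // fall back to the library default
  }
}

int main() { return ToAccessHint(1) == AccessHint::kNormal ? 0 : 1; }
```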

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8065

Reviewed By: riversand963

Differential Revision: D27102892

Pulled By: zhichao-cao

fbshipit-source-id: ad1d20d192712878e61597311ba75b55df0066d7
2021-03-19 11:57:52 -07:00
Zhichao Cao
dd0447ae2c Add new Append API with DataVerificationInfo to Env WritableFile (#8071)
Summary:
Add the new Append and PositionedAppend APIs to the Env WritableFile. Users are able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem has already implemented the checksum handoff API.
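
A hedged sketch of how a custom legacy `Env` `WritableFile` might consume the new overload; the exact signature and the shape of `DataVerificationInfo` (a checksum `Slice` for the appended data) are assumed from this description rather than copied from the header.

```c++
// Hedged sketch, not the actual API surface: the 2-argument Append overload
// and DataVerificationInfo layout are assumptions based on the PR text.
#include <string>
#include "rocksdb/env.h"
#include "rocksdb/file_system.h"  // DataVerificationInfo

class VerifyingWritableFile : public rocksdb::WritableFile {
 public:
  using rocksdb::WritableFile::Append;  // keep the plain overload visible

  rocksdb::Status Append(const rocksdb::Slice& data) override {
    buffer_.append(data.data(), data.size());
    return rocksdb::Status::OK();
  }

  rocksdb::Status Append(const rocksdb::Slice& data,
                         const rocksdb::DataVerificationInfo& info) override {
    if (info.checksum.empty()) {
      return rocksdb::Status::Corruption("missing handoff checksum");
    }
    // A real file would recompute the checksum of `data` and compare it here.
    return Append(data);
  }

  rocksdb::Status Close() override { return rocksdb::Status::OK(); }
  rocksdb::Status Flush() override { return rocksdb::Status::OK(); }
  rocksdb::Status Sync() override { return rocksdb::Status::OK(); }

 private:
  std::string buffer_;
};
```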

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071

Test Plan: make check, added new unit test.

Reviewed By: anand1976

Differential Revision: D27177043

Pulled By: zhichao-cao

fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
2021-03-19 11:44:13 -07:00
Yanqin Jin
7ee41a5d25 Fix a test failure when built with ASSERT_STATUS_CHECKED=1 (#8075)
Summary:
As title.
Test plan
ASSERT_STATUS_CHECKED=1 make -j20 backupable_db_test error_handler_fs_test
./backupable_db_test
./error_handler_fs_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8075

Reviewed By: zhichao-cao

Differential Revision: D27173832

Pulled By: riversand963

fbshipit-source-id: 37dac50f7c89127804ff2572abddd4174642de30
2021-03-18 21:52:48 -07:00
Zhichao Cao
c810947184 Separate handling of WAL Sync io error with SST flush io error (#8049)
Summary:
In the previous codebase, if the WAL is used, all retryable IO errors are treated as hard errors, so writes are stalled. In this PR, retryable IO errors from WAL sync are separated from SST file flush IO errors. If WAL sync is OK and a retryable IO error only happens during SST flush, the error is mapped to a soft error, so the user can continue inserting into the memtable and appending to the WAL.

Also resolves a bug where, if WAL sync fails, the memtable state does not roll back because PickMemtable is called earlier than SyncClosedLog is called and checked.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049

Test Plan: added new unit test, make check

Reviewed By: anand1976

Differential Revision: D26965529

Pulled By: zhichao-cao

fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410
2021-03-18 14:33:16 -07:00
Peter Dillinger
e7a60d01b2 Revamp WriteController (#8064)
Summary:
WriteController had a number of issues:
* It could introduce a delay of 1ms even if the write rate never exceeded the
configured delayed_write_rate.
* The DB-wide delayed_write_rate could be exceeded in a number of ways
with multiple column families:
  * Wiping all pending delay "debts" when another column family joins
  the delay with GetDelayToken().
  * Resetting last_refill_time_ to (now + sleep amount) means each
  column family can write with delayed_write_rate for large writes.
  * Updating bytes_left_ for a partial refill without updating
  last_refill_time_ would essentially give out random bonuses,
  especially to medium-sized writes.

Now the code is much simpler, with these issues fixed. See comments in
the new code and new (replacement) tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064

Test Plan: new tests, better than old tests

Reviewed By: mrambacher

Differential Revision: D27064936

Pulled By: pdillinger

fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f
2021-03-18 09:47:31 -07:00
Zhichao Cao
08ec5e7321 Add the statistics and info log for Error handler (#8050)
Summary:
Add statistics and an info log for the error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume success; a histogram for the auto resume retry count in each recovery call.
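
A hedged sketch of reading the new counters and the histogram from a `Statistics` object; the ticker/histogram names are taken from this description and assumed to match their spellings in `statistics.h`.

```c++
// Hedged sketch: requires options.statistics = rocksdb::CreateDBStatistics()
// before opening the DB; ticker/histogram names assumed from the PR text.
#include <cstdint>
#include <memory>
#include "rocksdb/statistics.h"

void DumpErrorHandlerStats(const std::shared_ptr<rocksdb::Statistics>& stats) {
  if (!stats) return;
  uint64_t bg_errors =
      stats->getTickerCount(rocksdb::ERROR_HANDLER_BG_ERROR_COUNT);
  uint64_t bg_io_errors =
      stats->getTickerCount(rocksdb::ERROR_HANDLER_BG_IO_ERROR_COUNT);
  uint64_t auto_resumes =
      stats->getTickerCount(rocksdb::ERROR_HANDLER_AUTORESUME_COUNT);
  rocksdb::HistogramData retry_counts;
  stats->histogramData(rocksdb::ERROR_HANDLER_AUTORESUME_RETRY_COUNT,
                       &retry_counts);
  // bg_errors, bg_io_errors, auto_resumes and retry_counts.average can now be
  // exported to a monitoring system.
  (void)bg_errors;
  (void)bg_io_errors;
  (void)auto_resumes;
}
```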

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050

Test Plan: make check and add test to error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26990565

Pulled By: zhichao-cao

fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
2021-03-17 22:38:13 -07:00
Akanksha Mahajan
27d57a035e Use SST file manager to track blob files as well (#8037)
Summary:
Extend support to track blob files in SstFileManager.
This PR notifies SstFileManager whenever a new blob file is created
(via OnAddFile), notifies it when an obsolete blob file is deleted
(via OnDeleteFile), and deletes blob files via ScheduleFileDeletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037

Test Plan: Add new unit tests

Reviewed By: ltamasi

Differential Revision: D26891237

Pulled By: akankshamahajan15

fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
2021-03-17 20:44:49 -07:00
Yanqin Jin
0304352882 Fix a bug in key comparison when index type is kBinarySearchWithFirstKey (#8062)
Summary:
When timestamp is enabled, key comparison should take this into account.
In `BlockBasedTableReader::Get()` and `BlockBasedTableReader::MultiGet()`,
assume the target key is `key` and the timestamp upper bound is `ts`.
The highest key in the current block is (key, ts1), while the lowest key in the next
block is (key, ts2).
If
```
ts1 > ts > ts2
```
then
```
(key, ts1) < (key, ts) < (key, ts2)
```
It can be shown that if `Compare()` is used, then we will mistakenly skip the next
block. Instead, we should use `CompareWithoutTimestamp()`.

The majority of this PR makes some existing tests in `db_with_timestamp_basic_test.cc`
parameterized so that different index types can be tested. A new unit test is
also added for more coverage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8062

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D27057557

Pulled By: riversand963

fbshipit-source-id: c1062fa7c159ed600a1ad7e461531d52265021f1
2021-03-15 17:44:52 -07:00
Yanqin Jin
85d4f2c8b3 Move a test file to a better location (#8054)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8054

Test Plan: make check

Reviewed By: mrambacher

Differential Revision: D27017955

Pulled By: riversand963

fbshipit-source-id: 829497d507bc89afbe982f8a8cf3555e52fd7098
2021-03-15 15:03:27 -07:00
mrambacher
3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared_ptr incurs some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).
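
A standalone illustration (not RocksDB code) of the "short cut" pattern described above: a long-lived options struct keeps the owning shared_ptr, while hot paths use a cached raw pointer and avoid per-call refcount traffic.

```c++
// Illustrative only: the lifetime of `clock` is tied to the owning shared_ptr
// stored alongside it, mirroring the ImmutableDBOptions "short cut".
#include <memory>

struct Clock {
  long NowMicros() const { return 0; }  // stand-in for a real clock
};

struct ImmutableOptions {
  // The shared_ptr keeps the clock alive for the lifetime of the options...
  std::shared_ptr<Clock> clock_owner = std::make_shared<Clock>();
  // ...while hot-path code goes through a plain pointer "short cut".
  Clock* clock = clock_owner.get();
};

long HotPath(const ImmutableOptions& opts) {
  // No shared_ptr copy / atomic refcount update on this call path.
  return opts.clock->NowMicros();
}

int main() {
  ImmutableOptions opts;
  return static_cast<int>(HotPath(opts));
}
```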

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
Andrew Kryczka
b8f40f7f7b Deflake tests of compaction based on compensated file size (#8036)
Summary:
CompactionDeletionTriggerReopen was observed to be flaky recently:
https://app.circleci.com/pipelines/github/facebook/rocksdb/6030/workflows/787af4f3-b9f7-4645-8e8d-1fb0ebf05539/jobs/101451.

I went through it and the related tests and arrived at different
conclusions on what constraints we can expect on DB size. Some
constraints got looser and some got tighter. The particular constraint
that flaked got a lot looser so at least the flake linked above would have been prevented.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8036

Reviewed By: riversand963

Differential Revision: D26862566

Pulled By: ajkr

fbshipit-source-id: 3512b86b4fb41aeecae32e1c7382c03916d88d88
2021-03-14 20:25:42 -07:00
Levi Tamasi
b708b166dc Fix a harmless data race affecting two test cases (#8055)
Summary:
`DBTest.GetLiveBlobFiles` and `ObsoleteFilesTest.BlobFiles` both modify the
current `Version` in their setup phase, implicitly assuming that no other
threads would touch the `Version` while this is happening. The periodic
stats dumper thread violates this assumption; the patch fixes this by
disabling it in the affected test cases. (Note: the data race is
harmless in the sense that it only affects test code.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8055

Test Plan:
```
COMPILE_WITH_TSAN=1 make db_test -j24
gtest-parallel --repeat=10000 ./db_test --gtest_filter="*GetLiveBlobFiles"
COMPILE_WITH_TSAN=1 make obsolete_files_test -j24
gtest-parallel --repeat=10000 ./obsolete_files_test --gtest_filter="*BlobFiles"
```

Reviewed By: riversand963

Differential Revision: D27022715

Pulled By: ltamasi

fbshipit-source-id: b6cc77ed63d8bc1cbe0603522ff1a572182fc9ab
2021-03-12 16:44:35 -08:00
Peter Dillinger
119dda2195 Instantiate tests DBIteratorTestForPinnedData (#8051)
Summary:
A trial gtest upgrade discovered some parameterized tests missing instantiation. By some miracle, they still pass.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8051

Test Plan: thisisthetest

Reviewed By: mrambacher

Differential Revision: D27003684

Pulled By: pdillinger

fbshipit-source-id: cde1cab1551fb282f67d462d46574bd30bd5e61f
2021-03-12 12:31:29 -08:00
Yanqin Jin
82b3888433 Enable backward iterator for keys with user-defined timestamp (#8035)
Summary:
This PR does the following:

- Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet.
- Introduces a new helper API `Comparator::EqualWithoutTimestamp()`.
- Fix a typo in `SetTimestamp()`.
- Add/update unit tests

Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps.

1. Disable timestamp, run:
```
./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Results:
> Baseline
> - seekrandom [AVG    6 runs] : 96115 ops/sec;   53.2 MB/sec
> - seekrandom [MEDIAN 6 runs] : 98075 ops/sec;   54.2 MB/sec
>
> This PR
> - seekrandom [AVG    6 runs] : 95521 ops/sec;   52.8 MB/sec
> - seekrandom [MEDIAN 6 runs] : 96338 ops/sec;   53.3 MB/sec

2. Enable timestamp, run:
```
./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Result:
> Baseline: not supported
>
> This PR
> - seekrandom [AVG    6 runs] : 90514 ops/sec;   50.1 MB/sec
> - seekrandom [MEDIAN 6 runs] : 90834 ops/sec;   50.2 MB/sec
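
A hedged usage sketch of the reverse iteration exercised above; it assumes the DB/column family was opened with a timestamp-aware comparator (e.g. a fixed 8-byte timestamp suffix), and the names are illustrative.

```c++
// Hedged sketch: read keys backwards as of a timestamp upper bound.
#include <memory>
#include "rocksdb/db.h"

void ScanBackwards(rocksdb::DB* db, const rocksdb::Slice& ts_upper_bound) {
  rocksdb::ReadOptions read_opts;
  read_opts.timestamp = &ts_upper_bound;  // only versions <= this ts are visible
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_opts));
  for (it->SeekToLast(); it->Valid(); it->Prev()) {
    // it->key() / it->value() give the newest visible version of each user key.
  }
}
```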

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035

Reviewed By: ltamasi

Differential Revision: D26926668

Pulled By: riversand963

fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5
2021-03-10 11:15:46 -08:00
Yanqin Jin
64517d184a Make secondary instance use ManifestTailer (#7998)
Summary:
This PR

- adds a class `ManifestTailer` that inherits from `VersionEditHandlerPointInTime`. `ManifestTailer::Iterate()` can be called multiple times to tail the primary instance's MANIFEST and apply the changes to the secondary,
- updates the implementation of `ReactiveVersionSet::ReadAndApply` to use this class,
- removes unused code in version_set.cc,
- updates existing tests, e.g. removing deleted sync points from unit tests,
- adds a new test to address the bug in https://github.com/facebook/rocksdb/issues/7815.
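
A hedged usage sketch of the secondary-instance flow whose MANIFEST tailing `ManifestTailer` serves; the paths are placeholders.

```c++
// Hedged sketch: open a read-only secondary and repeatedly catch up with the
// primary; each TryCatchUpWithPrimary() call tails the primary's MANIFEST.
#include "rocksdb/db.h"

rocksdb::Status TailPrimary() {
  rocksdb::Options options;
  options.max_open_files = -1;  // typically recommended for secondary instances
  rocksdb::DB* secondary = nullptr;
  rocksdb::Status s = rocksdb::DB::OpenAsSecondary(
      options, "/path/to/primary_db", "/path/to/secondary_dir", &secondary);
  if (!s.ok()) return s;
  // Re-reads the primary's MANIFEST and applies any new version edits.
  s = secondary->TryCatchUpWithPrimary();
  delete secondary;
  return s;
}
```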

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7998

Test Plan:
make check
Existing and newly-added tests in version_set_test.cc and db_secondary_test.cc

Reviewed By: jay-zhuang

Differential Revision: D26926641

Pulled By: riversand963

fbshipit-source-id: 8d4dd15db0ba863c213f743e33b5a207e948c980
2021-03-10 10:59:44 -08:00
qinzuoyan
6fad38ebe8 Fix compile error (#7908)
Summary:
OS: Ubuntu 14.04
Compiler: GCC 4.9.4
Compile error:
```
db/forward_iterator.cc:996:62: error: declaration of ‘key’ shadows a member of 'this' [-Werror=shadow]
   auto cmp = [&](const FileMetaData* f, const Slice& key) -> bool {
```
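
A standalone illustration (not the RocksDB code) of the class of warning being fixed: a lambda parameter that reuses a member's name trips `-Werror=shadow` on older GCC, and renaming the parameter resolves it.

```c++
// With -Werror=shadow, naming the lambda parameter `key` would shadow the
// member below; using a different parameter name avoids the error.
#include <string>

struct ForwardScanner {
  std::string key;  // member that a lambda parameter named `key` would shadow

  bool LessThan(const std::string& target) const {
    auto cmp = [&](const std::string& candidate) -> bool {
      return candidate < target;  // `candidate` instead of `key`
    };
    return cmp(key);
  }
};

int main() { return ForwardScanner{"a"}.LessThan("b") ? 0 : 1; }
```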

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7908

Reviewed By: jay-zhuang

Differential Revision: D26899986

Pulled By: ajkr

fbshipit-source-id: 66b0b97aefd0f13a085e063491f8207366a9f848
2021-03-09 20:53:33 -08:00
Ed rodriguez
7381dad1b1 make:Fix c header prototypes (#7994)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7994

Reviewed By: jay-zhuang

Differential Revision: D26904603

Pulled By: ajkr

fbshipit-source-id: 0af92a51de895b40c7faaa4f0870b3f63279fe21
2021-03-09 20:44:23 -08:00
Peter Dillinger
0028e3398b Make format_version=5 new default (#8017)
Summary:
We haven't seen any production issues with the new Bloom filter, and
it's now > 1 year old (added in 6.6.0).

Updated check_format_compatible.sh and HISTORY.md
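
A minimal sketch of setting `format_version` explicitly via `BlockBasedTableOptions`, which is how users can either adopt the new default or pin an older, more compatible value.

```c++
// Minimal sketch: pin the table format_version instead of relying on the
// default (5 after this change); use an older value for pre-6.6.0 readers.
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.format_version = 5;  // the new default after this change
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```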

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017

Test Plan: tests updated (or prior bugs fixed)

Reviewed By: ajkr

Differential Revision: D26762197

Pulled By: pdillinger

fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2
2021-03-09 12:42:53 -08:00
fanrui03
67d72fb5dc Fix checkpoint stuck (#7921)
Summary:
## 1. Bug description:

When RocksDB Checkpoint, it may be stuck in `WaitUntilFlushWouldNotStallWrites` method.

## 2. Simple analysis of the reasons:

### 2.1 Configuration parameters:

```yaml
Compaction Style : Universal

max_write_buffer_number : 4
min_write_buffer_number_to_merge : 3
```

Checkpoint is usually very fast. When the Checkpoint is executed, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 immutable memtables, which is fewer than `min_write_buffer_number_to_merge`, they will not be flushed, but the code below will still be entered.

```c++
// method: GetWriteStallConditionAndCause
if (mutable_cf_options.max_write_buffer_number > 3 &&
    num_unflushed_memtables >=
        mutable_cf_options.max_write_buffer_number - 1) {
  return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit};
}
```

code link: fbed72f03c/db/column_family.cc (L847)

Checkpoint thought there was a FlushJob, but there wasn't, so it will wait forever.

### 2.2 solution:

Tighten the condition: only wait when the number of immutable memtables is >= `min_write_buffer_number_to_merge`.

If there are other better solutions, you can correct me.

### 2.3 Code that can reproduce the problem:

https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java

## 3. Interesting point

This bug will be triggered only when `the number of sorted runs >= level0_file_num_compaction_trigger`.

This is because there is a break in `WaitUntilFlushWouldNotStallWrites`:

```c++
if (cfd->imm()->NumNotFlushed() <
        cfd->ioptions()->min_write_buffer_number_to_merge &&
    vstorage->l0_delay_trigger_count() <
        mutable_cf_options.level0_file_num_compaction_trigger) {
  break;
}
```

code link: fbed72f03c/db/db_impl/db_impl_compaction_flush.cc (L1974)

Universal compaction may have `l0_delay_trigger_count() >= level0_file_num_compaction_trigger`, so this bug is triggered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7921

Reviewed By: jay-zhuang

Differential Revision: D26900559

Pulled By: ajkr

fbshipit-source-id: 133c1252dad7393753f04a47590b68c7d8e670df
2021-03-09 02:21:25 -08:00
Andrew Kryczka
0ff0b625a1 Deflake DBTest2.PartitionedIndexUserToInternalKey on ppc64le (#8044)
Summary:
For some reason I still cannot figure out, the manual flush in this test
was sometimes producing a third tiny file. I saw it a bunch of times on
ppc64le, but even running a qemu system with that architecture (and
playing with various other options) could not repro. However we did get
an instrumented Travis run to confirm the problem is indeed a third tiny
file - https://travis-ci.org/github/facebook/rocksdb/jobs/761986592. We
can avoid it by filling memtables less full and using manual flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8044

Reviewed By: akankshamahajan15

Differential Revision: D26892635

Pulled By: ajkr

fbshipit-source-id: 775c04176931cf01d07cc78fb82cfe3a11beebcf
2021-03-08 14:47:56 -08:00
Levi Tamasi
cb25bc1128 Update compaction statistics to include the amount of data read from blob files (#8022)
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26801199

Pulled By: ltamasi

fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
2021-03-04 00:43:48 -08:00
Yanqin Jin
72d1e258cd Possibly bump NUMBER_OF_RESEEKS_IN_ITERATION (#8015)
Summary:
When changing db iterator direction, we may perform a reseek.
Therefore, we should bump the NUMBER_OF_RESEEKS_IN_ITERATION counter.
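
A hedged sketch of observing the counter this change bumps (requires statistics to be enabled on the DB options).

```c++
// Hedged sketch: read NUMBER_OF_RESEEKS_IN_ITERATION after an iteration-heavy
// workload; assumes options.statistics = rocksdb::CreateDBStatistics() was set.
#include <cstdint>
#include <memory>
#include "rocksdb/statistics.h"

uint64_t Reseeks(const std::shared_ptr<rocksdb::Statistics>& stats) {
  return stats ? stats->getTickerCount(rocksdb::NUMBER_OF_RESEEKS_IN_ITERATION)
               : 0;
}
```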

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8015

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D26755415

Pulled By: riversand963

fbshipit-source-id: 211f51f1a454bcda768fc46c0dce51edeb7f05fe
2021-03-02 22:41:04 -08:00
Levi Tamasi
a46f080cce Break down the amount of data written during flushes/compactions per file type (#8013)
Summary:
The patch breaks down the "bytes written" (as well as the "number of output files")
compaction statistics into two, so the values are logged separately for table files
and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
This will also come in handy for fixing the write amplification statistics, which currently
do not consider the amount of data read from blob files during compaction. (This will
be fixed by an upcoming patch.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26742156

Pulled By: ltamasi

fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
2021-03-02 09:48:00 -08:00
Akanksha Mahajan
f19612970d Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003)
Summary:
The checkpointing logic supports passing file level checksums
to the copy_file_cb callback function which is used by the backup code
for detecting corruption during file copies.
However, this is currently implemented only for table files.

This PR extends the checksum retrieval to blob files as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003

Test Plan: Add new test units

Reviewed By: ltamasi

Differential Revision: D26680701

Pulled By: akankshamahajan15

fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5
2021-03-01 20:07:07 -08:00
Yanqin Jin
c370d8aa12 Remove unused/incorrect fwd declaration (#8002)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8002

Reviewed By: anand1976

Differential Revision: D26659354

Pulled By: riversand963

fbshipit-source-id: 6b464dbea9fd8240ead8cc5af393f0b78e8f9dd1
2021-02-25 23:07:31 -08:00
Yanqin Jin
cef4a6c49f Compaction filter support for (new) BlobDB (#7974)
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.

The compaction filter's custom logic can operate on blobs.
To do so, the application needs to subclass the `CompactionFilter` abstract class and implement the `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if the application's custom logic relies solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.
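
A minimal hedged sketch (not the test code referenced above) of a compaction filter that could be passed to the integrated BlobDB; it drops entries whose values start with a given prefix.

```c++
// Illustrative filter: drop values with a "tmp_" prefix; with the integrated
// BlobDB, FilterV2() also sees values read back from blob files.
#include <string>
#include "rocksdb/compaction_filter.h"

class DropTempValuesFilter : public rocksdb::CompactionFilter {
 public:
  Decision FilterV2(int /*level*/, const rocksdb::Slice& /*key*/,
                    ValueType value_type, const rocksdb::Slice& existing_value,
                    std::string* /*new_value*/,
                    std::string* /*skip_until*/) const override {
    if (value_type == ValueType::kValue &&
        existing_value.starts_with("tmp_")) {
      return Decision::kRemove;
    }
    return Decision::kKeep;
  }

  const char* Name() const override { return "DropTempValuesFilter"; }
};
```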

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D26509280

Pulled By: riversand963

fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
2021-02-25 16:32:35 -08:00
sherriiiliu
e017af15c1 Fix testcase failures on windows (#7992)
Summary:
Fixed 5 test case failures found on Windows 10/Windows Server 2016:
1. In `flush_job_test`, the DestroyDir function fails in the destructor because some file handles are still being held by VersionSet. This happens on Windows Server 2016, so we need to manually reset the versions_ pointer to release all file handles.
2. In the `StatsHistoryTest.InMemoryStatsHistoryPurging` test, the capping memory cost of stats_history_size on Windows becomes 14000 bytes with the latest changes, not just 13000 bytes.
3. In the `SSTDumpToolTest.RawOutput` test, the output file handle is not closed at the end.
4. In the `FullBloomTest.OptimizeForMemory` test, ROCKSDB_MALLOC_USABLE_SIZE is undefined on Windows, so `total_mem` is always equal to `total_size`. The internal memory fragmentation assertion does not apply in this case.
5. In the `BlockFetcherTest.FetchAndUncompressCompressedDataBlock` test, XPRESS cannot reach an 87.5% compression ratio with the original CreateTable method, so I append extra zeros to the string value to enhance the compression ratio. Besides, since XPRESS allocates memory internally and thus does not support custom allocator verification, we skip the allocator verification for XPRESS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7992

Reviewed By: jay-zhuang

Differential Revision: D26615283

Pulled By: ajkr

fbshipit-source-id: 3632612f84b99e2b9c77c403b112b6bedf3b125d
2021-02-23 14:35:06 -08:00
Akanksha Mahajan
46cf5fbfdd Extend VerifyFileChecksums API for blob files (#7979)
Summary:
Extend the VerifyFileChecksums API to verify blob files when
use_file_checksum is enabled.
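
A hedged usage sketch of the verification path this extends; it assumes the DB was created with a file checksum generator factory so that checksums are stored in the MANIFEST.

```c++
// Hedged sketch: enable full-file checksums at DB creation time, then verify
// all table (and now blob) files against the checksums in the MANIFEST.
#include "rocksdb/db.h"
#include "rocksdb/file_checksum.h"
#include "rocksdb/options.h"

rocksdb::Options ChecksummedOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.file_checksum_gen_factory =
      rocksdb::GetFileChecksumGenCrc32cFactory();
  return options;
}

rocksdb::Status VerifyAllFiles(rocksdb::DB* db) {
  return db->VerifyFileChecksums(rocksdb::ReadOptions());
}
```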

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7979

Test Plan: New unit test db_blob_corruption_test

Reviewed By: ltamasi

Differential Revision: D26534040

Pulled By: akankshamahajan15

fbshipit-source-id: 7dc5951a3df9d265ea1265e0122b43c966856ade
2021-02-22 22:09:22 -08:00
Andrew Kryczka
d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how many data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). The limit is not strict since we currently buffer more than just data blocks -- keys are buffered as well. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
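
A hedged configuration sketch of the new knob alongside the existing dictionary options; the numbers are illustrative only, not recommendations.

```c++
// Hedged sketch: cap how much data is buffered for dictionary sampling.
#include "rocksdb/options.h"

rocksdb::Options DictCompressionOptions() {
  rocksdb::Options options;
  options.compression = rocksdb::kZSTD;
  options.compression_opts.max_dict_bytes = 16 * 1024;        // dictionary size
  options.compression_opts.zstd_max_train_bytes = 64 * 1024;  // training sample
  // New in this PR: switch to unbuffered mode after ~64 MB of buffered blocks.
  options.compression_opts.max_dict_buffer_bytes = 64 * 1024 * 1024;
  return options;
}
```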

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
mrambacher
4bc9df9459 Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936)
Summary:
Added a "only_mutable_options" flag to the ConfigOptions.  When set, the Configurable methods will only look at/update options that are marked as kMutable.

Fixed DB::SetOptions to allow for the update of any mutable TableFactory options.  Fixes https://github.com/facebook/rocksdb/issues/7385.

Added tests for the new flag.  Updated HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936

Reviewed By: akankshamahajan15

Differential Revision: D26389646

Pulled By: mrambacher

fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
2021-02-19 10:29:02 -08:00
Zhichao Cao
b0fd1cc45a Introduce a new trace file format (v 0.2) for better extension (#7977)
Summary:
The trace file record and payload encoding used to be fixed, which made backward-compatible extension complex. This PR introduces a new trace file format, which makes it easier to add new entries to the payload and does not have backward-compatibility issues. V0.1 is still supported in this PR. Also added tracing of lower_bound and upper_bound for iterators.
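
A hedged sketch of producing a trace file through the public tracing API; the helper names (`NewFileTraceWriter`, `StartTrace`, `EndTrace`) are used as I understand them from `trace_reader_writer.h` and `db.h`.

```c++
// Hedged sketch: capture a workload trace into a file in the new format.
#include <memory>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/trace_reader_writer.h"

rocksdb::Status TraceWorkload(rocksdb::DB* db, const std::string& trace_path) {
  std::unique_ptr<rocksdb::TraceWriter> writer;
  rocksdb::Status s = rocksdb::NewFileTraceWriter(
      rocksdb::Env::Default(), rocksdb::EnvOptions(), trace_path, &writer);
  if (!s.ok()) return s;
  s = db->StartTrace(rocksdb::TraceOptions(), std::move(writer));
  if (!s.ok()) return s;
  // ... run the workload to be captured (Gets/Puts/iterators with bounds) ...
  return db->EndTrace();
}
```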

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977

Test Plan: make check. tested with old trace file in replay and analyzing.

Reviewed By: anand1976

Differential Revision: D26529948

Pulled By: zhichao-cao

fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76
2021-02-18 23:05:35 -08:00
Jay Zhuang
59ba104e4a Fix txn MultiGet() return un-committed data with snapshot (#7963)
Summary:
TransactionDB uses a read callback to filter out uncommitted data before
a snapshot, but the `MultiGet()` API doesn't use it at all, which causes
unwanted data to be returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963

Test Plan: Added unittest to reproduce

Reviewed By: anand1976

Differential Revision: D26455851

Pulled By: jay-zhuang

fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
2021-02-18 08:49:00 -08:00
Akanksha Mahajan
6a85aea5b1 Bug fix for status overridden by Status::NotFound in db_impl_readonly (#7972)
Summary:
Bug fix for the status returned being overridden by Status::NotFound in
DBImpl::OpenForReadOnlyCheckExistence. This was causing some service
owners to misinterpret the actual error and take appropriate steps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7972

Reviewed By: riversand963

Differential Revision: D26499598

Pulled By: akankshamahajan15

fbshipit-source-id: 05e9fedbe2a2e0e53135760f8ff578a2816d2b8e
2021-02-17 19:35:57 -08:00
Akanksha Mahajan
ea8bb82fc7 Add support for IOTracing in blob files (#7958)
Summary:
Add support for IOTracing in blob files

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7958

Test Plan:
Added a new test and manually checked the trace file to verify that blob
file reads and writes are recorded.

Reviewed By: ltamasi

Differential Revision: D26415950

Pulled By: akankshamahajan15

fbshipit-source-id: 49c2859b3a4f8307e7cb69a92704403a4da46d44
2021-02-16 09:49:10 -08:00
Jay Zhuang
9df78a94f1 Disable flaky error_handler_fs_test that could hang (#7964)
Summary:
The test hangs at 95013df278/db/error_handler_fs_test.cc (L947)
It seems db.mutex_ is locked twice in the test:
cf160b98e1/db/db_impl/db_impl_compaction_flush.cc (L3208)
0a9a05ae12/db/db_impl/db_impl.cc (L469)
As it's just a test issue, disable it for now until the test is fixed.

The hang could be reproduced by:
`gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.CompactionWriteFileScopeError -r 1000`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7964

Reviewed By: zhichao-cao

Differential Revision: D26447325

Pulled By: jay-zhuang

fbshipit-source-id: 72f6a346458e059d10e9cc3347bd6bde040cf89e
2021-02-15 09:45:23 -08:00
Zhichao Cao
d1c510baec Handoff checksum Implementation (#7523)
Summary:
In PR https://github.com/facebook/rocksdb/issues/7419, we introduced the new Append and PositionedAppend APIs to WritableFile at the FileSystem level, which enable RocksDB to pass data verification information (e.g., the checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that files created via WritableFileWriter can pass the checksum to the storage layer. To control which file types should apply checksum handoff, we add checksum_handoff_file_types to DBOptions. Users can use this option to control which file types (currently supported: kLogFile, kTableFile, kDescriptorFile) should use the new Append and PositionedAppend APIs to hand off the verification information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523

Test Plan: add new unit test, pass make check/ make asan_check

Reviewed By: pdillinger

Differential Revision: D24313271

Pulled By: zhichao-cao

fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
2021-02-10 22:20:32 -08:00
Jay Zhuang
cf160b98e1 Add full_history_ts_low option to compaction (#7884)
Summary:
The full_history_ts_low is used for user-defined timestamp GC
compaction, which is introduced in https://github.com/facebook/rocksdb/issues/7740, https://github.com/facebook/rocksdb/issues/7657 and https://github.com/facebook/rocksdb/issues/7655.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7884

Reviewed By: ltamasi

Differential Revision: D25982553

Pulled By: jay-zhuang

fbshipit-source-id: 36303d412d65b5d8166b6da24fa21ad85adbabee
2021-02-08 13:45:48 -08:00
Levi Tamasi
974458891c Revert "Turn on memtable bloom filter by default. (#6584)" (#7939)
Summary:
This reverts commit ee79a28963.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939

Reviewed By: siying

Differential Revision: D26298564

Pulled By: ltamasi

fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd
2021-02-06 22:34:30 -08:00
sdong
ee79a28963 Turn on memtable bloom filter by default. (#6584)
Summary:
Memtable bloom filter is useful in many use cases. Turning it on by default with a conservative 1.5% of memory can benefit more use cases than it impacts.
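
A hedged sketch of pinning the memtable Bloom filter setting explicitly, so behavior does not depend on which default a given release ships (this commit and its later revert toggle the default). That the knob is `memtable_prefix_bloom_size_ratio` with a 1.5% value is an assumption based on the summary above.

```c++
// Hedged sketch: option name and value assumed from the summary, not verified.
#include "rocksdb/options.h"

rocksdb::ColumnFamilyOptions MemtableBloomOptions() {
  rocksdb::ColumnFamilyOptions cf_options;
  cf_options.memtable_prefix_bloom_size_ratio = 0.015;  // ~1.5% of write buffer
  return cf_options;
}
```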

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584

Test Plan: Run all existing tests.

Reviewed By: pdillinger

Differential Revision: D20626739

fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
2021-02-05 12:59:46 -08:00
Stanislav Tkach
3feee6db17 Add get/set deadline and io_timeout C functions (read options) (#7914)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7914

Reviewed By: jay-zhuang

Differential Revision: D26184409

Pulled By: ajkr

fbshipit-source-id: 8e30faac5223ec80c22e2b617af67775322065d8
2021-02-04 17:00:58 -08:00
Levi Tamasi
e5311a8ea4 Fix a SingleDelete related optimization for blob indexes (#7904)
Summary:
There is a small `SingleDelete` related optimization in the
`CompactionIterator` code: when a `SingleDelete`-`Put` pair is preserved
solely for the purposes of transaction conflict checking, the value
itself gets cleared. (This is referred to as "optimization 3" in the
`CompactionIterator` code.) Though the rest of the code got updated to
support `SingleDelete`'ing blob indexes, this chunk was apparently
missed, resulting in an assertion failure (or `ROCKS_LOG_FATAL` in release
builds) when triggered. Note: in addition to clearing the value, we also
need to update the type of the KV to regular value when dealing with
blob indexes here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7904

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D26118009

Pulled By: ltamasi

fbshipit-source-id: 6bf78043d20265e2b15c2e1ab8865025040c42ae
2021-01-29 12:41:25 -08:00
Andrew Kryczka
78ee8564ad Integrity protection for live updates to WriteBatch (#7748)
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).

The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.

When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.
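
A hedged usage sketch; the constructor argument order (reserved bytes, max bytes, protection bytes per key) is an assumption based on the description above, not copied from the header.

```c++
// Hedged sketch: enable 8-byte-per-key integrity protection on a WriteBatch.
// The exact constructor parameter order is assumed, not verified.
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

rocksdb::Status ProtectedWrite(rocksdb::DB* db) {
  rocksdb::WriteBatch batch(/*reserved_bytes=*/0, /*max_bytes=*/0,
                            /*protection_bytes_per_key=*/8);
  rocksdb::Status s = batch.Put("key", "value");
  if (!s.ok()) return s;
  // Corruption of the batch contents between here and memtable insertion is
  // detected when the entry is encoded into the MemTable buffer.
  return db->Write(rocksdb::WriteOptions(), &batch);
}
```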

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748

Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.

Reviewed By: pdillinger

Differential Revision: D25754492

Pulled By: ajkr

fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
2021-01-29 12:18:58 -08:00
mrambacher
4a09d632c4 Remove Legacy and Custom FileWrapper classes from header files (#7851)
Summary:
Removed the uses of the Legacy FileWrapper classes from the source code.  The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem.

Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use of the CustomEnv class.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851

Reviewed By: anand1976

Differential Revision: D26114816

Pulled By: mrambacher

fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150
2021-01-28 22:10:32 -08:00
mrambacher
0a9a05ae12 Make builds reproducible (#7866)
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035

Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
      - If the GIT branch is "clean" (no changes), the date of the last git commit
      - If the branch is not clean, the current date
 - Added APIs to access the "build information", rather than accessing the strings directly.

The build_version.cc file is now regenerated whenever the library objects are rebuilt.

Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version
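
A hedged sketch of the new accessors, assumed to be declared in `rocksdb/version.h` as `GetRocksVersionAsString` and `GetRocksBuildInfoAsString`.

```c++
// Hedged sketch: print the library version and build information through the
// new accessors instead of reading the strings directly.
#include <iostream>
#include "rocksdb/version.h"

int main() {
  std::cout << rocksdb::GetRocksVersionAsString() << "\n";
  std::cout << rocksdb::GetRocksBuildInfoAsString("my_tool", /*verbose=*/true)
            << "\n";
  return 0;
}
```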

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866

Reviewed By: pdillinger

Differential Revision: D26086565

Pulled By: mrambacher

fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
2021-01-28 17:42:16 -08:00
Levi Tamasi
c696f27432 Accumulate blob file additions in VersionEdit during recovery (#7903)
Summary:
During recovery, RocksDB performs a kind of dummy flush; namely, entries
from the WAL are added to memtables, which then get written to SSTs and
blob files (if enabled) just like during a regular flush. Note that
multiple memtables might be flushed during recovery for the same column
family, for example, if the DB is reopened with a lower write buffer size,
and therefore, we need to make sure to collect all SST and blob file
additions. The patch fixes a bug in the earlier logic which resulted in
later blob file additions overwriting earlier ones.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7903

Test Plan: Added a unit test and ran `db_stress`.

Reviewed By: jay-zhuang

Differential Revision: D26110847

Pulled By: ltamasi

fbshipit-source-id: eddb50a608a88f54f3cec3a423de8235aba951fd
2021-01-27 18:46:15 -08:00
Zhichao Cao
95013df278 Do not set bg error for compaction in retryable IO Error case (#7899)
Summary:
When a retryable IO error occurs during compaction, it is mapped to a soft error and the BG error is set. However, auto resume is not called to clear the soft error since compaction will reschedule by itself. With this change, when a retryable IO error occurs during compaction, the BG error is not set; the user will be informed of the error via EventHelper.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899

Test Plan: tested with error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26094097

Pulled By: zhichao-cao

fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f
2021-01-27 17:58:12 -08:00
Jay Zhuang
c6ff4c0b70 Fix deadlock in fs_test.WALWriteRetryableErrorAutoRecover1 (#7897)
Summary:
The recovery thread could hold db.mutex, which is needed by the sync
write in the main thread.
Make sure the write is done before the recovery thread starts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7897

Test Plan: `gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.WALWriteRetryableErrorAutoRecover1 -r 10000 --workers=200`

Reviewed By: zhichao-cao

Differential Revision: D26082933

Pulled By: jay-zhuang

fbshipit-source-id: 226fc49228c0e5903f86ff45cc3fed3080abdb1f
2021-01-26 17:02:03 -08:00
Jay Zhuang
9425acacce Fix flaky error_handler_fs_test.MultiDBCompactionError (#7896)
Summary:
The error recovery thread may outlive the DBImpl object, causing access
to the released DBImpl.mutex. Close SstFileManager before closing the DB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7896

Test Plan:
The issue can be reproduced by adding a sleep in the recovery code.
The tests pass with the sleep added.

Reviewed By: zhichao-cao

Differential Revision: D26076655

Pulled By: jay-zhuang

fbshipit-source-id: 0d9cc5639c12fcfc001427015e75a9736f33cd96
2021-01-26 11:00:12 -08:00
mrambacher
12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class in RocksDB.  This class contains the time-related functions of an Env, and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
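
A hedged sketch of consuming the new class directly, assuming a `SystemClock::Default()` accessor and a `NowMicros()` method mirroring their `Env` counterparts.

```c++
// Hedged sketch: time a block of work via SystemClock rather than Env.
#include <cstdint>
#include "rocksdb/system_clock.h"

uint64_t TimeSomething() {
  auto clock = rocksdb::SystemClock::Default();
  uint64_t start = clock->NowMicros();
  // ... work to be measured ...
  return clock->NowMicros() - start;
}
```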

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Akanksha Mahajan
1d226018af In IOTracing, add filename with each operation in trace file. (#7885)
Summary:
1. In IOTracing, add the filename to each IOTrace record. The filename is stored in the file object (tracing wrappers).
2. Change the logic for figuring out which additional information (file_size,
   length, offset, etc.) needs to be stored with each operation,
   which differs between operations.
   When new information is added in the future (depending on the operation),
   this change keeps those additions simple.

Logic: io_op_data is added to IOTraceRecord, and its
bitwise positions represent which additional information needs
to be added to the record, per the enum IOTraceOp. Values in IOTraceOp represent bit positions.
So if length and offset need to be stored (IOTraceOp::kIOLen
is 1 and IOTraceOp::kIOOffset is 2), positions 1 and 2 (from the rightmost bit) will be set
and io_op_data will contain 110.
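
A standalone illustration of that encoding: each IOTraceOp value is a bit position in io_op_data. Only kIOLen and kIOOffset are named in this summary; the enum below is a stand-in, not the actual definition.

```c++
// Illustrative only: set the bits for "length" and "offset" and get 0b110.
#include <cstdint>

enum IOTraceOp : int { kIOLen = 1, kIOOffset = 2 };

int main() {
  uint64_t io_op_data = 0;
  io_op_data |= (uint64_t{1} << kIOLen);     // a length follows in the record
  io_op_data |= (uint64_t{1} << kIOOffset);  // an offset follows in the record
  // Readers test each bit to know which optional fields to parse.
  return io_op_data == 6 ? 0 : 1;  // 0b110
}
```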

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7885

Test Plan: Updated io_tracer_test and verified the trace file manually.

Reviewed By: anand1976

Differential Revision: D25982353

Pulled By: akankshamahajan15

fbshipit-source-id: ebfc5539cc0e231d7794a6b42b73f5403e360b22
2021-01-25 14:37:35 -08:00
Levi Tamasi
431e8afba7 Do not explicitly flush blob files when using the integrated BlobDB (#7892)
Summary:
In the original stacked BlobDB implementation, which writes blobs to blob files
immediately and treats blob files as logs, it makes sense to flush the file after
writing each blob to protect against process crashes; however, in the integrated
implementation, which builds blob files in the background jobs, this unnecessarily
reduces performance. This patch fixes this by simply adding a `do_flush` flag to
`BlobLogWriter`, which is set to `true` by the stacked implementation and to `false`
by the new code. Note: the change itself is trivial but the tests needed some work;
since in the new implementation, blobs are now buffered, adding a blob to
`BlobFileBuilder` is no longer guaranteed to result in an actual I/O. Therefore, we can
no longer rely on `FaultInjectionTestEnv` when testing failure cases; instead, we
manipulate the return values of I/O methods directly using `SyncPoint`s.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7892

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D26022814

Pulled By: ltamasi

fbshipit-source-id: b3dce419f312137fa70d84cdd9b908fd5d60d8cd
2021-01-25 13:32:33 -08:00
Matthew Von-Maszewski
12a8be1d44 MergeHelper::FilterMerge() calling ElapsedNanosSafe() upon exit even … (#7867)
Summary:
…when unused.  Causes many calls to clock_gettime, impacting performance.

Was looking for something else via Linux "perf" command when I spotted heavy usage of clock_gettime during a compaction.  Our product heavily uses the rocksdb::Options::merge_operator.  MergeHelper::FilterMerge() properly tests if timing is enabled/disabled upon entry, but not on exit.  This patch fixes the exit.

Note:  the entry test also verifies if "nullptr!=stats_".  This test is redundant to code within ShouldReportDetailedTime().  Therefore I omitted it in my change.

merge_test.cc updated with test that shows failure before merge_helper.cc change ... and fix after change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7867

Reviewed By: jay-zhuang

Differential Revision: D25960175

Pulled By: ajkr

fbshipit-source-id: 56e66d7eb6ae5eae89c8e0d5a262bd2905a226b6
2021-01-21 13:13:02 -08:00
Andrew Kryczka
e18a4df62a workaround race conditions during PeriodicWorkScheduler registration (#7888)
Summary:
This provides a workaround for two race conditions that will be fixed in
a more sophisticated way later. This PR:

(1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe.
(2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888

Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881

Reviewed By: jay-zhuang

Differential Revision: D25990891

Pulled By: ajkr

fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38
2021-01-21 08:50:38 -08:00
Levi Tamasi
2d37830e44 Make blob related VersionEdit tags unignorable (#7886)
Summary:
BlobFileAddition and BlobFileGarbage should not be in the ignorable tag
range, since if they are present in the MANIFEST, users cannot downgrade
to a RocksDB version that does not understand them without losing access
to the data in the blob files. The patch moves these two tags to the
unignorable range; this should still be safe at this point, since the
integrated BlobDB project is still work in progress and thus there
shouldn't be any ignorable BlobFileAddition/BlobFileGarbage tags out
there.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7886

Test Plan: `make check`

Reviewed By: cheng-chang

Differential Revision: D25980956

Pulled By: ltamasi

fbshipit-source-id: 13cf5bd61d77f049b513ecd5ad0be8c637e40a9d
2021-01-20 20:29:04 -08:00
Cheng Chang
e44948295e Make it able to ignore WAL related VersionEdits in older versions (#7873)
Summary:
Although the tags for `WalAddition`, `WalDeletion` are after `kTagSafeIgnoreMask`, to actually be able to skip these entries in older versions of RocksDB, we require that they are encoded with their encoded size as the prefix. This requirement is not met in the current codebase, so a downgraded DB may fail to open if these entries exist in the MANIFEST.

If a DB wants to downgrade, and its MANIFEST contains `WalAddition` or `WalDeletion`, it can set `track_and_verify_wals_in_manifest` to `false`, then restart twice, then downgrade. On the first restart, a new MANIFEST will be created with a `WalDeletion` indicating that all previously tracked WALs are removed from MANIFEST. On the second restart, since there is  no tracked WALs in MANIFEST now, a new MANIFEST will be created with neither `WalAddition` nor `WalDeletion`. Then the DB can downgrade.

Tags for `BlobFileAddition`, `BlobFileGarbage` also have the same problem, but this PR focuses on solving the problem for WAL edits.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7873

Test Plan: Added a `VersionEditTest::IgnorableTags` unit test to verify all entries with tags larger than `kTagSafeIgnoreMask` can actually be skipped and won't affect parsing of other entries.

Reviewed By: ajkr

Differential Revision: D25935930

Pulled By: cheng-chang

fbshipit-source-id: 7a02fdba4311d6084328c14aed110a26d08c3efb
2021-01-19 19:27:53 -08:00
Vladimir Maksimovski
4db58bcfb2 Fix write-ahead log file size overflow (#7870)
Summary:
The WAL's file size is stored as an unsigned 64 bit integer.

In db_info_dumper.cc, this integer gets converted to a string. Since 2^64 is approximately 10^19, we need 20 digits to represent the integer correctly. To store the decimal representation, we need 21 bytes (+1 due to the '\0' terminator at the end). The code previously used 16 bytes, which would overflow if the log is really big (>1 petabyte).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7870

Reviewed By: ajkr

Differential Revision: D25938776

Pulled By: jay-zhuang

fbshipit-source-id: 6ee9e21ebd65d297ea90fa1e7e74f3e1c533299d
2021-01-19 13:47:48 -08:00
darionyaphet
2fb6d9337f Using emplace_back replace push_back (#7568)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7568

Reviewed By: akankshamahajan15

Differential Revision: D24437383

Pulled By: jay-zhuang

fbshipit-source-id: 7c9b3c4944b959aa7796c53b410c2b1055dc5641
2021-01-15 16:56:41 -08:00
Jay Zhuang
77b4bfe511 Disable PeriodicWorkScheduler during RateLimited test (#7810)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7810

Reviewed By: akankshamahajan15

Differential Revision: D25695454

Pulled By: jay-zhuang

fbshipit-source-id: 963d11f38a959de7227ba2be15795af2792413a6
2021-01-11 15:01:52 -08:00
Jay Zhuang
eccc47e81c Fix tsan options_test (#7845)
Summary:
Minor TSAN issue where a counter could be bumped concurrently:
https://app.circleci.com/pipelines/github/facebook/rocksdb/5431/workflows/79312c7c-5815-4f07-8836-94625db8e33e/jobs/81619

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7845

Reviewed By: akankshamahajan15

Differential Revision: D25851472

Pulled By: jay-zhuang

fbshipit-source-id: 74cc8797ac503413bec27a30e5d1f055379777e8
2021-01-11 10:17:57 -08:00
Adam Retter
4926b33742 Improvements to Env::GetChildren (#7819)
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. Whether `.` or `..` appear is non-portable, depending on the operating system and the file system. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html

There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This change removes the need for those at the source.

Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API
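
A hedged usage sketch of `Env::GetChildren` after this change: the result no longer contains `.` or `..`, so callers need no special-casing.

```c++
// Hedged sketch: list a directory's contents via the default Env.
#include <string>
#include <vector>
#include "rocksdb/env.h"

rocksdb::Status ListDbDir(const std::string& dir,
                          std::vector<std::string>* files) {
  rocksdb::Status s = rocksdb::Env::Default()->GetChildren(dir, files);
  // No need to filter out "." and ".." here anymore.
  return s;
}
```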

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819

Reviewed By: ajkr

Differential Revision: D25837394

Pulled By: jay-zhuang

fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
2021-01-09 09:44:34 -08:00
Zhichao Cao
48c0843e69 Treat File Scope Write IO Error the same as Retryable IO Error (#7840)
Summary:
In RocksDB, when an IO error happens, flags on the IOStatus can be set. If the IOStatus is set as "File Scope IO Error", it indicates that the error is constrained to the file level. Since RocksDB does not continue writing data to a file when any IO error happens, a File Scope IO Error can be treated the same as a Retryable IO Error. This adds logic to ErrorHandler::SetBGError to include File Scope IO Errors in its error handling logic, the same as Retryable IO Errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7840

Test Plan: added new unit tests in error_handler_fs_test. make check

Reviewed By: anand1976

Differential Revision: D25820481

Pulled By: zhichao-cao

fbshipit-source-id: 69cabd3d010073e064d6142ce1cabf341b8a6806
2021-01-07 16:31:33 -08:00
mrambacher
cc2a180d00 Add more tests to the ASC pass list (#7834)
Summary:
Fixed the following  to now pass ASC checks:
* `ttl_test`
* `blob_db_test`
* `backupable_db_test`,
* `delete_scheduler_test`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7834

Reviewed By: jay-zhuang

Differential Revision: D25795398

Pulled By: ajkr

fbshipit-source-id: a10037817deda4fc7cbb353a2e00b62ed89b6476
2021-01-07 15:22:53 -08:00
Adam Retter
6e0f62f2b6 Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715)
Summary:
Third batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_compaction_filter_test
* db_compaction_test
* db_dynamic_level_test
* db_inplace_update_test
* db_sst_test
* db_tailing_iter_test
* db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715

Reviewed By: jay-zhuang

Differential Revision: D25806896

Pulled By: pdillinger

fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
2021-01-06 14:15:02 -08:00
Zhichao Cao
5792b73fdc Fixed the swallowed IOStatus in Compaction Job introduced in PR 7718 (#7838)
Summary:
The IOStatus of TableBuilder is returned by copying the IO status from builder->io_status(). PR https://github.com/facebook/rocksdb/issues/7718 swallowed the IO status, which causes write IO errors to become non-retryable so that no auto resume logic will handle them. Roll back to the previous implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7838

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D25795387

Pulled By: zhichao-cao

fbshipit-source-id: bc35e69e0b71aa4148a6ed76f073357041b8e372
2021-01-06 13:18:00 -08:00
mrambacher
e628f59e87 Create a CustomEnv class; Add WinFileSystem; Make LegacyFileSystemWrapper private (#7703)
Summary:
This PR does the following:
-> Creates a WinFileSystem class.  This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class.  A CustomEnv is an Env that takes a FileSystem as constructor argument.  I believe there will only ever be two implementations of this class (PosixEnv and WinEnv).  There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.

With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv).  These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem.  These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv).  These classes take in an Env and a FileSystem.  The core env functions are re-directed to the wrapped env.  The file system calls are redirected to the input file system
- Legacy Wrapped Env classes.  These classes take in an Env input (but no FileSystem).  The core env functions are re-directed to the wrapped env.  A "Legacy File System" is created using this env and the file system calls directed to the env itself.

With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created.  Any other use of the PosixEnv is via another wrapped env.  This cleans up some of the issues with the env construction and destruction.

Additionally, there were places in the code that had an Env when they required a FileSystem.  Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using env->GetFileSystem().  These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
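
As a hedged sketch of that pattern (exact signatures per my reading of `rocksdb/env.h` and `rocksdb/file_system.h`), a caller that needs a FileSystem can ask the Env for its own file system instead of wrapping the Env in a LegacyFileSystemWrapper:

```c++
#include <memory>
#include <string>

#include "rocksdb/env.h"
#include "rocksdb/file_system.h"

using namespace ROCKSDB_NAMESPACE;

IOStatus OpenForWrite(Env* env, const std::string& fname,
                      std::unique_ptr<FSWritableFile>* file) {
  // No LegacyFileSystemWrapper indirection: use the Env's own FileSystem.
  std::shared_ptr<FileSystem> fs = env->GetFileSystem();
  return fs->NewWritableFile(fname, FileOptions(), file, /*dbg=*/nullptr);
}
```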

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703

Reviewed By: zhichao-cao

Differential Revision: D25762190

Pulled By: anand1976

fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
2021-01-06 10:49:32 -08:00
mrambacher
c1a65a4de4 Make StringEnv, StringSink, StringSource use FS classes (#7786)
Summary:
Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones.  The StringSink and StringSource classes were changed to be based on the corresponding FS file classes.

Part of a cleanup to use the newer interfaces.  This change also eliminates some of the casts/wrappers to LegacyFile classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786

Reviewed By: jay-zhuang

Differential Revision: D25761460

Pulled By: anand1976

fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f
2021-01-04 16:01:01 -08:00
Andrew Kryczka
225abffd8f Verify file checksum generator name (#7824)
Summary:
Previously we only had a debug assertion to check that the right generator was being used for verification. However, a user hit a problem in production where their factory was creating the wrong generator for some files, leading to checksum mismatches. It would have been easier to debug if we verified in optimized builds that the generator with the proper name is used. This PR adds such verification.
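
To make the contract concrete, here is a toy custom factory/generator pair (class and method names per my reading of `rocksdb/file_checksum.h`); the point of this change is that the `Name()` reported by the generator the factory returns is now checked against the name recorded for the file even in optimized builds:

```c++
#include <memory>
#include <string>

#include "rocksdb/file_checksum.h"

using namespace ROCKSDB_NAMESPACE;

class MyChecksumGen : public FileChecksumGenerator {
 public:
  void Update(const char* /*data*/, size_t /*n*/) override { /* feed data */ }
  void Finalize() override { /* finish the computation */ }
  std::string GetChecksum() const override { return checksum_; }
  // Verification compares this name with the one stored for the file.
  const char* Name() const override { return "MyChecksum"; }

 private:
  std::string checksum_;
};

class MyChecksumGenFactory : public FileChecksumGenFactory {
 public:
  std::unique_ptr<FileChecksumGenerator> CreateFileChecksumGenerator(
      const FileChecksumGenContext& /*context*/) override {
    // Returning a generator with a mismatched Name() is the bug now caught.
    return std::unique_ptr<FileChecksumGenerator>(new MyChecksumGen());
  }
  const char* Name() const override { return "MyChecksumGenFactory"; }
};
```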

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7824

Reviewed By: zhichao-cao

Differential Revision: D25740254

Pulled By: ajkr

fbshipit-source-id: a6231521747605021bad3231484b5d4f99f4044f
2021-01-04 11:51:50 -08:00
mrambacher
0bad2b4308 Ignore the OnAddFile Status for SSTFileManager (#7826)
Summary:
The returned Status is ignored here as some stress tests are failing, presumably when attempting to add an empty file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7826

Reviewed By: jay-zhuang

Differential Revision: D25742931

fbshipit-source-id: a1fcd620d9472993a009929306dfc421f93eb43b
2021-01-04 11:08:28 -08:00
Zhichao Cao
44ebc24dca Add rate_limiter to GenerateOneFileChecksum (#7811)
Summary:
In GenerateOneFileChecksum(), RocksDB reads the file and computes its checksum. A rate limiter can be passed to the constructor of RandomAccessFileReader so that read I/O can be rate limited.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7811

Test Plan: make check

Reviewed By: cheng-chang

Differential Revision: D25699896

Pulled By: zhichao-cao

fbshipit-source-id: e2688bc1126c543979a3bcf91dda784bd7b74164
2020-12-26 22:07:24 -08:00
mrambacher
55e99688cc No elide constructors (#7798)
Summary:
Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds.  This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it.  In this case,  without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798

Reviewed By: jay-zhuang

Differential Revision: D25680451

Pulled By: pdillinger

fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917
2020-12-23 16:55:53 -08:00
Akanksha Mahajan
30a5ed9c53 Update "num_data_read" stat in RetrieveMultipleBlocks (#7770)
Summary:
RetrieveMultipleBlocks, which is used by MultiGet to read data blocks, was not updating the num_data_read stat in GetContextStats.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7770

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D25538982

Pulled By: akankshamahajan15

fbshipit-source-id: e3daedb035b1be8ab6af6f115cb3793ccc7b1ec6
2020-12-23 15:16:46 -08:00
Peter Dillinger
a727efca99 Remove flaky, redundant, and dubious DBTest.SparseMerge (#7800)
Summary:
This test would occasionally fail like this:

    WARNING: c:\users\circleci\project\db\db_test.cc(1343): error: Expected:
    (dbfull()->TEST_MaxNextLevelOverlappingBytes(handles_[1])) <= (20 * 1048576), actual: 33501540 vs 20971520

And being a super old test, it's not structured in a sound way. And it appears that DBTest2.MaxCompactionBytesTest is a better test of what SparseMerge was intended to test. In fact, SparseMerge fails if I set

    options.max_compaction_bytes = options.target_file_size_base * 1000;

Thus, we are removing this negative-value test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7800

Test Plan: Q.E.D.

Reviewed By: ajkr

Differential Revision: D25693366

Pulled By: pdillinger

fbshipit-source-id: 9da07d4dce0559547fc938b2163a2015e956c548
2020-12-23 11:08:12 -08:00
mrambacher
02418194d7 Add more tests for assert status checked (#7524)
Summary:
Added 10 more tests that pass the ASSERT_STATUS_CHECKED test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7524

Reviewed By: akankshamahajan15

Differential Revision: D24323093

Pulled By: ajkr

fbshipit-source-id: 28d4106d0ca1740c3b896c755edf82d504b74801
2020-12-22 23:45:58 -08:00
Adam Retter
81592d9ffa Add more tests to ASSERT_STATUS_CHECKED (4) (#7718)
Summary:
Fourth batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_range_del_test
* db_write_test
* random_access_file_reader_test
* merge_test
* external_sst_file_test
* write_buffer_manager_test
* stringappend_test
* deletefile_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7718

Reviewed By: pdillinger

Differential Revision: D25671608

fbshipit-source-id: 687a794e98a9e0cd5428ead9898ef05ced987c31
2020-12-22 15:09:39 -08:00
cheng-chang
41ff125a8a SyncWAL shouldn't be supported in compacted db (#7788)
Summary:
`CompactedDB` is a kind of read-only DB, so it shouldn't support `SyncWAL`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7788

Test Plan: watch existing tests to pass.

Reviewed By: akankshamahajan15

Differential Revision: D25661209

Pulled By: cheng-chang

fbshipit-source-id: 9eb2cc3f73736dcc205c8410e5944aa203f002d3
2020-12-22 14:53:43 -08:00
sdong
9057d0a079 Minimize Timing Issue in test WALTrashCleanupOnOpen (#7796)
Summary:
We saw DBWALTestWithParam/DBWALTestWithParam.WALTrashCleanupOnOpen sometimes fail with:

db/db_sst_test.cc:575: Failure
Expected: (trash_log_count) >= (1), actual: 0 vs 1

The suspicion is that delete scheduling actually deleted all trash files based on the rate, which is not expected. This can be reproduced if we manually add a sleep of several seconds after the DB is closed. Minimize its chance by setting the delete rate to the lowest possible value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7796

Test Plan: The test doesn't fail with the manual sleeping anymore

Reviewed By: anand1976

Differential Revision: D25675000

fbshipit-source-id: a39fd05e1a83719c41014e48843792e752368e22
2020-12-22 14:44:08 -08:00
Peter Dillinger
4d1ac19e3d aggregated-table-properties with GetMapProperty (#7779)
Summary:
So that we can more easily get aggregate live table data such
as total filter, index, and data sizes.

Also adds ldb support for getting properties

Also fixed some missing/inaccurate related comments in db.h

For example:

    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
    rocksdb.aggregated-table-properties.data_size: 102871
    rocksdb.aggregated-table-properties.filter_size: 0
    rocksdb.aggregated-table-properties.index_partitions: 0
    rocksdb.aggregated-table-properties.index_size: 2232
    rocksdb.aggregated-table-properties.num_data_blocks: 100
    rocksdb.aggregated-table-properties.num_deletions: 0
    rocksdb.aggregated-table-properties.num_entries: 15000
    rocksdb.aggregated-table-properties.num_merge_operands: 0
    rocksdb.aggregated-table-properties.num_range_deletions: 0
    rocksdb.aggregated-table-properties.raw_key_size: 288890
    rocksdb.aggregated-table-properties.raw_value_size: 198890
    rocksdb.aggregated-table-properties.top_level_index_size: 0
    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
    rocksdb.aggregated-table-properties-at-level1.data_size: 80909
    rocksdb.aggregated-table-properties-at-level1.filter_size: 0
    rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
    rocksdb.aggregated-table-properties-at-level1.index_size: 1787
    rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
    rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
    rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
    rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
    rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
    rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
    $
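
The same aggregates are available in process; a short usage sketch, assuming the `DB::Properties::kAggregatedTableProperties` name and the `GetMapProperty()` overload shown here:

```c++
#include <iostream>
#include <map>
#include <string>

#include "rocksdb/db.h"

void DumpAggregatedTableProperties(ROCKSDB_NAMESPACE::DB* db) {
  std::map<std::string, std::string> props;
  if (db->GetMapProperty(
          ROCKSDB_NAMESPACE::DB::Properties::kAggregatedTableProperties,
          &props)) {
    for (const auto& kv : props) {
      std::cout << kv.first << ": " << kv.second << "\n";
    }
  }
}
```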

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779

Test Plan: Added a test to ldb_test.py

Reviewed By: jay-zhuang

Differential Revision: D25653103

Pulled By: pdillinger

fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 08:00:14 -08:00
Cheng Chang
fbce7a3808 Track WAL obsoletion when updating empty CF's log number (#7781)
Summary:
In the write path, there is an optimization: when a new WAL is created during SwitchMemtable, we update the internal log number of the empty column families to the new WAL. `FindObsoleteFiles` marks a WAL as obsolete if the WAL's log number is less than `VersionSet::MinLogNumberWithUnflushedData`. After updating the empty column families' internal log number, `VersionSet::MinLogNumberWithUnflushedData` might change, so some WALs might become obsolete to be purged from disk.

For example, consider there are 3 column families: 0, 1, 2:
1. initially, all the column families' log number is 1;
2. write some data to cf0, and flush cf0, but the flush is pending;
3. now a new WAL 2 is created;
4. write data to cf1 and WAL 2, now cf0's log number is 1, cf1's log number is 2, cf2's log number is 2 (because cf1 and cf2 are empty, so their log numbers will be set to the highest log number);
5. now cf0's flush hasn't finished, flush cf1, a new WAL 3 is created, and cf1's flush finishes, now cf0's log number is 1, cf1's log number is 3, cf2's log number is 3, since WAL 1 still contains data for the unflushed cf0, no WAL can be deleted from disk;
6. now cf0's flush finishes, cf0's log number is 2 (because when cf0 was switching memtable, WAL 3 does not exist yet), cf1's log number is 3, cf2's log number is 3, so WAL 1 can be purged from disk now, but WAL 2 still cannot because `MinLogNumberToKeep()` is 2;
7. write data to cf2 and WAL 3, because cf0 is empty, its log number is updated to 3, so now cf0's log number is 3, cf1's log number is 3, cf2's log number is 3;
8. now if the background threads want to purge obsolete files from disk, WAL 2 can be purged because `MinLogNumberToKeep()` is 3. But there are only two flush results written to MANIFEST: the first is for flushing cf1, with `MinLogNumberToKeep` 1; the second is for flushing cf0, with `MinLogNumberToKeep` 2. So without this PR, if the DB crashes at this point and tries to recover, `WalSet` will still expect WAL 2 to exist.

When WAL tracking is enabled, we assume WALs will only become obsolete after a flush result is written to MANIFEST in `MemtableList::TryInstallMemtableFlushResults` (or its atomic flush counterpart). The above situation breaks this assumption.

This PR tracks WAL obsoletion if necessary before updating the empty column families' log numbers.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7781

Test Plan:
watch existing tests and stress tests to pass.
`make -j48 blackbox_crash_test` on devserver

Reviewed By: ltamasi

Differential Revision: D25631695

Pulled By: cheng-chang

fbshipit-source-id: ca7fff967bdb42204b84226063d909893bc0a4ec
2020-12-18 21:34:36 -08:00
Burton Li
2021392e25 Do not full scan obsolete files on compaction busy (#7739)
Summary:
When ConcurrentTaskLimiter is enabled and there are too many outstanding compactions, BackgroundCompaction returns Status::Busy(), which shouldn't be treated as a compaction failure.
This caused a performance issue when the number of outstanding compactions reached the limit.
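
A toy illustration of the distinction this fix draws: a Busy() result from the task limiter means the compaction was throttled, not that it failed.

```c++
#include "rocksdb/status.h"

using ROCKSDB_NAMESPACE::Status;

bool IsRealCompactionFailure(const Status& s) {
  // Busy (e.g. ConcurrentTaskLimiter throttling) should not count as failure.
  return !s.ok() && !s.IsBusy();
}
```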

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7739

Reviewed By: cheng-chang

Differential Revision: D25508319

Pulled By: ltamasi

fbshipit-source-id: 3b181b16ada0ca3393cfa3a7412985764e79c719
2020-12-15 13:51:10 -08:00
Jay Zhuang
a0e4421e81 Log sst number in Corruption status (#7767)
Summary:
The SST file number in a Corruption status would be very useful for debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7767

Reviewed By: zhichao-cao

Differential Revision: D25485872

Pulled By: jay-zhuang

fbshipit-source-id: 67315b582cedeefbce6676015303ebe5bf6526a3
2020-12-14 14:07:52 -08:00
Levi Tamasi
1afbd1948c Add initial blob support to batched MultiGet (#7766)
Summary:
The patch adds initial support for reading blobs to the batched `MultiGet` API.
The current implementation simply retrieves the blob values as the blob indexes
are encountered; that is, reads from blob files are currently not batched. (This
will be optimized in a separate phase.) In addition, the patch removes some dead
code related to BlobDB from the batched `MultiGet` implementation, namely the
`is_blob` / `is_blob_index` flags that are passed around in `DBImpl` and `MemTable` /
`MemTableListVersion`. These were never hooked up to anything and wouldn't
work anyways, since a single flag is not sufficient to communicate the "blobness"
of multiple key-values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7766

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D25479290

Pulled By: ltamasi

fbshipit-source-id: 7aba2d290e31876ee592bcf1adfd1018713a8000
2020-12-14 13:48:22 -08:00
Peter Dillinger
b1ee191405 Fix memory leak for ColumnFamily drop with live iterator (#7749)
Summary:
Uncommon bug seen by ASAN with
ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily, if the last two
references to a ColumnFamilyData are both SuperVersions (during
InstallSuperVersion). The fix is to use UnrefAndTryDelete even in
SuperVersion::Cleanup but with a parameter to avoid re-entering Cleanup
on the same SuperVersion being cleaned up.

ColumnFamilyData::Unref is considered unsafe, so it has been removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7749

Test Plan: ./column_family_test --gtest_filter=*LiveIter* --gtest_repeat=100

Reviewed By: jay-zhuang

Differential Revision: D25354304

Pulled By: pdillinger

fbshipit-source-id: e78f3a3f67c40013b8432f31d0da8bec55c5321c
2020-12-11 11:18:21 -08:00
Cheng Chang
fd7d8dc56e Do not log unnecessary WAL obsoletion events (#7765)
Summary:
min_wal_number_to_keep should not decrease. If it does not increase, then there is no need to log the WAL obsoletion in MANIFEST, since a previous one has already been logged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7765

Test Plan: watch existing tests and stress tests to pass

Reviewed By: pdillinger

Differential Revision: D25462542

Pulled By: cheng-chang

fbshipit-source-id: 0085fcb6edf5cf2b0fc32f9932a7566f508768ff
2020-12-10 12:55:49 -08:00
Adam Retter
8ff6557e7f Add further tests to ASSERT_STATUS_CHECKED (2) (#7698)
Summary:
Second batch of adding more tests to ASSERT_STATUS_CHECKED.

* external_sst_file_basic_test
* checkpoint_test
* db_wal_test
* db_block_cache_test
* db_logical_block_size_cache_test
* db_blob_index_test
* optimistic_transaction_test
* transaction_test
* point_lock_manager_test
* write_prepared_transaction_test
* write_unprepared_transaction_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7698

Reviewed By: cheng-chang

Differential Revision: D25441664

Pulled By: pdillinger

fbshipit-source-id: 9e78867f32321db5d4833e95eb96c5734526ef00
2020-12-09 21:21:16 -08:00
Cheng Chang
80159f6e0b Carry over min_log_number_to_keep_2pc in new MANIFEST (#7747)
Summary:
When two phase commit is enabled, `VersionSet::min_log_number_to_keep_2pc` is set during flush.
But when a new MANIFEST is created, the `min_log_number_to_keep_2pc` is not carried over to the new MANIFEST. So if a new MANIFEST is created and then DB is reopened, the `min_log_number_to_keep_2pc` will be lost.  This may cause DB recovery errors.
The bug is reproduced in a new unit test in `version_set_test.cc`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7747

Test Plan: The new unit test in `version_set_test.cc` should pass.

Reviewed By: jay-zhuang

Differential Revision: D25350661

Pulled By: cheng-chang

fbshipit-source-id: eee890d5b19f15769069670692e270ae31044ece
2020-12-09 19:07:25 -08:00
anand76
8a1488efbf Ensure that MultiGet works properly with compressed cache (#7756)
Summary:
Ensure that when direct IO is enabled and a compressed block cache is
configured, MultiGet inserts compressed data blocks into the compressed
block cache.
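
For reference, a sketch of the configuration this fix targets (using the `block_cache_compressed` option that existed at the time), with lookups issued through MultiGet:

```c++
#include <string>
#include <vector>

#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/table.h"

using namespace ROCKSDB_NAMESPACE;

Options MakeOptions() {
  BlockBasedTableOptions table_options;
  table_options.block_cache = NewLRUCache(64 << 20);             // uncompressed blocks
  table_options.block_cache_compressed = NewLRUCache(64 << 20);  // compressed blocks
  Options options;
  options.use_direct_reads = true;  // direct IO for reads
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));
  return options;
}

std::vector<Status> LookupMany(DB* db, const std::vector<Slice>& keys) {
  std::vector<std::string> values;
  // With this fix, data blocks read here also populate the compressed cache.
  return db->MultiGet(ReadOptions(), keys, &values);
}
```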

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7756

Test Plan: Add unit test to db_basic_test

Reviewed By: cheng-chang

Differential Revision: D25416240

Pulled By: anand1976

fbshipit-source-id: 75d57526370c9c0a45ff72651f3278dbd8a9086f
2020-12-09 17:01:13 -08:00
Cheng Chang
3c2a448856 Add a test for disabling tracking WAL (#7757)
Summary:
If WAL tracking was enabled, then disabled during reopen, the previously tracked WALs should be removed from MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7757

Test Plan: a new unit test `DBBasicTest.DisableTrackWal` is added.

Reviewed By: jay-zhuang

Differential Revision: D25410508

Pulled By: cheng-chang

fbshipit-source-id: 9d8d9e665066135930a7c1035bb8c2f68bded6a0
2020-12-09 16:58:26 -08:00
Cheng Chang
efe827baf0 Always track WAL obsoletion (#7759)
Summary:
Currently, when a WAL becomes obsolete after flushing, if VersionSet::WalSet does not contain the WAL, we do not track the WAL obsoletion event in MANIFEST.

But consider this case:
* WAL 10 is synced, a VersionEdit is LogAndApplied to MANIFEST to log this WAL addition event, but the VersionEdit is not applied to WalSet yet since its corresponding ManifestWriter is still pending in the write queue;
* Since the above ManifestWriter is blocking, the LogAndApply will block on a conditional variable and release the db mutex, so another LogAndApply can proceed to enqueue other VersionEdits concurrently;
* Now flush happens, and WAL 10 becomes obsolete, although WalSet does not contain WAL 10 yet, we should call LogAndApply to enqueue a VersionEdit to indicate the obsoletion of WAL 10;
* otherwise, when the queued edit indicating WAL 10 addition is logged to MANIFEST, and DB crashes and reopens, the WAL 10 might have been removed from disk, but it still exists in MANIFEST.

This PR changes the behavior to: always `LogAndApply` any WAL addition or obsoletion event, without considering the order issues caused by concurrency, but when applying the edits to `WalSet`, do not add the WALs if they are already obsolete. In this approach, the logical events of WAL addition and obsoletion are always tracked in MANIFEST, so we can inspect the MANIFEST and know all the previous WAL events, but we choose to ignore certain events due to the concurrency issues such as the case above, or the case in https://github.com/facebook/rocksdb/pull/7725.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7759

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D25423089

Pulled By: cheng-chang

fbshipit-source-id: 9cb9a7fbc1875bf954f2a42f9b6cfd6d49a7b21c
2020-12-09 16:02:12 -08:00
Adam Retter
7b2216c906 Add further tests to ASSERT_STATUS_CHECKED (1) (#7679)
Summary:
First batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_iterator_test
* db_memtable_test
* db_merge_operator_test
* db_merge_operand_test
* write_batch_test
* write_batch_with_index_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7679

Reviewed By: ajkr

Differential Revision: D25399270

Pulled By: pdillinger

fbshipit-source-id: 3017d0a686aec5cd2d743fc2acbbf75df239f3ba
2020-12-08 15:55:04 -08:00
Cheng Chang
07030c6f4a Do not track obsolete WALs in MANIFEST even if they are synced (#7725)
Summary:
Consider the case:
1. All column families are flushed, so all WALs become obsolete, but no WAL is removed from disk yet because the removal is asynchronous, a VersionEdit is written to MANIFEST indicating that WALs before a certain WAL number are obsolete, let's say this number is 3;
2. `SyncWAL` is called, so all the on-disk WALs are synced, and if track_and_verify_wal_in_manifest=true, the WALs will be tracked in MANIFEST, let's say the WAL numbers are 1 and 2;
3. DB crashes;
4. During DB recovery, when replaying MANIFEST, we first see that WAL with number < 3 are obsolete, then we see that WAL 1 and 2 are synced, so according to current implementation of `WalSet`, the `WalSet` will be recovered to include WAL 1 and 2;
5. WAL 1 and 2 are asynchronously deleted from disk, then the WAL verification algorithm fails with `Corruption: missing WAL`.

The above case is reproduced in a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal`.

The fix is to maintain the upper bound of the obsolete WAL numbers, any WAL with number less than the maintained number is considered to be obsolete, so shouldn't be tracked even if they are later synced. The number is maintained in `WalSet`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7725

Test Plan:
1. a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal` is added.
2. run `make crash_test` on devserver.

Reviewed By: riversand963

Differential Revision: D25238914

Pulled By: cheng-chang

fbshipit-source-id: f5dccd57c3d89f19565ec5731f2d42f06d272b72
2020-12-08 10:58:04 -08:00
Yanqin Jin
11c4be2222 Refactor ProcessManifestWrites a little bit (#7751)
Summary:
This PR removes a nested loop inside ProcessManifestWrites. The new
implementation has the same behavior as the old code with simpler logic
and lower complexity.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7751

Test Plan:
make check
Run make crash_test on devserver and succeeds 3 times.

Reviewed By: ltamasi

Differential Revision: D25363526

Pulled By: riversand963

fbshipit-source-id: 27e681949dacd7501a752e5e517b9e85b54ccb2e
2020-12-08 02:37:38 -08:00
mrambacher
db03172d08 Change ErrorHandler methods to return const Status& (#7539)
Summary:
This change eliminates the need for a lot of the PermitUncheckedError calls on return from ErrorHandler methods.  The calls are no longer needed as the status is returned as a reference rather than a copy.  Additionally, this means that the originating status (recovery_error_, bg_error_) is not cleared implicitly as a result of calling one of these methods.

For this class, I do not know if the proper behavior should be to call PermitUncheckedError in the destructor or if the checked state should be cleared when the status is cleared.  I did tests both ways.  Without the code in the destructor, the status will need to be cleared in at least some of the places where it is set to OK.  When running tests, I found no instances where this class was destructed with a non-OK, non-checked Status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7539

Reviewed By: anand1976

Differential Revision: D25340565

Pulled By: pdillinger

fbshipit-source-id: 1730c035c81a475875ea745226112030ec25136c
2020-12-07 20:11:35 -08:00
Levi Tamasi
8a06fe278f Do not use ASSERT_OK in child threads in ExternalSstFileTest.PickedLevelBug (#7754)
Summary:
`googletest` uses exceptions to communicate assertion failures when
`GTEST_THROW_ON_FAILURE` is set, which does not go well with
`std::thread`s, since an exception escaping the top-level function of an
`std::thread` object or an `std::thread` getting destroyed without
having been `join`ed or `detach`ed first results in a call to
`std::terminate`. The patch fixes this by moving the `Status` assertions
of background operations in `ExternalSstFileTest.PickedLevelBug` to the
main thread.
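
The pattern, roughly (not the exact test code): worker threads only record their Status results, and the assertions run on the main thread after join():

```c++
#include <thread>
#include <vector>

#include <gtest/gtest.h>

#include "rocksdb/status.h"

using ROCKSDB_NAMESPACE::Status;

TEST(BackgroundWorkPattern, AssertOnlyOnMainThread) {
  std::vector<Status> results(4);
  std::vector<std::thread> threads;
  for (size_t i = 0; i < results.size(); ++i) {
    threads.emplace_back([&results, i]() {
      // Do the background work and record the outcome; no ASSERT_* in here.
      results[i] = Status::OK();
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  // Assert only after every thread has been joined.
  for (const Status& s : results) {
    ASSERT_TRUE(s.ok());
  }
}
```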

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7754

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25383808

Pulled By: ltamasi

fbshipit-source-id: 32fb2721e5169ec898d218900bc0d83eead45d03
2020-12-07 17:37:17 -08:00
Akanksha Mahajan
20c7d7c58a Handling misuse of snprintf return value (#7686)
Summary:
Handle misuse of the snprintf return value to avoid out-of-bound
reads/writes.
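
A minimal example of the pattern being fixed: snprintf returns the number of characters that would have been written (or a negative value on error), so using the raw return value as a buffer offset can walk past the end of the buffer.

```c++
#include <cstddef>
#include <cstdio>

// Appends "name=v;" to buf and returns the new used length, clamping on
// truncation instead of trusting the raw snprintf return value.
size_t SafeAppend(char* buf, size_t cap, size_t used, const char* name, int v) {
  if (used >= cap) {
    return used;
  }
  int n = std::snprintf(buf + used, cap - used, "%s=%d;", name, v);
  if (n < 0) {
    return used;  // encoding error: leave the buffer as-is
  }
  if (static_cast<size_t>(n) >= cap - used) {
    return cap - 1;  // truncated: snprintf null-terminated at cap - 1
  }
  return used + static_cast<size_t>(n);
}
```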

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7686

Test Plan: make check -j64

Reviewed By: riversand963

Differential Revision: D25030831

Pulled By: akankshamahajan15

fbshipit-source-id: 1a1d181c067c78b94d720323ae00b79566b57cfa
2020-12-07 13:43:55 -08:00
Akanksha Mahajan
1df8584896 Fix unit test failure ppc64le in travis (#7752)
Summary:
Added a fix for the failure of
DBTest2.PartitionedIndexUserToInternalKey on ppc64le in travis.
Closes https://github.com/facebook/rocksdb/issues/7746

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7752

Test Plan:
Ran travis job multiple times and it passed. Will keep
watching the travis job after this patch.

Reviewed By: pdillinger

Differential Revision: D25373130

Pulled By: akankshamahajan15

fbshipit-source-id: fa0e3f85f75b687415044a506e42cc38ead87975
2020-12-07 10:24:33 -08:00
Yanqin Jin
eee0af9af1 Add full_history_ts_low to column family (#7740)
Summary:
Following https://github.com/facebook/rocksdb/issues/7655 and https://github.com/facebook/rocksdb/issues/7657, this PR adds `full_history_ts_low_` to `ColumnFamilyData`.
`ColumnFamilyData::full_history_ts_low_` will be used to create `FlushJob` and `CompactionJob`.

`ColumnFamilyData::full_history_ts_low` is persisted to the MANIFEST file. An application can only
increase its value. Consider the following case:

>
> The database has a key at ts=950. `full_history_ts_low` is first set to 1000, and then a GC is triggered
> and cleans up all data older than 1000. If the application sets `full_history_ts_low` to 900 afterwards,
> and tries to read at ts=960, the key at 950 is not seen. From the perspective of the read, the result
> is hard to reason about. For simplicity, we just do not allow decreasing full_history_ts_low for now.
>

During recovery, the value of `full_history_ts_low` is restored for each column family if applicable. Note that
version edits in the MANIFEST file for the same column family may have `full_history_ts_low` unsorted due
to the potential interleaving of `LogAndApply` calls. Only the max will be used to restore the state of the
column family.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7740

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D25296217

Pulled By: riversand963

fbshipit-source-id: 24acda1df8262cd7cfdc6ce7b0ec56438abe242a
2020-12-05 14:18:22 -08:00
Levi Tamasi
61932cdf1d Add blob support to DBIter (#7731)
Summary:
The patch adds iterator support to the integrated BlobDB implementation.
Whenever a blob reference is encountered during iteration, the corresponding
blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
(formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
BlobDB implementation, which has its own blob file handling/blob retrieval logic.)

In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
message `"BlobDB does not support merge operator."` when encountering a
blob reference while performing a merge (instead of potentially returning a
message that implies the database should be opened using the stacked BlobDB's
`Open`.)

TODO: We can implement support for lazily retrieving the blob value (or in other
words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
supports lazy values).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25256293

Pulled By: ltamasi

fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
2020-12-04 21:29:38 -08:00
Zhichao Cao
e102de7318 Fix assert(cfd->imm()->NumNotFlushed() > 0) in FlushMemtable (#7744)
Summary:
In the current code base, in FlushMemtable, when `(Flush_reason == FlushReason::kErrorRecoveryRetryFlush && (!cfd->mem()->IsEmpty() || !cached_recoverable_state_empty_.load()))`, we assert that cfd->imm()->NumNotFlushed() > 0. However, there are some corner cases that can fail this assert: 1) if there are multiple CFs, some have immutable memtables and some don't; in ResumeImpl, all CFs will call FlushMemtable, which will hit the assert. 2) A regular flush is scheduled and running while the resume thread is waiting; new KVs are inserted and SchedulePendingFlush is called. The regular flush will continue to call MaybeScheduleFlushAndCompaction until all the immutable memtables are flushed. When the regular flush ends and the auto resume thread starts to schedule new flushes, cfd->imm()->NumNotFlushed() can be 0.

Remove the assert and add comments instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7744

Test Plan: make check and pass the stress test

Reviewed By: riversand963

Differential Revision: D25340573

Pulled By: zhichao-cao

fbshipit-source-id: eac357bdace660247c197f01a9ff6857e3c97672
2020-12-04 20:31:39 -08:00
Zhichao Cao
eb5a8c06dd Fix the thread wait case in error_handler (#7700)
Summary:
In the error_handler auto recovery case, if recovery_in_prog_ is false, the recovery has finished or failed. In this case, the auto recovery thread should have finished its execution, so recovery_thread_ should be null. However, in some cases it is not null; the caller should not return directly. Instead, it should wait for a while and create a new thread to execute the new recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7700

Test Plan: make check, error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D25098233

Pulled By: zhichao-cao

fbshipit-source-id: 5a1cba234ca18f6dd5d1be88e02d66e1d5ce931b
2020-12-04 14:58:37 -08:00
Cheng Chang
70f2e0916a Write min_log_number_to_keep to MANIFEST during atomic flush under 2 phase commit (#7570)
Summary:
When 2 phase commit is enabled, if there is prepared data in a WAL, the WAL should be kept, and the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, however, this information was not written to MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570

Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds.

Reviewed By: riversand963

Differential Revision: D24394222

Pulled By: cheng-chang

fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116
2020-12-03 19:22:24 -08:00
Zhichao Cao
29e8f6a698 Add kManifestWriteNoWAL to BackgroundErrorReason to handle Flush IO Error when WAL is disabled (#7693)
Summary:
In the current code base, all manifest writes with IO errors will be set with reason BackgroundErrorReason::kManifestWrite, which will be mapped to kHardError if the IO error is retryable. However, if the system does not use the WAL, all retryable IO errors should be mapped to kSoftError. This PR handles this special case by adding kManifestWriteNoWAL to BackgroundErrorReason.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7693

Test Plan: make check, add new testing cases to error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D25066204

Pulled By: zhichao-cao

fbshipit-source-id: d59553896c2eac3fb37c05238544d2b265379462
2020-12-02 18:24:01 -08:00
Jay Zhuang
7fec715db4 Make CompactRange and GetApproximateSizes work with timestamp (#7684)
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D25015421

Pulled By: jay-zhuang

fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
2020-12-02 13:00:53 -08:00
Yanqin Jin
e062a719cc Fix assertion failure in bg flush (#7362)
Summary:
https://github.com/facebook/rocksdb/issues/7340 reports and reproduces an assertion failure caused by a combination of the following:
- atomic flush is disabled.
- a column family can appear multiple times in the flush queue at the same time. This behavior was introduced in release 5.17.

Consequently, it is possible that two flushes race with each other. One bg flush thread flushes all memtables. The other thread calls `FlushMemTableToOutputFile()` afterwards, and hits the assertion error below.

```
  assert(cfd->imm()->NumNotFlushed() != 0);
  assert(cfd->imm()->IsFlushPending());
```

Fix this by reverting the behavior. In non-atomic-flush case, a column family can appear in the flush queue at most once at the same time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7362

Test Plan:
make check
Also run stress test successfully for 10 times.
```
make crash_test
```

Reviewed By: ajkr

Differential Revision: D25172996

Pulled By: riversand963

fbshipit-source-id: f1559b6366cc609e961e3fc83fae548f1fad08ce
2020-12-02 09:31:14 -08:00
Jay Zhuang
9e1640403a Exclude timestamp from prefix extractor (#7668)
Summary:
The timestamp should not be included in the prefix extractor, as we discussed here: https://github.com/facebook/rocksdb/pull/7589#discussion_r511068586

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7668

Test Plan: added unittest

Reviewed By: riversand963

Differential Revision: D24966265

Pulled By: jay-zhuang

fbshipit-source-id: 0dae618c333d4b7942a40d556535a1795e060aea
2020-12-01 14:07:15 -08:00
Andrew Kryczka
eb65d673fe Fix kPointInTimeRecovery handling of truncated WAL (#7701)
Summary:
WAL may be truncated to an incomplete record due to crash while writing
the last record or corruption. In the former case, no hole will be
produced since no ACK'd data was lost. In the latter case, a hole could
be produced without this PR since we proceeded to recover the next WAL
as if nothing happened. This PR changes the record reading code to
always report a corruption for incomplete records in
`kPointInTimeRecovery` mode, and the upper layer will only ignore them
if the next WAL has consecutive seqnum (i.e., we are guaranteed no
hole).

While this solves the hole problem for the case of incomplete
records, the possibility is still there if the WAL is corrupted by
truncation to an exact record boundary. This PR also regresses how much data
can be recovered when writes are mixed with/without
`WriteOptions::disableWAL`, as then we can not distinguish between a
seqnum gap caused by corruption and a seqnum gap caused by a `disableWAL` write.
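
For context, a small sketch of opening a DB in the recovery mode this change affects; an incomplete trailing WAL record is now reported as corruption unless the next WAL's consecutive seqnums prove there is no hole.

```c++
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

Status OpenWithPointInTimeRecovery(const std::string& path, DB** db) {
  Options options;
  options.create_if_missing = true;
  options.wal_recovery_mode = WALRecoveryMode::kPointInTimeRecovery;
  return DB::Open(options, path, db);
}
```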

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7701

Test Plan:
Interestingly there already was a test for this case
(`DBWALTestWithParams.kPointInTimeRecovery`); it just had a typo bug in
the verification that prevented it from noticing holes in recovery.

Reviewed By: anand1976

Differential Revision: D25111765

Pulled By: ajkr

fbshipit-source-id: 5e330b13b1ee2b5be096cea9d0ff6075843e57b6
2020-11-30 18:11:38 -08:00
Levi Tamasi
51a8dc6d14 Integrated blob garbage collection: relocate blobs (#7694)
Summary:
The patch adds basic garbage collection support to the integrated BlobDB
implementation. Valid blobs residing in the oldest blob files are relocated
as they are encountered during compaction. The threshold that determines
which blob files qualify is computed based on the configuration option
`blob_garbage_collection_age_cutoff`, which was introduced in https://github.com/facebook/rocksdb/issues/7661 .
Once a blob is retrieved for the purposes of relocation, it passes through the
same logic that extracts large values to blob files in general. This means that
if, for instance, the size threshold for key-value separation (`min_blob_size`)
got changed or writing blob files got disabled altogether, it is possible for the
value to be moved back into the LSM tree. In particular, one way to re-inline
all blob values if needed would be to perform a full manual compaction with
`enable_blob_files` set to `false`, `enable_blob_garbage_collection` set to
`true`, and `blob_file_garbage_collection_age_cutoff` set to `1.0`.
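
As a sketch of the re-inlining recipe mentioned above (option names as described in the summary), a minimal sequence might look like this:

```c++
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

Status ReinlineAllBlobs(const std::string& path) {
  Options options;
  options.enable_blob_files = false;                 // stop writing new blob files
  options.enable_blob_garbage_collection = true;     // relocate blobs during compaction
  options.blob_garbage_collection_age_cutoff = 1.0;  // every blob file qualifies
  DB* db = nullptr;
  Status s = DB::Open(options, path, &db);
  if (!s.ok()) {
    return s;
  }
  // A full manual compaction moves surviving values back into the LSM tree.
  s = db->CompactRange(CompactRangeOptions(), nullptr, nullptr);
  delete db;
  return s;
}
```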

Some TODOs that I plan to address in separate PRs:

1) We'll have to measure the amount of new garbage in each blob file and log
`BlobFileGarbage` entries as part of the compaction job's `VersionEdit`.
(For the time being, blob files are cleaned up solely based on the
`oldest_blob_file_number` relationships.)
2) When compression is used for blobs, the compression type hasn't changed,
and the blob still qualifies for being written to a blob file, we can simply copy
the compressed blob to the new file instead of going through decompression
and compression.
3) We need to update the formula for computing write amplification to account
for the amount of data read from blob files as part of GC.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7694

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25069663

Pulled By: ltamasi

fbshipit-source-id: bdfa8feb09afcf5bca3b4eba2ba72ce2f15cd06a
2020-11-23 21:08:22 -08:00
Andrew Kryczka
dd6b7fc520 Return Status from MemTable mutation functions (#7656)
Summary:
This PR updates `MemTable::Add()`, `MemTable::Update()`, and
`MemTable::UpdateCallback()` to return `Status` objects, and adapts the
client code in `MemTableInserter`. The goal is to prepare these
functions for key-value checksum, where we want to verify key-value
integrity while adding to memtable. After this PR, the memtable mutation
functions can report a failed integrity check by returning `Status::Corruption`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7656

Reviewed By: riversand963

Differential Revision: D24900497

Pulled By: ajkr

fbshipit-source-id: 1a7e80581e3774676f2bbba2f0a0b04890f40009
2020-11-23 16:29:04 -08:00
Yanqin Jin
1a5fc4f577 Port corruption test to use custom env (#7699)
Summary:
Allow corruption_test to run on custom env loaded via
`Env::LoadEnv()`.
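
A hedged sketch of what the ported test can now do, assuming the `Env::LoadEnv` overload that also returns a shared_ptr guard:

```c++
#include <memory>
#include <string>

#include "rocksdb/env.h"

using namespace ROCKSDB_NAMESPACE;

Status GetTestEnv(const std::string& env_uri, Env** env,
                  std::shared_ptr<Env>* guard) {
  if (env_uri.empty()) {
    *env = Env::Default();  // fall back to the built-in Env
    return Status::OK();
  }
  // Load a registered custom Env (e.g. an in-house implementation) by URI.
  return Env::LoadEnv(env_uri, env, guard);
}
```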

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7699

Test Plan:
```
make corruption_test
./corruption_test
```

Also run on in-house custom env.

Reviewed By: zhichao-cao

Differential Revision: D25135525

Pulled By: riversand963

fbshipit-source-id: 7941e7ce342dc88ec2cd63e90f7674a2f57de6b7
2020-11-20 18:40:24 -08:00
Cheng Chang
7169ca9c80 Do not track empty WALs (#7697)
Summary:
An empty WAL won't be backed up by the BackupEngine. So if we track empty WALs in MANIFEST, then when restoring from a backup, it may report corruption because the empty WAL is missing. That report is technically correct, since the WAL is in the main DB but not in the backup DB, yet missing an empty WAL does not logically break DB consistency.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7697

Test Plan: watch existing tests to pass

Reviewed By: pdillinger

Differential Revision: D25077194

Pulled By: cheng-chang

fbshipit-source-id: 01917b57234b92b6063925f2ee9452c5732bdc03
2020-11-18 21:27:54 -08:00
Cheng Chang
8c93b16f02 Track WAL in MANIFEST: Update logic for computing min_log_number_to_keep in atomic flush (#7660)
Summary:
The logic for computing min_log_number_to_keep in atomic flush was incorrect.

For example, when all column families are flushed, the min_log_number_to_keep should be the latest new log. But the incorrect logic calls `PrecomputeMinLogNumberToKeepNon2PC` for each column family and returns the minimum of them. However, `PrecomputeMinLogNumberToKeepNon2PC(cf)` assumes column families other than `cf` are flushed, but in the case where all column families are flushed, this assumption is incorrect.

Without this fix, the WAL referenced by the computed min_log_number_to_keep may actually contain no unflushed data, so the WAL might have actually been deleted from disk on recovery, then an incorrect error `Corruption: missing WAL` will be reported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7660

Test Plan:
run `make crash_test_with_atomic_flush`  on devserver
added a unit test in `db_flush_test`

Reviewed By: riversand963

Differential Revision: D24906265

Pulled By: cheng-chang

fbshipit-source-id: 08deda62e71f67f59e3b7925cdd86dd09bd4f430
2020-11-17 15:55:55 -08:00
Yanqin Jin
869f0538dd Clean up after two test failures in db_basic_test (#7682)
Summary:
In db_basic_test.cc, there are two tests that rely on the underlying
system's `LockFile` support to function correctly:
DBBasicTest.OpenWhenOpen and DBBasicTest.CheckLock. In both tests,
re-opening a db using `DB::Open` is expected to fail because the second
open cannot lock the LOCK file. Some distributed file systems, e.g. HDFS,
do not support POSIX-style file locks. Therefore, these unit tests will cause
assertion failures, and the second `Open` will create a db instance.
Currently, these db instances are not closed after the assertion
failure. Since these db instances are registered with some process-wide, static
data structures, e.g. `PeriodicWorkScheduler::Default()`, they can still be
accessed after the unit tests. However, the `Env` object created for this db
instance is destroyed when the test finishes in `~DBTestBase()`. Consequently,
it causes illegal memory access.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7682

Test Plan:
Run the following on a distributed file system:
```
make check
```

Reviewed By: anand1976

Differential Revision: D25004215

Pulled By: riversand963

fbshipit-source-id: f4327d7716c0e72b13bb43737ec9a5d156da4d52
2020-11-16 22:09:01 -08:00
Andrew Kryczka
1c5f13f2a5 Fail early when merge_operator not configured (#7667)
Summary:
An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in the WAL, which seems like it'd cause a crash loop until they enable `merge_operator`).
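
A hedged illustration of the failure mode (the exact Status code is not shown here): with no merge_operator configured, the Merge() write itself is now rejected rather than surfacing later during reads or compaction.

```c++
#include <cassert>

#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

void MergeWithoutOperator(DB* db) {
  // Assumes db was opened with Options::merge_operator left unset.
  Status s = db->Merge(WriteOptions(), "counter", "1");
  assert(!s.ok());  // the API misuse is reported at write time
}
```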

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667

Reviewed By: riversand963

Differential Revision: D24933360

Pulled By: ajkr

fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f
2020-11-16 20:39:01 -08:00
Cheng Chang
1aae41786a Do not track WAL in MANIFEST when fsync is disabled in a test (#7669)
Summary:
If fsync is disabled in a unit test, then do not track the WAL in MANIFEST, because on DB recovery the WAL might be missing since the directory is not fsynced.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7669

Test Plan: Tests with fsync enabled should pass.

Reviewed By: riversand963

Differential Revision: D24941431

Pulled By: cheng-chang

fbshipit-source-id: ab3ff0f90769795cfb4e4d6dcf084ea5545d1975
2020-11-13 13:37:14 -08:00
Peter Dillinger
60af964372 Experimental (production candidate) SST schema for Ribbon filter (#7658)
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)

Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.

### Benchmarking

```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 18.7508
  Random filter net ns/op: 258.246
    Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 58.4523
  Random filter net ns/op: 363.717
    Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```

168.166 / 238.488 = 0.705  -> 29.5% space reduction

130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. latest Bloom filter (could make that as little as about 2.5x for less space reduction)

### Working around a hashing "flaw"

bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate.  The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant.  Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))

As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)

TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.

### Other related changes

* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658

Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.

Reviewed By: jay-zhuang

Differential Revision: D24899349

Pulled By: pdillinger

fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
2020-11-12 20:46:14 -08:00
Yanqin Jin
76ef894f9f Add full_history_ts_low_ to FlushJob (#7655)
Summary:
https://github.com/facebook/rocksdb/issues/7556 enables `CompactionIterator` to perform garbage collection during compaction according
to a lower bound (user-defined) timestamp `full_history_ts_low_`.
This PR adds a data member `full_history_ts_low_` of type `std::string` to `FlushJob`, and
`full_history_ts_low_` does not change during flush. `FlushJob` will pass a pointer to this data member
to the `CompactionIterator` used during flush.

Also refactored flush_job_test.cc to re-use some existing code, which is actually the majority of this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7655

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D24933340

Pulled By: riversand963

fbshipit-source-id: 2e584bfd0cf6e5c295ab1af264e68e9d6a12fca3
2020-11-12 18:44:34 -08:00
Levi Tamasi
bb69b4ce7f Fix InternalStats::DumpCFStats (#7666)
Summary:
https://github.com/facebook/rocksdb/pull/7461 accidentally broke
`InternalStats::DumpCFStats` by making `DumpCFFileHistogram` overwrite
the output of `DumpCFStatsNoFileHistogram` instead of appending to it,
resulting in only the file histogram related information getting logged.
The patch fixes this by reverting to appending in `DumpCFFileHistogram`.

Fixes https://github.com/facebook/rocksdb/issues/7664 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7666

Test Plan: Ran `make check` and checked the info log of `db_bench`.

Reviewed By: riversand963

Differential Revision: D24929051

Pulled By: ltamasi

fbshipit-source-id: 636a3d5ebb5ce23de4f3fe4f03ad3f16cb2858f8
2020-11-12 17:33:04 -08:00
Yanqin Jin
cf9d8e45c0 Add full_history_ts_low_ to CompactionJob (#7657)
Summary:
https://github.com/facebook/rocksdb/issues/7556 enables `CompactionIterator` to perform garbage collection during compaction according
to a lower bound (user-defined) timestamp `full_history_ts_low_`.

This PR adds a data member `full_history_ts_low_` of type `std::string` to `CompactionJob`, and
`full_history_ts_low_` does not change during compaction. `CompactionJob` will pass a pointer to this
data member to the `CompactionIterator` used during compaction.

Also refactored compaction_job_test.cc to re-use some existing code, which is actually the majority of this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7657

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D24913803

Pulled By: riversand963

fbshipit-source-id: 11ad5329ddac365667152e7b3b02f84182c0ca8e
2020-11-12 11:43:24 -08:00
Levi Tamasi
0dc437d65c Clean up CompactionProxy (#7662)
Summary:
`CompactionProxy` is currently both a concrete class used for actual `Compaction`s
and a base class that `FakeCompaction` (which is used in `compaction_iterator_test`)
is derived from. This is bad from an OO design standpoint, and also results in
`FakeCompaction` containing an (uninitialized and unused) `Compaction*` member.
The patch fixes this by making `CompactionProxy` a pure interface and introducing
a separate concrete class `RealCompaction` for non-test/non-fake compactions. It
also removes an unused parameter from the virtual method `level`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7662

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24907680

Pulled By: ltamasi

fbshipit-source-id: c100ecb1beef4b0ada35e799116c5bda71719ee7
2020-11-12 08:49:35 -08:00
Andrew Kryczka
ec346da98c Always apply bottommost_compression_opts when enabled (#7633)
Summary:
Previously, even when `bottommost_compression_opts`'s `enabled` flag was set, it only took effect when
`bottommost_compression` was also set to something other than `kDisableCompressionOption`.
This wasn't documented and, if we kept the old behavior, it'd make
things complicated like the migration instructions in https://github.com/facebook/rocksdb/issues/7619. We can
simplify the API by making `bottommost_compression_opts` always take
effect when its `enabled` flag is set.
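
A sketch of the simplified behavior: once the enabled flag is set, these options now apply to the bottommost level even if bottommost_compression itself is left at kDisableCompressionOption.

```c++
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

Options BottommostCompressionOpts() {
  Options options;
  options.compression = kLZ4Compression;  // used on all levels by default
  // bottommost_compression stays kDisableCompressionOption; with this change
  // the options below still take effect for the bottommost level.
  options.bottommost_compression_opts.max_dict_bytes = 16 << 10;
  options.bottommost_compression_opts.enabled = true;
  return options;
}
```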

Fixes https://github.com/facebook/rocksdb/issues/7631.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7633

Reviewed By: ltamasi

Differential Revision: D24710358

Pulled By: ajkr

fbshipit-source-id: bbbdf9c1b53c63a4239d902cc3f5a11da1874647
2020-11-11 20:32:28 -08:00
Yanqin Jin
8b6b6aeb1a Refactor with VersionEditHandler (#6581)
Summary:
Added a few classes in the same class hierarchy to remove code duplication and
refactor the logic of reading and processing MANIFEST files.

New classes are as follows.
```
class VersionEditHandlerBase;
class ListColumnFamiliesHandler : VersionEditHandlerBase;
class FileChecksumRetriever : VersionEditHandlerBase;
class DumpManifestHandler : VersionEditHandler;
```
Classes that already existed before this PR are as follows.
```
class VersionEditHandler : VersionEditHandlerBase;
```

With these classes, refactored functions: `VersionSet::Recover()`,
`VersionSet::ListColumnFamilies()`, `VersionSet::DumpManifest()`,
`GetFileChecksumFromManifest()`.

Test Plan (devserver):
```
make check
COMPILE_WITH_ASAN=1 make check
```
These refactored code, especially recovery-related logic, will be tested intensively by
all existing unit tests and stress tests. For example, run
```
make crash_test
```
Verified 3 successful runs on devserver.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6581

Reviewed By: ajkr

Differential Revision: D20616217

Pulled By: riversand963

fbshipit-source-id: 048c7743aa4be2623ccd0cc3e61c0027e604e78b
2020-11-11 08:00:14 -08:00
Peter Dillinger
c57f914482 Use NPHash64 in more places (#7632)
Summary:
Since the hashes should not be persisted in output_validator
nor mock_env.

Also updated NPHash64 to use 64-bit seed, and comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7632

Test Plan:
make check, and new build setting that enables modification
to NPHash64, to check for behavior depending on specific values. Added
that setting to one of the CircleCI configurations.

Reviewed By: jay-zhuang

Differential Revision: D24833780

Pulled By: pdillinger

fbshipit-source-id: 02a57652ccf1ac105fbca79e77875bb7bf7c071f
2020-11-10 23:42:13 -08:00
Yanqin Jin
bcba372352 Report if unpinnable value encountered during backward iteration (#7618)
Summary:
There is undocumented behavior for a certain combination of options and operations:
- inplace_update_support = true, and
- call `SeekForPrev()`, `SeekToLast()`, and/or `Prev()` on unflushed data.

We should stop the backward iteration and report an error of `Status::NotSupported`.
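
A sketch of how an application would observe the new behavior; the DB path is hypothetical, and whether the error actually surfaces depends on the data still being unflushed:

```c++
#include <cassert>
#include <rocksdb/db.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.inplace_update_support = true;
  // In-place updates are incompatible with concurrent memtable writes.
  options.allow_concurrent_memtable_write = false;

  rocksdb::DB* db = nullptr;
  assert(rocksdb::DB::Open(options, "/tmp/inplace_iter_db", &db).ok());
  assert(db->Put(rocksdb::WriteOptions(), "key", "value").ok());  // unflushed

  rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
  it->SeekToLast();  // backward iteration over memtable data
  if (!it->Valid()) {
    // With this change, the unsupported combination is reported here.
    assert(it->status().IsNotSupported());
  }

  delete it;
  delete db;
  return 0;
}
```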

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7618

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D24769619

Pulled By: riversand963

fbshipit-source-id: 81d199fa55ed4739ab10e719cc345a992238ccbb
2020-11-10 17:17:39 -08:00
Jay Zhuang
18aee7db7e Fix a seek issue with prefix extractor and timestamp (#7644)
Summary:
During seek, the prefix comparison should not include the timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7644

Test Plan: added unittest

Reviewed By: riversand963

Differential Revision: D24772066

Pulled By: jay-zhuang

fbshipit-source-id: 3982655a8bf8da256a738e8497b73b3d9bdac92e
2020-11-10 14:53:13 -08:00
Yanqin Jin
9f1c84ca47 Fix a bug in compaction iterator with timestamp (#7645)
Summary:
https://github.com/facebook/rocksdb/issues/7556 introduced support for compaction iterator to perform timestamp-aware garbage collection.
However, there was a bug. The comparison between `ikey_.user_key` and `current_user_key_` should happen
before `key_ = current_key_.SetInternalKey(key_, &ikey_);` (line 336 of compaction_iterator.cc).
Otherwise, after this line, `current_key_` is always the same as `ikey_.user_key`.

This PR also re-arranged the order of some data members because some of them are state variables of `CompactionIterator` while others are inputs from callers.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7645

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D24845028

Pulled By: riversand963

fbshipit-source-id: c7e79914832701462b86867e8463cd463b6c0c25
2020-11-09 18:23:31 -08:00
Cheng Chang
c3911f1a72 Track WAL in MANIFEST: Track deleted WALs in MANIFEST after recovering from the WALs (#7649)
Summary:
After replaying the WALs, the memtables are flushed synchronously to L0 instead of being flushed in the background. Currently, we only track WAL obsoletion events in the code path of background flush jobs. This PR tracks these events in RecoverLogFiles.

After this change, we can enable `track_and_verify_wals_in_manifest` in `db_stress`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7649

Test Plan: `python tools/db_crashtest.py whitebox`

Reviewed By: riversand963

Differential Revision: D24824501

Pulled By: cheng-chang

fbshipit-source-id: 207129f7b845c50b333680ce6818a68a2fad54b9
2020-11-09 10:25:43 -08:00
Cheng Chang
5e794b0841 Fix a recovery corner case (#7621)
Summary:
Consider the following sequence of events:

1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly flushed SST was found in the MANIFEST. Therefore, RocksDB replayed the WAL and tried to flush to an SST file reusing the same file number N. This failed because the file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This probably happened because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.

It is possible that the N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in steps 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of the MANIFEST was appended and became visible after step 5.

After this PR, in step 3, immediately after recovering from the MANIFEST, a new MANIFEST is created; then we find that N.sst is not referenced in the MANIFEST, so we delete it and will not reuse N as a file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure in step 5 won't happen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621

Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.

Reviewed By: riversand963

Differential Revision: D24668144

Pulled By: cheng-chang

fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
2020-11-07 22:23:27 -08:00
Cheng Chang
1e40696dd1 Track WAL in MANIFEST: LogAndApply WAL events to MANIFEST (#7601)
Summary:
When a WAL is synced, an edit is written to MANIFEST.
After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST.
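
From the user's perspective, WAL tracking is turned on with a single option; a minimal sketch (the DB path is hypothetical):

```c++
#include <cassert>
#include <rocksdb/db.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Record WAL lifecycle events (sync/obsoletion) in MANIFEST and verify
  // them against the actual WAL files during recovery.
  options.track_and_verify_wals_in_manifest = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal_tracking_db", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```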

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601

Test Plan:
`track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test.
Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated.
Watch all tests to pass.

Reviewed By: ltamasi

Differential Revision: D24553957

Pulled By: cheng-chang

fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040
2020-11-06 17:22:36 -08:00
Cheng Chang
1ce105d0ea Disable fsync in DBMergeOperatorTest to save test time (#7640)
Summary:
The test often times out in internal test infra.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7640

Test Plan: watch test to pass internally

Reviewed By: anand1976

Differential Revision: D24764928

Pulled By: cheng-chang

fbshipit-source-id: 587f2afc97f52909837943fd938a86ca94544b2c
2020-11-06 15:24:17 -08:00
Cheng Chang
cdc7ba3a32 DBTablePropertiesTest often times out in internal test infra (#7639)
Summary:
In this test, after flushing the memtable, data is read directly from the SST files, so `env_do_fsync` was `true` to ensure that the flushed SST files could be read afterwards. Considering that the test does not run for long, the data should still be available in the OS buffer even without fsync, so this PR disables fsync to reduce test time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7639

Test Plan: watch the test to pass in internal infra

Reviewed By: anand1976

Differential Revision: D24764689

Pulled By: cheng-chang

fbshipit-source-id: ef827611a3eaca04201e4280ae801d6c8e60c138
2020-11-06 14:25:14 -08:00
Cheng Chang
4c2aef04bd ColumnFamilyTest often times out in internal test infra (#7638)
Summary:
Tries to fix the timeout by skipping fsync.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7638

Test Plan: watch the tests to pass

Reviewed By: jay-zhuang

Differential Revision: D24764355

Pulled By: cheng-chang

fbshipit-source-id: 9c21b177709025ca1943066d94da89324ed47655
2020-11-06 10:25:20 -08:00
Cheng Chang
81543369e5 Disable fsync in db_range_del_test (#7637)
Summary:
This test often times out in internal test infra.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7637

Test Plan: watch test to pass

Reviewed By: ajkr

Differential Revision: D24763939

Pulled By: cheng-chang

fbshipit-source-id: 6564ee2ef637e9faf6688d4b6a5d74a72a51c5e8
2020-11-06 10:25:20 -08:00
Yanqin Jin
b6d8e36741 Compute NeedCompact() after table builder Finish() (#7627)
Summary:
In `BuildTable()`, we call `builder->Finish()` before evaluating `builder->NeedCompact()`.
However, we call `builder->NeedCompact()` before `builder->Finish()` in the compaction job. This can be wrong because the table properties collectors may rely on the success of `Finish()` to provide a correct result for `NeedCompact()`.

Test plan (on devserver):
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7627

Reviewed By: ajkr

Differential Revision: D24728741

Pulled By: riversand963

fbshipit-source-id: 5a0dce244e14eb1106c4f87021e6bebca82b486e
2020-11-04 10:44:56 -08:00
Yanqin Jin
fde0cd7ced Add API to verify whole sst file checksum (#7578)
Summary:
The existing API `VerifyChecksum()` allows the application to verify SST files' block checksums.
Since the whole-file, user-specified checksum is tracked in MANIFEST, we can expose a new
API to verify SST files' whole-file checksums.

```
// Compute table file checksums if applicable and compare with MANIFEST.
// Returns OK if no file has mismatching whole-file checksum.
Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/);
```
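
A usage sketch, under the assumption that a file checksum generator factory was configured when the files were written (the path and keys are illustrative):

```c++
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/file_checksum.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Whole-file checksums are only recorded in MANIFEST when a generator
  // factory is configured.
  options.file_checksum_gen_factory =
      rocksdb::GetFileChecksumGenCrc32cFactory();

  rocksdb::DB* db = nullptr;
  assert(rocksdb::DB::Open(options, "/tmp/file_checksum_db", &db).ok());
  assert(db->Put(rocksdb::WriteOptions(), "k", "v").ok());
  assert(db->Flush(rocksdb::FlushOptions()).ok());

  // Recompute each table file's whole-file checksum and compare with MANIFEST.
  rocksdb::Status s = db->VerifyFileChecksums(rocksdb::ReadOptions());
  assert(s.ok());

  delete db;
  return 0;
}
```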

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D24436783

Pulled By: riversand963

fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3
2020-11-03 20:34:56 -08:00
Yanqin Jin
0b94468bba Avoid skipping a test in db_wal_test (#7628)
Summary:
A recent test report shows that some tests have been skipped.

For DBWALTest, which inherits from DBTestBase, the following will always be
true, since `env_` is an instance of `SpecialEnv`, not `Env::Default()`. Thus the test
will always be skipped.

```
if (options.env != Env::Default()) {
  ROCKSDB_GTEST_SKIP("Test requires default environment");
  return;
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7628

Test Plan:
./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
MEM_ENV=1 ./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
make check

Reviewed By: jay-zhuang

Differential Revision: D24693006

Pulled By: riversand963

fbshipit-source-id: 7f2a772492a0f11bff17bbf5e9f493e9e9a1c125
2020-11-03 09:48:16 -08:00
Jay Zhuang
881e0dcc09 Fix MultiGet unable to query timestamp data issue (#7589)
Summary:
The filter query key should not contain the timestamp. The timestamp is
stripped for Get(), but not for MultiGet().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589

Reviewed By: riversand963

Differential Revision: D24494661

Pulled By: jay-zhuang

fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317
2020-11-03 09:45:41 -08:00
Yanqin Jin
c992eb118b Avoid skipping a test in db_test2 (#7629)
Summary:
The test report shows that this test has been skipped recently due to
a condition that can never be met: `env_` is not equal to
`Env::Default()` for DBTest2, which inherits from DBTestBase.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7629

Test Plan:
make check
./db_test2 --gtest_filter=DBTest2.PinnableSliceAndMmapReads

Reviewed By: jay-zhuang

Differential Revision: D24693317

Pulled By: riversand963

fbshipit-source-id: b1bbd5c1e05a6fa57c1de0d74462b69e3c2d5215
2020-11-02 19:48:23 -08:00
Andrew Kryczka
1adbceb581 Expand effect of dictionary settings in ColumnFamilyOptions::compression_opts (#7619)
Summary:
In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it
for the bottommost level under the assumption that the vast majority of data is
stored there. At that time, there was no
such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just
hardcoded disabling dictionary compression in flushes and in compactions to
non-bottommost levels. Now, we have users who generate all their files
through flush and are considering using dictionary compression.

To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to
additionally include flushed files and files generated by compaction to
a non-bottommost level. Users can still get the old behavior by moving
their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts`
and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`.
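
A sketch of how a user can keep the previous behavior under the new semantics; the compression type and dictionary size are illustrative:

```c++
#include <rocksdb/options.h>

int main() {
  rocksdb::ColumnFamilyOptions cf_opts;
  // Dictionary settings in compression_opts now also apply to flushed files
  // and non-bottommost compaction outputs, so keep them off there and move
  // them to the bottommost options to mimic the old behavior.
  cf_opts.compression_opts.max_dict_bytes = 0;
  cf_opts.bottommost_compression = rocksdb::kZSTD;
  cf_opts.bottommost_compression_opts.enabled = true;
  cf_opts.bottommost_compression_opts.max_dict_bytes = 16 * 1024;
  return 0;
}
```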

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619

Reviewed By: ltamasi

Differential Revision: D24665610

Pulled By: ajkr

fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c
2020-11-02 19:21:11 -08:00
Yanqin Jin
394210f280 Remove unused includes (#7604)
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D24579392

Pulled By: riversand963

fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
2020-10-28 23:22:27 -07:00
Zhichao Cao
ea347d80df Updated GenerateOneFileChecksum to use requested_checksum_func_name (#7586)
Summary:
CreateFileChecksumGenerator may use requested_checksum_func_name in the generator context to decide which generator will be used. GenerateOneFileChecksum has not been updated to use it, so it always gets the generator with an empty name. Fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7586

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D24491989

Pulled By: zhichao-cao

fbshipit-source-id: d9fdfdd431240f0a9a2e781ddbd48a7d6c609aad
2020-10-28 16:47:12 -07:00
darionyaphet
793e9b7f5b Remove duplicate close (#7594)
Summary:
Because `Close()` has already been called in `Destroy()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7594

Reviewed By: zhichao-cao

Differential Revision: D24576407

Pulled By: riversand963

fbshipit-source-id: eba70d73375fd47dd78ca64c6a1fab3628448276
2020-10-28 10:48:53 -07:00
Ramkumar Vadivelu
9a690a74e1 In ParseInternalKey(), include corrupt key info in Status (#7515)
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497

When the allow_data_in_errors db_options flag is set, log error key details in `ParseInternalKey()`.

Most of the call sites have been fixed. A few TODOs are still pending because deeper changes are needed to pass in the allow_data_in_errors flag; those will be done in a separate PR later.

Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test

Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`
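
A minimal sketch of opting in (only appropriate when key contents are not sensitive):

```c++
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  // When set, corrupted-key errors include the raw key bytes instead of
  // '<redacted>'. Keep this off if keys may contain sensitive data.
  options.allow_data_in_errors = true;
  return 0;
}
```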

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515

Reviewed By: ajkr

Differential Revision: D24240264

Pulled By: ramvadiv

fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
2020-10-28 10:12:58 -07:00
mrambacher
f35f7f2704 Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:

1.  The MockFileSystem class was split out from the MockEnv.  This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one).  The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.

2.  Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set.  To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).

3.  Updated the test framework to have a ROCKSDB_GTEST_SKIP macro.  This can be used to flag tests that are skipped.  Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
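
For reference, a sketch of how a test uses the new skip macro; the test name and the env check are hypothetical, and the fallback definition below only mimics the described pre-gtest-1.10 behavior:

```c++
#include <cstdio>
#include <gtest/gtest.h>

// Sketch only: RocksDB defines this macro in its internal test framework.
#ifndef ROCKSDB_GTEST_SKIP
#define ROCKSDB_GTEST_SKIP(m) fprintf(stderr, "SKIPPED: %s\n", (m))
#endif

TEST(ExampleEnvTest, NeedsDefaultEnv) {  // hypothetical test name
  const bool using_custom_env = true;    // stand-in for a real env check
  if (using_custom_env) {
    ROCKSDB_GTEST_SKIP("Test requires default environment");
    return;
  }
  // ... body that only makes sense with Env::Default() ...
}

int main(int argc, char** argv) {
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}
```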

I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV,  both, and neither under both MacOS and RedHat.  A few tests were disabled/skipped for the MEM/ENCRYPTED cases.  The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem.  (I will also push a change to disable those tests soon).  There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.

Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.

Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale.  I do not know how to do that, so if someone could write that job, it would be appreciated :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566

Reviewed By: zhichao-cao

Differential Revision: D24408980

Pulled By: jay-zhuang

fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 10:33:09 -07:00
Yanqin Jin
6134ce6444 Perform post-flush updates of memtable list in a callback (#6069)
Summary:
Currently, the following interleaving of events can lead to a SuperVersion containing both the immutable memtables and the resulting L0 file. This can cause Get to return an incorrect result if there are merge operands. This may also affect other operations such as single deletes.

```
  time  main_thr  bg_flush_thr  bg_compact_thr  compact_thr  set_opts_thr
0  |                                                         WriteManifest:0
1  |                                           issue compact
2  |                                 wait
3  |   Merge(counter)
4  |   issue flush
5  |                   wait
6  |                                                         WriteManifest:1
7  |                                 wake up
8  |                                 write manifest
9  |                  wake up
10 |  Get(counter)
11 |                  remove imm
   V
```

The reason is that one bg flush thread's installation of the new `Version` can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from the current super version only after `LogAndApply` returns. After the leader MANIFEST writer signals (releasing the mutex) this bg flush thread, it is possible that another thread sees this cf with both the memtables (whose data have been flushed to the newest L0) and that L0 before this bg flush thread removes the memtables.

To address this issue, each bg flush thread can pass a callback function to `LogAndApply`. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex.

Test plan (devserver)
```
$make merge_test
$./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush
$make check
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6069

Reviewed By: cheng-chang

Differential Revision: D18790894

Pulled By: riversand963

fbshipit-source-id: e41bd600c0448b4f4b2deb3f7677f95e3076b4ed
2020-10-26 18:23:01 -07:00
Levi Tamasi
a7a04b6898 Integrate BlobFileBuilder into the compaction process (#7573)
Summary:
Similarly to how https://github.com/facebook/rocksdb/issues/7345
integrated blob file writing into the flush process,
the patch adds blob file writing to the compaction logic.
Namely, if `enable_blob_files` is set, large values encountered during
compaction are extracted to blob files and replaced with blob indexes.
The resulting blob files are then logged to the MANIFEST as part of the
compaction job's `VersionEdit` and added to the `Version` alongside any
table files written by the compaction. Any errors during blob file building fail
the compaction job.

There will be a separate follow-up patch to perform blob garbage collection
during compactions.

In addition, the patch continues to chip away at the mess around computing
various compaction related statistics by eliminating some code duplication
and by making the `num_output_files` and `bytes_written` stats more consistent
for flushes, compactions, and recovery.
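
The user-facing knobs are the existing blob options; a minimal sketch (the thresholds below are illustrative):

```c++
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  // With this patch, values >= min_blob_size are extracted to blob files
  // during compaction as well as flush; keys store small blob indexes.
  options.enable_blob_files = true;
  options.min_blob_size = 4 * 1024;
  options.blob_file_size = 256 * 1024 * 1024;
  return 0;
}
```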

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7573

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24404696

Pulled By: ltamasi

fbshipit-source-id: 21216af3a172ad3ce8f85d11cd30923784ae426c
2020-10-26 13:51:55 -07:00
Yanqin Jin
6595267980 Allow compaction iterator to perform garbage collection (#7556)
Summary:
Add a threshold timestamp, full_history_ts_low_ of type `std::string*` to
`CompactionIterator`, so that RocksDB can also perform garbage collection during
compaction.
* If full_history_ts_low_ is nullptr, then compaction iterator does not perform
  GC, preserving all timestamp history for all keys. Compaction iterator will
treat user key with different timestamps as different user keys.
* If full_history_ts_low_ is not nullptr, then compaction iterator performs
  GC. GC will look at keys older than `*full_history_ts_low_` and determine their
  eligibility based on factors including snapshots.

Current rules of GC:
 * If an internal key is in the same snapshot as a previous counterpart
    with the same user key, and this key is eligible for GC, and the key is
    not single-delete or merge operand, then this key can be dropped. Note
    that the previous internal key cannot be a merge operand either.
 * If a tombstone is the most recent one in the earliest snapshot and it
    is eligible for GC, and keyNotExistsBeyondLevel() is true, then this
    tombstone can be dropped.
 * If a tombstone is the most recent one in a snapshot and it is eligible
    for GC, and the compaction is at bottommost level, then all other older
    internal keys of the same user key must also be eligible for GC, thus
    can be dropped
* Single-delete, delete-range and merge are not currently supported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7556

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D24507728

Pulled By: riversand963

fbshipit-source-id: 3c09c7301f41eed76dfcf4d1527e68cf6e0a8bb3
2020-10-23 22:59:46 -07:00
Cheng Chang
1b224324b5 Track WAL in MANIFEST: persist WALs to and recover WALs from MANIFEST (#7256)
Summary:
This PR makes it able to `LogAndApply` `VersionEdit`s related to WALs, and also be able to `Recover` from MANIFEST with WAL related `VersionEdit`s.

The `VersionEdit`s related to WAL are treated similarly as those related to column family operations, they are not applied to versions, but can be in a commit group. Mixing WAL related `VersionEdit`s with other types of edits will make logic in `ProcessManifestWrite` more complicated, so `VersionEdit`s related to WAL can either be WAL additions or deletions, like column family add and drop.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7256

Test Plan: a set of unit tests are added in `version_set_test.cc`

Reviewed By: riversand963

Differential Revision: D23123238

Pulled By: cheng-chang

fbshipit-source-id: 246be2ed4744fd03fa2738aba408aaa611d0379c
2020-10-23 22:49:51 -07:00
Zhichao Cao
d8ec0a760a Make FileType Public and Replace kLogFile with kWalFile (#7580)
Summary:
As suggested by pdillinger, the name kLogFile is misleading: in some tests, kLogFile is treated as the info log. Replace it with kWalFile and make it public, so it can be used in https://github.com/facebook/rocksdb/issues/7523.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D24485420

Pulled By: zhichao-cao

fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
2020-10-22 17:06:20 -07:00