9812 Commits

Author SHA1 Message Date
fanrui03
67d72fb5dc Fix checkpoint stuck (#7921)
Summary:
## 1. Bug description:

When RocksDB creates a checkpoint, it may get stuck in the `WaitUntilFlushWouldNotStallWrites` method.

## 2. Simple analysis of the reasons:

### 2.1 Configuration parameters:

```yaml
Compaction Style : Universal

max_write_buffer_number : 4
min_write_buffer_number_to_merge : 3
```

A checkpoint is usually very fast. When a checkpoint is executed, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 immutable MemTables, which is fewer than `min_write_buffer_number_to_merge`, they will not be flushed, but execution still enters the following branch:

```c++
// method: GetWriteStallConditionAndCause
if (mutable_cf_options.max_write_buffer_number > 3 &&
    num_unflushed_memtables >=
        mutable_cf_options.max_write_buffer_number - 1) {
  return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit};
}
```

code link: fbed72f03c/db/column_family.cc (L847)

The checkpoint assumes a FlushJob is pending, but none has been scheduled, so it waits forever.

### 2.2 solution:

Add a restriction: wait only when the number of immutable MemTables >= `min_write_buffer_number_to_merge`.

If there is a better solution, please correct me.

### 2.3 Code that can reproduce the problem:

https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java

## 3. Interesting point

This bug is triggered only when the number of sorted runs >= `level0_file_num_compaction_trigger`, because `WaitUntilFlushWouldNotStallWrites` otherwise leaves its loop through this `break`:

```c++
if (cfd->imm()->NumNotFlushed() <
        cfd->ioptions()->min_write_buffer_number_to_merge &&
    vstorage->l0_delay_trigger_count() <
        mutable_cf_options.level0_file_num_compaction_trigger) {
  break;
}
```

code link: fbed72f03c/db/db_impl/db_impl_compaction_flush.cc (L1974)

With Universal compaction, `l0_delay_trigger_count() >= level0_file_num_compaction_trigger` can hold, so this bug is triggered.
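The interaction of the two checks can be modelled in a few lines. This is only a simplified sketch: the struct and function names are hypothetical, the `+ 1` for the memtable about to be switched out is an assumption about how the waiter counts unflushed memtables, and the level-0 trigger condition from section 3 is left out.

```cpp
#include <cassert>

// Hypothetical model; only the option names come from the commit message.
struct StallModelOptions {
  int max_write_buffer_number;
  int min_write_buffer_number_to_merge;
};

// Mirrors the memtable-limit branch quoted from GetWriteStallConditionAndCause.
inline bool FlushWouldDelayWrites(const StallModelOptions& o,
                                  int num_unflushed_memtables) {
  return o.max_write_buffer_number > 3 &&
         num_unflushed_memtables >= o.max_write_buffer_number - 1;
}

// A flush is only scheduled once enough immutable memtables have accumulated.
inline bool FlushCanBeScheduled(const StallModelOptions& o, int num_immutable) {
  assert(num_immutable >= 0);
  return num_immutable >= o.min_write_buffer_number_to_merge;
}

// The waiter is stuck when it waits for a flush that is never scheduled.
inline bool WaitsForever(const StallModelOptions& o, int num_immutable) {
  // Assumption: the memtable about to be switched out is counted as unflushed.
  const int num_unflushed = num_immutable + 1;
  return FlushWouldDelayWrites(o, num_unflushed) &&
         !FlushCanBeScheduled(o, num_immutable);
}
```

With `max_write_buffer_number = 4` and `min_write_buffer_number_to_merge = 3`, two immutable memtables satisfy the delay condition but not the flush condition, which is the stuck state described above.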

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7921

Reviewed By: jay-zhuang

Differential Revision: D26900559

Pulled By: ajkr

fbshipit-source-id: 133c1252dad7393753f04a47590b68c7d8e670df
2021-03-09 02:21:25 -08:00
kshair
d2e9eab1ea Fix mis-spelling (#8001)
Summary:
concurrnet -> concurrent

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8001

Reviewed By: ajkr

Differential Revision: D26659381

Pulled By: riversand963

fbshipit-source-id: 890d102d1cf836ed3b183da66d3d56a3158017d0
2021-03-09 01:19:18 -08:00
jsteemann
02974c9437 make PerfStepTimer struct smaller by reordering members (#7931)
Summary:
On x86_64, this makes the struct 8 bytes smaller, so creating a PerfStepTimer on the stack will use slightly less stack space.
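The effect of member ordering on struct size can be illustrated with two hypothetical layouts. The member names below are assumptions for illustration and are not the real `PerfStepTimer` fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout with small members interleaved between 8-byte members.
struct TimerLoose {
  bool enabled;           // followed by padding up to the next 8-byte boundary
  std::uint64_t start;
  bool use_cpu_time;      // followed by padding again
  std::uint64_t* metric;
};

// Same members, largest alignment first, so the flags share one padded tail.
struct TimerTight {
  std::uint64_t start;
  std::uint64_t* metric;
  bool enabled;
  bool use_cpu_time;
};

// Number of bytes saved by the reordered layout on the current platform.
inline std::size_t BytesSavedByReordering() {
  assert(sizeof(TimerTight) <= sizeof(TimerLoose));
  return sizeof(TimerLoose) - sizeof(TimerTight);
}
```

On a typical x86_64 ABI the first layout occupies 32 bytes and the second 24, which matches the 8-byte saving mentioned in the summary.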

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7931

Reviewed By: jay-zhuang

Differential Revision: D26529470

Pulled By: ajkr

fbshipit-source-id: bbe2e843167152ffa05a5946f1add6621c9849f7
2021-03-08 21:33:15 -08:00
Andrew Kryczka
ef392fb04e use LIB_MODE=shared on Travis make commands (#8043)
Summary:
We were seeing intermittent `ld` failures due to `No space left on device` such as https://travis-ci.org/github/facebook/rocksdb/jobs/761905070.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8043

Reviewed By: pdillinger

Differential Revision: D26889711

Pulled By: ajkr

fbshipit-source-id: 010b7617d339bddc30026586bfde41539632fb2d
2021-03-08 17:21:24 -08:00
Andrew Kryczka
0ff0b625a1 Deflake DBTest2.PartitionedIndexUserToInternalKey on ppc64le (#8044)
Summary:
For some reason I still cannot figure out, the manual flush in this test
was sometimes producing a third tiny file. I saw it a bunch of times on
ppc64le, but even running a qemu system with that architecture (and
playing with various other options) could not repro. However we did get
an instrumented Travis run to confirm the problem is indeed a third tiny
file - https://travis-ci.org/github/facebook/rocksdb/jobs/761986592. We
can avoid it by filling memtables less full and using manual flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8044

Reviewed By: akankshamahajan15

Differential Revision: D26892635

Pulled By: ajkr

fbshipit-source-id: 775c04176931cf01d07cc78fb82cfe3a11beebcf
2021-03-08 14:47:56 -08:00
Peter Dillinger
ce391ff84b Clarifying comments for Read() APIs (#8029)
Summary:
I recently discovered the confusing, undocumented semantics of
Read() functions in the FileSystem and Env APIs. I have added
clarification to the best of my reverse-engineered understanding, and
made a note in HISTORY.md for implementors to check their
implementations, as a subtly non-adherent implementation could lead to
RocksDB quietly ignoring some portion of a file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8029

Test Plan: no code changes

Reviewed By: anand1976

Differential Revision: D26831698

Pulled By: pdillinger

fbshipit-source-id: 208f97ff6037bc13bb2ef360b987c2640c79bd03
2021-03-05 14:42:19 -08:00
Levi Tamasi
cb25bc1128 Update compaction statistics to include the amount of data read from blob files (#8022)
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26801199

Pulled By: ltamasi

fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
2021-03-04 00:43:48 -08:00
matthewvon
4126bdc0e1 Feature: add SetBufferSize() so that managed size can be dynamic (#7961)
Summary:
This PR adds SetBufferSize() to the WriteBufferManager object. This enables user code to adjust the global budget for write buffers based on other memory conditions, such as growth in table reader memory as the dataset grows.

The buffer_size_ member variable is now atomic to match the design of other changeable size_t members within WriteBufferManager.

This change is useful as is. However, it is also essential if someone decides to enable db_write_buffer_size modifications through the DB::SetOptions() API, so nothing is wasted by taking it as is.

Any format / spacing changes are due to clang-format as required by check-in automation.
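A minimal sketch of why an atomic budget makes a runtime setter safe is shown below. `MiniWriteBufferManager` and all of its methods are hypothetical stand-ins, not the real `WriteBufferManager` API.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for illustration; not the real WriteBufferManager.
class MiniWriteBufferManager {
 public:
  explicit MiniWriteBufferManager(std::size_t buffer_size)
      : buffer_size_(buffer_size), memory_used_(0) {}

  // In this sketch a budget of 0 means "no limit".
  bool enabled() const { return buffer_size() > 0; }

  std::size_t buffer_size() const {
    return buffer_size_.load(std::memory_order_relaxed);
  }

  // Safe to call while other threads reserve memory: the budget is atomic.
  void SetBufferSize(std::size_t new_size) {
    buffer_size_.store(new_size, std::memory_order_relaxed);
  }

  void ReserveMem(std::size_t bytes) {
    memory_used_.fetch_add(bytes, std::memory_order_relaxed);
  }

  void FreeMem(std::size_t bytes) {
    assert(memory_used_.load(std::memory_order_relaxed) >= bytes);
    memory_used_.fetch_sub(bytes, std::memory_order_relaxed);
  }

  // True once usage exceeds the current budget.
  bool ShouldFlush() const {
    return enabled() &&
           memory_used_.load(std::memory_order_relaxed) > buffer_size();
  }

 private:
  std::atomic<std::size_t> buffer_size_;
  std::atomic<std::size_t> memory_used_;
};
```

Shrinking the budget at runtime immediately changes the outcome of `ShouldFlush()` for memory that was reserved earlier.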

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961

Reviewed By: ajkr

Differential Revision: D26639075

Pulled By: akankshamahajan15

fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033
2021-03-03 14:22:11 -08:00
Yanqin Jin
72d1e258cd Possibly bump NUMBER_OF_RESEEKS_IN_ITERATION (#8015)
Summary:
When changing db iterator direction, we may perform a reseek.
Therefore, we should bump the NUMBER_OF_RESEEKS_IN_ITERATION counter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8015

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D26755415

Pulled By: riversand963

fbshipit-source-id: 211f51f1a454bcda768fc46c0dce51edeb7f05fe
2021-03-02 22:41:04 -08:00
Peter Dillinger
a9046f3c45 Revamp check_format_compatible.sh (#8012)
Summary:
* Adds backup/restore forward/backward compatibility testing
* Adds forward/backward compatibility testing to sst ingestion
* More structure sharing and comments for the lists of branches
comprising each group
* Less reliant on invariants between groups with de-duplication logic
* Restructured for n+1 branch checkout+build steps rather than something
like 3n. Should be much faster despite more checks.

And to make manual runs easier

* On success, restores working trees to original working branch (aborts
early if uncommitted changes) and deletes temporary branch & remote
* Adds SHORT_TEST=1 mode that uses only the oldest version for each
group
* Adds USE_SSH=1 to use ssh instead of https for github

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8012

Test Plan:
a number of manual tests, mostly with SHORT_TEST=1. Using one
version older for any of the groups (except I didn't check
db_backward_only_refs) fails. Changing default format_version to 5
(planned) without updating this script fails as it should, and passes
with appropriate update. Full local run passed (had to remove "2.7.fb.branch"
due to compiler issues, also before this change).

Reviewed By: riversand963

Differential Revision: D26735840

Pulled By: pdillinger

fbshipit-source-id: 1320c22de5674760657e385aa42df9fade8b6fff
2021-03-02 11:42:27 -08:00
Levi Tamasi
a46f080cce Break down the amount of data written during flushes/compactions per file type (#8013)
Summary:
The patch breaks down the "bytes written" (as well as the "number of output files")
compaction statistics into two, so the values are logged separately for table files
and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
This will also come in handy for fixing the write amplification statistics, which currently
do not consider the amount of data read from blob files during compaction. (This will
be fixed by an upcoming patch.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26742156

Pulled By: ltamasi

fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
2021-03-02 09:48:00 -08:00
Akanksha Mahajan
f19612970d Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003)
Summary:
The checkpointing logic supports passing file level checksums
to the copy_file_cb callback function which is used by the backup code
for detecting corruption during file copies.
However, this is currently implemented only for table files.

This PR extends the checksum retrieval to blob files as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003

Test Plan: Add new test units

Reviewed By: ltamasi

Differential Revision: D26680701

Pulled By: akankshamahajan15

fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5
2021-03-01 20:07:07 -08:00
Yanqin Jin
1f11d07f24 Enable compact filter for blob in dbstress and dbbench (#8011)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8011

Test Plan:
```
./db_bench -enable_blob_files=1 -use_keep_filter=1 -disable_auto_compactions=1
./db_stress -enable_blob_files=1 -enable_compaction_filter=1 -acquire_snapshot_one_in=0 -compact_range_one_in=0 -iterpercent=0 -test_batches_snapshots=0 -readpercent=10 -prefixpercent=20 -writepercent=55 -delpercent=15 -continuous_verification_interval=0
```

Reviewed By: ltamasi

Differential Revision: D26736061

Pulled By: riversand963

fbshipit-source-id: 1c7834903c28431ce23324c4f259ed71255614e2
2021-03-01 17:24:47 -08:00
Yanqin Jin
9fdc9fbeea Still use SystemClock* instead of shared_ptr in PerfStepTimer (#8006)
Summary:
This is likely a temp fix before we figure out a better way.

PerfStepTimer is used intensively in certain benchmarks and tests. https://github.com/facebook/rocksdb/issues/7858 stores a `shared_ptr` to the system clock in PerfStepTimer, and that `shared_ptr` is copied each time a `PerfStepTimer` object is created. The atomic operations in `shared_ptr` may add overhead in CPU cycles. Therefore, we change it back to a raw `SystemClock*` for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8006

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D26703560

Pulled By: riversand963

fbshipit-source-id: 519d0769b28da2334bea7d86c848fcc26ee8a17f
2021-02-26 20:57:18 -08:00
Peter Dillinger
a8b3b9a20c Refine Ribbon configuration, improve testing, add Homogeneous (#7879)
Summary:
This change only affects non-schema-critical aspects of the production candidate Ribbon filter. Specifically, it refines choice of internal configuration parameters based on inputs. The changes are minor enough that the schema tests in bloom_test, some of which depend on this, are unaffected. There are also some minor optimizations and refactorings.

This would be a schema change for "smash" Ribbon, to fix some known issues with small filters, but "smash" Ribbon is not accessible in public APIs. Unit test CompactnessAndBacktrackAndFpRate updated to test small and medium-large filters. Run with --thoroughness=100 or so for much better detection power (not appropriate for continuous regression testing).

Homogeneous Ribbon:
This change adds internally a Ribbon filter variant we call Homogeneous Ribbon, in collaboration with Stefan Walzer. The expected "result" value for every key is zero, instead of computed from a hash. Entropy for queries not to be false positives comes from free variables ("overhead") in the solution structure, which are populated pseudorandomly. Construction is slightly faster for not tracking result values, and never fails. Instead, FP rate can jump up whenever and wherever entries are packed too tightly. For small structures, we can choose overhead to make this FP rate jump unlikely, as seen in updated unit test CompactnessAndBacktrackAndFpRate.

Unlike standard Ribbon, Homogeneous Ribbon seems to scale to arbitrary number of keys when accepting an FP rate penalty for small pockets of high FP rate in the structure. For example, 64-bit ribbon with 8 solution columns and 10% allocated space overhead for slots seems to achieve about 10.5% space overhead vs. information-theoretic minimum based on its observed FP rate with expected pockets of degradation. (FP rate is close to 1/256.) If targeting a higher FP rate with fewer solution columns, Homogeneous Ribbon can be even more space efficient, because the penalty from degradation is relatively smaller. If targeting a lower FP rate, Homogeneous Ribbon is less space efficient, as more allocated overhead is needed to keep the FP rate impact of degradation relatively under control. The new OptimizeHomogAtScale tool in ribbon_test helps to find these optimal allocation overheads for different numbers of solution columns and Ribbon widths, with 128-bit Ribbon apparently cutting space overheads in half vs. 64-bit.

Other misc item specifics:
* Ribbon APIs in util/ribbon_config.h now provide configuration data for not just 5% construction failure rate (95% success), but also 50% and 0.1%.
  * Note that the Ribbon structure does not exhibit "threshold" behavior as standard Xor filter does, so there is a roughly fixed space penalty to cut construction failure rate in half. Thus, there isn't really an "almost sure" setting.
  * Although we can extrapolate settings for large filters, we don't have a good formula for configuring smaller filters (< 2^17 slots or so), and efforts to summarize with a formula have failed. Thus, small data is hard-coded from updated FindOccupancy tool.
* Enhances ApproximateNumEntries for public API Ribbon using more precise data (new API GetNumToAdd), thus a more accurate but not perfect reversal of CalculateSpace. (bloom_test updated to expect the greater precision)
* Move EndianSwapValue from coding.h to coding_lean.h to keep Ribbon code easily transferable from RocksDB
* Add some missing 'const' to member functions
* Small optimization to 128-bit BitParity
* Small refactoring of BandingStorage in ribbon_alg.h to support Homogeneous Ribbon
* CompactnessAndBacktrackAndFpRate now has an "expand" test: on construction failure, a possible alternative to re-seeding hash functions is simply to increase the number of slots (allocated space overhead) and try again with essentially the same hash values. (Start locations will be different roundings of the same scaled hash values--because fastrange not mod.) This seems to be as effective or more effective than re-seeding, as long as we increase the number of slots (m) by roughly m += m/w where w is the Ribbon width. This way, there is effectively an expansion by one slot for each ribbon-width window in the banding. (This approach assumes that getting "bad data" from your hash function is as unlikely as it naturally should be, e.g. no adversary.)
* 32-bit and 16-bit Ribbon configurations are added to ribbon_test for understanding their behavior, e.g. with FindOccupancy. They are not considered useful at this time and not tested with CompactnessAndBacktrackAndFpRate.
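
The "expand" retry described in the list above can be sketched with two small helpers. Both function names are hypothetical, the 32-bit hash width is chosen only to keep the example portable, and the real start-location range in Ribbon may differ from the plain slot count used here.

```cpp
#include <cassert>
#include <cstdint>

// Maps a 32-bit hash onto [0, range) by scaling instead of taking a remainder.
inline std::uint32_t FastRange32(std::uint32_t hash, std::uint32_t range) {
  return static_cast<std::uint32_t>(
      (static_cast<std::uint64_t>(hash) * range) >> 32);
}

// Adds one slot per ribbon-width window: m += m / w.
inline std::uint32_t ExpandSlots(std::uint32_t num_slots,
                                 std::uint32_t ribbon_width) {
  assert(ribbon_width > 0);
  return num_slots + num_slots / ribbon_width;
}
```

Because the hash is scaled rather than reduced with a modulo, the same hash value lands at nearly the same relative position after expansion, only rounded differently.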

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7879

Test Plan: unit test updates included

Reviewed By: jay-zhuang

Differential Revision: D26371245

Pulled By: pdillinger

fbshipit-source-id: da6600d90a3785b99ad17a88b2a3027710b4ea3a
2021-02-26 08:50:42 -08:00
Yanqin Jin
c370d8aa12 Remove unused/incorrect fwd declaration (#8002)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8002

Reviewed By: anand1976

Differential Revision: D26659354

Pulled By: riversand963

fbshipit-source-id: 6b464dbea9fd8240ead8cc5af393f0b78e8f9dd1
2021-02-25 23:07:31 -08:00
Yanqin Jin
cef4a6c49f Compaction filter support for (new) BlobDB (#7974)
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.

The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D26509280

Pulled By: riversand963

fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
2021-02-25 16:32:35 -08:00
Akanksha Mahajan
2772eb7735 Update History.md for VerifyFileChecksums API supporting blob file (#7995)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7995

Reviewed By: ltamasi

Differential Revision: D26625766

Pulled By: akankshamahajan15

fbshipit-source-id: d83c9e77695f4193da979b1ce7103b43bc1dd46c
2021-02-24 10:25:03 -08:00
xinyuliu
b085ee13e0 Append all characters not captured by xsputn() in overflow() function (#7991)
Summary:
In the adapter class `WritableFileStringStreamAdapter`, which wraps WritableFile for use with std::ostream, previously only `std::endl` was treated as a special case, because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` calls `sputc()`, and the internal implementation of `sputc()` is
```
int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) {  // put a character
    return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch));
```
As we explicitly disabled buffering, _Pnavail() is always 0. Thus every write, not captured by xsputn, becomes an overflow.

When I ran tests on Windows, I found that not only `std::endl` falls into this case: writing an unsigned long long also calls `os.put()`, followed by `sputc()`, and eventually `overflow()`. Therefore, instead of only checking for `std::endl`, we should append other characters as well, unless the append operation fails.
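
The behaviour can be reproduced with a small unbuffered `std::streambuf`. This is a sketch under the assumption that the sink is a `std::string`; `StringAppendBuf` and `WriteSample` are hypothetical names, not the adapter from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical unbuffered streambuf that appends everything to a std::string.
class StringAppendBuf : public std::streambuf {
 public:
  explicit StringAppendBuf(std::string* sink) : sink_(sink) {
    assert(sink_ != nullptr);
    // No setp() call: the put area stays empty, so sputc() always falls
    // through to overflow().
  }

 protected:
  // Bulk writes arrive here.
  std::streamsize xsputn(const char* s, std::streamsize n) override {
    sink_->append(s, static_cast<std::size_t>(n));
    return n;
  }

  // Single characters written through put()/sputc() arrive here.
  int_type overflow(int_type ch = traits_type::eof()) override {
    if (traits_type::eq_int_type(ch, traits_type::eof())) {
      return traits_type::not_eof(ch);  // nothing to write, report success
    }
    sink_->push_back(traits_type::to_char_type(ch));
    return ch;
  }

 private:
  std::string* sink_;
};

// Writes a string, an integer and std::endl through the adapter.
inline std::string WriteSample() {
  std::string out;
  StringAppendBuf buf(&out);
  std::ostream os(&buf);
  os << "abc" << 42ULL << std::endl;
  return out;
}
```

Whether the integer digits reach `xsputn()` or `overflow()` depends on the standard library implementation; handling every character in `overflow()` makes the result the same either way.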

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991

Reviewed By: jay-zhuang

Differential Revision: D26615692

Pulled By: ajkr

fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468
2021-02-23 21:44:48 -08:00
Akanksha Mahajan
cd79a00903 Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951)
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file. The readahead starts at 8KB and doubles on every
additional read up to BlockBasedTable::kMaxAutoReadAheadSize, which is
256*1024.
This PR adds a new option, BlockBasedTableOptions::max_auto_readahead_size, which
replaces BlockBasedTable::kMaxAutoReadAheadSize and can be
configured.
If max_auto_readahead_size is set to 0, no implicit auto prefetching will
be done. If the max_auto_readahead_size provided is less than
8KB (the initial readahead size used by RocksDB for
auto-readahead), the readahead size will stay equal to max_auto_readahead_size.
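
The sizing rule described above can be sketched as two helpers. The function and constant names are hypothetical; only the 8KB start, the doubling and the configurable cap come from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Initial auto-readahead size named in the description (8KB).
constexpr std::size_t kInitAutoReadaheadSize = 8 * 1024;

// First readahead size: 8KB, or the cap itself when the cap is smaller.
// A cap of 0 yields 0, which stands for "no implicit prefetching".
inline std::size_t InitialReadahead(std::size_t max_auto_readahead_size) {
  return std::min(kInitAutoReadaheadSize, max_auto_readahead_size);
}

// Size after one more read: double the current size, never exceeding the cap.
inline std::size_t NextReadahead(std::size_t current,
                                 std::size_t max_auto_readahead_size) {
  assert(current <= max_auto_readahead_size);
  return std::min(current * 2, max_auto_readahead_size);
}
```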

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951

Test Plan: Add new unit test case.

Reviewed By: anand1976

Differential Revision: D26568085

Pulled By: akankshamahajan15

fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af
2021-02-23 16:54:08 -08:00
sherriiiliu
e017af15c1 Fix testcase failures on windows (#7992)
Summary:
Fixed 5 test case failures found on Windows 10/Windows Server 2016
1. In `flush_job_test`, the DestroyDir function fails in the destructor because some file handles are still being held by VersionSet. This happens on Windows Server 2016, so we need to manually reset the `versions_` pointer to release all file handles.
2. In `StatsHistoryTest.InMemoryStatsHistoryPurging` test, the capping memory cost of stats_history_size on Windows becomes 14000 bytes with latest changes, not just 13000 bytes.
3. In `SSTDumpToolTest.RawOutput` test, the output file handle is not closed at the end.
4. In `FullBloomTest.OptimizeForMemory` test, ROCKSDB_MALLOC_USABLE_SIZE is undefined on windows so `total_mem` is always equal to `total_size`. The internal memory fragmentation assertion does not apply in this case.
5. In `BlockFetcherTest.FetchAndUncompressCompressedDataBlock` test, XPRESS cannot reach an 87.5% compression ratio with the original CreateTable method, so I append extra zeros to the string value to improve the compression ratio. In addition, since XPRESS allocates memory internally and therefore does not support custom allocator verification, we skip the allocator verification for XPRESS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7992

Reviewed By: jay-zhuang

Differential Revision: D26615283

Pulled By: ajkr

fbshipit-source-id: 3632612f84b99e2b9c77c403b112b6bedf3b125d
2021-02-23 14:35:06 -08:00
sherriiiliu
75c6ffb9de Always expose WITH_GFLAGS option to user (#7990)
Summary:
WITH_GFLAGS option does not work on MSVC.

I checked the usage of [CMAKE_DEPENDENT_OPTION](https://cmake.org/cmake/help/latest/module/CMakeDependentOption.html). It says that if the `depends` condition is not true, it sets the `option` to the value given by `force` and hides the option from the user. Therefore, `CMAKE_DEPENDENT_OPTION(WITH_GFLAGS "build with GFlags" ON "NOT MSVC;NOT MINGW" OFF)` hides the WITH_GFLAGS option from the user on MSVC or MINGW and always sets WITH_GFLAGS to OFF. To expose the WITH_GFLAGS option to the user, I removed CMAKE_DEPENDENT_OPTION and split the logic into if-else statements.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7990

Reviewed By: jay-zhuang

Differential Revision: D26615755

Pulled By: ajkr

fbshipit-source-id: 33ca39a73423d9516510c15aaf9efb5c4072cdf9
2021-02-23 14:31:27 -08:00
sherriiiliu
f91fd0c944 Extract test cases correctly in run_ci_db_test.ps1 script (#7989)
Summary:
Extract test cases correctly in run_ci_db_test.ps1 script.

Some new test groups end with # comments. Previously, the regex the script used to extract test groups and test cases did not handle this case, so concatenating some test groups with their test cases failed; see the examples in the comments.

Also removed useless trailing whitespaces in the script.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7989

Reviewed By: jay-zhuang

Differential Revision: D26615909

Pulled By: ajkr

fbshipit-source-id: 8e68d599994f17d6fefde0daa925c3018179521a
2021-02-23 14:25:42 -08:00
Akanksha Mahajan
46cf5fbfdd Extend VerifyFileChecksums API for blob files (#7979)
Summary:
Extend VerifyFileChecksums API to verify blob files in case of
use_file_checksum.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7979

Test Plan: New unit test db_blob_corruption_test

Reviewed By: ltamasi

Differential Revision: D26534040

Pulled By: akankshamahajan15

fbshipit-source-id: 7dc5951a3df9d265ea1265e0122b43c966856ade
2021-02-22 22:09:22 -08:00
Andrew Kryczka
daca92c17a Pick samples for compression dictionary using prime number (#7987)
Summary:
The sample selection technique taken in https://github.com/facebook/rocksdb/issues/7970 was problematic
because it had two code paths for sample selection depending on the
number of data blocks, and one of those code paths involved an
allocation. Using prime numbers, we can consolidate into one code path
without allocation. The downside is there will be values of N (number of
data blocks buffered) that suffer from poor spread in the selected
samples.
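
The single-code-path idea can be sketched as stepping through the buffered blocks with a stride derived from a prime. The function name, the starting index and the value of `kPrime` are illustrative assumptions and not necessarily what RocksDB uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns `num_samples` block indices in [0, num_blocks), stepping by a stride
// derived from a prime. No per-call shuffling buffer is needed besides the
// result itself.
inline std::vector<std::size_t> PickSampleIndices(std::size_t num_blocks,
                                                  std::size_t num_samples) {
  assert(num_blocks > 0);
  const std::uint64_t kPrime = 1000003;  // illustrative prime, an assumption
  const std::size_t stride = static_cast<std::size_t>(kPrime % num_blocks);
  std::vector<std::size_t> picked;
  picked.reserve(num_samples);
  std::size_t idx = num_blocks / 2;  // arbitrary starting point
  for (std::size_t i = 0; i < num_samples; ++i) {
    picked.push_back(idx);
    idx = (idx + stride) % num_blocks;
  }
  return picked;
}
```

As long as `num_blocks` is not a multiple of the prime, the stride is coprime with `num_blocks` and the first `num_blocks` picks are all distinct. The spread can still be poor for some `num_blocks`, for example when the stride works out to 1 and the picks are simply consecutive blocks.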

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7987

Test Plan: `make check -j48`

Reviewed By: pdillinger

Differential Revision: D26586147

Pulled By: ajkr

fbshipit-source-id: 62028e54336fadb6e2c7a7fe6747daa05a263d32
2021-02-22 17:43:03 -08:00
mrambacher
59d91796d2 Attempt to speed up tests by adding test to "slow" tests (#7973)
Summary:
I noticed tests frequently timing out on CircleCI when I submit a PR.  I did some investigation and found the SeqAdvanceConcurrentTest suite (OneWriteQueue, TwoWriteQueues) tests were all taking a long time to complete (30 tests each taking at least 15K ms).

This PR adds those tests to the "slow reg" list in order to move them earlier in the execution sequence so that they are not the "long tail".

For completeness, other tests that were also slow are:
NumLevels/DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest : 12 tests all taking 12K+ ms
ReadSequentialFileTest with ReadaheadSize: 8 tests all 12K+ ms
WriteUnpreparedTransactionTest.RecoveryTest : 2 tests at 22K+ ms
DBBasicTest.EmptyFlush: 1 test at 35K+ ms
RateLimiterTest.Rate: 1 test at 23K+ ms
BackupableDBTest.ShareTableFilesWithChecksumsTransition: 1 test at 16K+ ms
MultiThreadedDBTest.MultiThreaded: 78 tests at 10K+ ms
TransactionStressTest.DeadlockStress: 7 tests at 11K+ ms
DBBasicTestDeadline.IteratorDeadline: 3 tests at 10K+ ms

No effort was made to determine why the tests were slow.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7973

Reviewed By: jay-zhuang

Differential Revision: D26519130

Pulled By: mrambacher

fbshipit-source-id: 11555c9115acc207e45e210a7fc7f879170a3853
2021-02-22 05:27:51 -08:00
Akanksha Mahajan
6790a983eb Fix for ASSERT_STATUS_CHECKED test failure (#7985)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7985

Test Plan: CircleCI ASSERT_STATUS_CHECKED test

Reviewed By: jay-zhuang

Differential Revision: D26568446

Pulled By: akankshamahajan15

fbshipit-source-id: bd0ab41f485942e313d82ce3895ce53e0967ba98
2021-02-20 19:13:55 -08:00
Yanqin Jin
7343eb4a74 Update HISTORY and bump version (#7984)
Summary:
Prepare to cut 6.18.fb branch

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7984

Reviewed By: ajkr

Differential Revision: D26557151

Pulled By: riversand963

fbshipit-source-id: 8c144c807090cdae67e6655e7a17056ce8c50bc0
2021-02-19 19:21:49 -08:00
Andrew Kryczka
d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how many bytes of data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). The limit is not strict, as we currently buffer more than just data blocks: keys are buffered as well. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
Max Neunhoeffer
cf14cb3e29 Avoid self-move-assign in pop operation of binary heap. (#7942)
Summary:
The current implementation of a binary heap in `util/heap.h` does a move-assign in the `pop` method. In the case that there is exactly one element stored in the heap, this ends up being a self-move-assign. This can cause trouble with certain classes, which are not prepared for this. Furthermore, it trips up the glibc STL debugger (`-D_GLIBCXX_DEBUG`), which produces an assertion failure in this case.

This PR addresses this problem by not doing the (unnecessary in this case) move-assign if there is only one element in the heap.
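
The guard can be sketched on a minimal max-heap. `MiniHeap` is a simplified stand-in written for illustration, not the code in `util/heap.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal max-heap for illustration only.
template <typename T>
class MiniHeap {
 public:
  bool empty() const { return data_.empty(); }

  const T& top() const {
    assert(!empty());
    return data_.front();
  }

  void push(T value) {
    data_.push_back(std::move(value));
    std::size_t i = data_.size() - 1;
    while (i > 0) {
      const std::size_t parent = (i - 1) / 2;
      if (!(data_[parent] < data_[i])) break;
      std::swap(data_[parent], data_[i]);
      i = parent;
    }
  }

  void pop() {
    assert(!empty());
    if (data_.size() > 1) {
      // Move only when front and back are different objects; with a single
      // element this would be a self-move-assign.
      data_.front() = std::move(data_.back());
    }
    data_.pop_back();
    sift_down(0);
  }

 private:
  void sift_down(std::size_t i) {
    const std::size_t n = data_.size();
    while (true) {
      std::size_t largest = i;
      const std::size_t left = 2 * i + 1;
      const std::size_t right = 2 * i + 2;
      if (left < n && data_[largest] < data_[left]) largest = left;
      if (right < n && data_[largest] < data_[right]) largest = right;
      if (largest == i) break;
      std::swap(data_[i], data_[largest]);
      i = largest;
    }
  }

  std::vector<T> data_;
};
```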

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7942

Reviewed By: jay-zhuang

Differential Revision: D26528739

Pulled By: ajkr

fbshipit-source-id: 5ca570e0c4168f086b10308ad766dff84e6e2d03
2021-02-19 13:47:25 -08:00
tison
ec76f03168 gitignore cmake-build-* for CLion integration (#7933)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7933

Reviewed By: jay-zhuang

Differential Revision: D26529429

Pulled By: ajkr

fbshipit-source-id: 244344b70b1db161f9b224c25fe690c663264d7d
2021-02-19 13:43:15 -08:00
mrambacher
4bc9df9459 Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936)
Summary:
Added an "only_mutable_options" flag to the ConfigOptions. When set, the Configurable methods will only look at/update options that are marked as kMutable.

Fixed DB::SetOptions to allow for the update of any mutable TableFactory options.  Fixes https://github.com/facebook/rocksdb/issues/7385.

Added tests for the new flag.  Updated HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936

Reviewed By: akankshamahajan15

Differential Revision: D26389646

Pulled By: mrambacher

fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
2021-02-19 10:29:02 -08:00
Zhichao Cao
b0fd1cc45a Introduce a new trace file format (v 0.2) for better extension (#7977)
Summary:
The trace file record and payload encoding is fixed, so any change requires complex handling of backward compatibility. This PR introduces a new trace file format that makes it easier to add new entries to the payload without backward-compatibility issues. V 0.1 is still supported in this PR. Also added tracing of lower_bound and upper_bound for iterators.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977

Test Plan: `make check`. Tested replay and analysis with an old trace file.

Reviewed By: anand1976

Differential Revision: D26529948

Pulled By: zhichao-cao

fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76
2021-02-18 23:05:35 -08:00
Sergei Petrunia
c9878baa87 Fix an assertion failure in range locking, locktree code. (#7938)
Summary:
Fix this scenario:
trx1> acquire shared lock on $key
trx2> acquire shared lock on the same $key
trx1> attempt to acquire a unique lock on $key.

Lock acquisition will fail, and deadlock detection will start.
It will call iterate_and_get_overlapping_row_locks() which will
produce a list with two locks (shared locks by trx1 and trx2).

However the code in lock_request::build_wait_graph() was not prepared
to find the lock by the same transaction in the list of conflicting
locks. Fix it to ignore it.

(One might suggest fixing iterate_and_get_overlapping_row_locks() so that
it does not include locks held by trx1. That is not a good idea, because
the function is also used to report all locks currently held.)
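
The scenario can be pictured with a small sketch. The types and the `CollectBlockers` function below are hypothetical and written only for illustration; they are not the locktree code or `lock_request::build_wait_graph()` itself. The sketch shows the intended behavior: when building the wait list for a requester, locks held by the requester itself are skipped.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical simplified types (assumption: not the locktree structures).
using TxnId = std::uint64_t;

struct RowLock {
  TxnId owner;  // transaction holding the lock
  bool shared;  // true for a shared lock, false for an exclusive one
};

// Returns the transactions that `requester` has to wait for, given all locks
// that overlap the requested key. The requester's own locks are ignored,
// because a transaction cannot wait on itself.
std::vector<TxnId> CollectBlockers(TxnId requester,
                                   const std::vector<RowLock>& overlapping) {
  std::vector<TxnId> blockers;
  for (const RowLock& lock : overlapping) {
    if (lock.owner == requester) {
      continue;  // lock held by the requesting transaction itself
    }
    blockers.push_back(lock.owner);
  }
  return blockers;
}
```

For the scenario above, the overlapping list contains the shared locks of trx1 and trx2; with trx1 as the requester, only trx2 ends up in the wait list.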

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7938

Reviewed By: zhichao-cao

Differential Revision: D26529374

Pulled By: ajkr

fbshipit-source-id: d89cbed008db1a97a8f2351b9bfb75310750d16a
2021-02-18 18:15:19 -08:00
vrqq
ad25b1afb9 Update win_logger.cc : assert failed when return value not checked. (-DROCKSDB_ASSERT_STATUS_CHECKED) (#7955)
Summary:
Ignore the return value in WinLogger::CloseInternal() when building with -DROCKSDB_ASSERT_STATUS_CHECKED on Windows.

Is it acceptable to ignore the check here?

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7955

Reviewed By: jay-zhuang

Differential Revision: D26524145

Pulled By: ajkr

fbshipit-source-id: f2f643e94cde9772617c68b658fb529fffebd8ce
2021-02-18 16:34:10 -08:00
Zaiyang Li
69877ac4f2 c:h export rocksdb_transactiondb_open_column_families (#7967)
Summary:
Hi, I noticed a bug in the RocksDB C API where a function is not exported, and created a fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7967

Reviewed By: jay-zhuang

Differential Revision: D26505722

Pulled By: ajkr

fbshipit-source-id: 05d676dbd59ec87fe32322cda9e39e405b07178d
2021-02-18 15:51:54 -08:00
stefan-zobel
251143f8fb rocksdbjni: Possible NPE in RocksDB.setOptions #7869 (#7909)
Summary:
Fix for https://github.com/facebook/rocksdb/issues/7869

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7909

Reviewed By: akankshamahajan15

Differential Revision: D26181440

Pulled By: ajkr

fbshipit-source-id: f323aec9d91e177fa873599b99801b391cf094b1
2021-02-18 15:48:39 -08:00
Ziyue Yang
0c2d71edba Fix typo: replace readadhead with readahead (#7953)
Summary:
This PR replaces several "readadhead" typos with "readahead".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7953

Reviewed By: ajkr

Differential Revision: D26518903

Pulled By: jay-zhuang

fbshipit-source-id: 6f7dece0e39ec4f71c4a936399bcb2e02574f42a
2021-02-18 14:31:20 -08:00
Wilfried Goesgens
8a05c21e32 add string separation while composing error message (#7919)
Summary:
This will fix a missing string separation between `msg[n]` and `state_`.
Example of how an error message currently looks:
```
IO error: No space left on deviceWhile appending to file: /home/willi/src/stable-3.7/tmp/arangosh_CL6EFQ/shell_client/single1/data/engine-rocksdb/126426.sst: No space left on device
```
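
A minimal sketch of the kind of fix described, under stated assumptions: `ComposeErrorMessage` and its parameters are hypothetical names chosen for illustration, and the `": "` separator is an assumption, not necessarily what the patch uses. It is not RocksDB's `Status` code.

```cpp
#include <string>

// Hypothetical helper (assumption: illustrative only). `type` is the error
// type prefix, `subcode_msg` plays the role of `msg[n]`, and `state` plays
// the role of `state_`.
std::string ComposeErrorMessage(const std::string& type,
                                const std::string& subcode_msg,
                                const std::string& state) {
  std::string result = type;
  result += subcode_msg;
  // Insert a separator only when both parts are present, so the two
  // messages no longer run together.
  if (!subcode_msg.empty() && !state.empty()) {
    result += ": ";
  }
  result += state;
  return result;
}
```

With both parts present, the output reads `...No space left on device: While appending to file...` instead of running the two messages together.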

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7919

Reviewed By: ajkr

Differential Revision: D26242246

Pulled By: jay-zhuang

fbshipit-source-id: 5d9a0997a410aecfb3781478e57395d3d937bb84
2021-02-18 12:25:35 -08:00
Akanksha Mahajan
eacb14a10a Update history.md for bug fix of actual error returned in DB::OpenForReadOnly (#7978)
Summary:
Update history.md for bug fix of actual error returned in DB::OpenForReadOnly

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7978

Reviewed By: jay-zhuang

Differential Revision: D26519195

Pulled By: akankshamahajan15

fbshipit-source-id: 39fd2bcc12ab92a492e8254090b742efa377ed51
2021-02-18 11:42:05 -08:00
Jay Zhuang
59ba104e4a Fix txn MultiGet() return un-committed data with snapshot (#7963)
Summary:
TransactionDB uses a read callback to filter out uncommitted data before
a snapshot. But the `MultiGet()` API did not use it at all, so it could
return unwanted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963

Test Plan: Added unittest to reproduce

Reviewed By: anand1976

Differential Revision: D26455851

Pulled By: jay-zhuang

fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
2021-02-18 08:49:00 -08:00
Akanksha Mahajan
6a85aea5b1 Bug fix for status overridden by Status::NotFound in db_impl_readonly (#7972)
Summary:
Bug fix for the returned status being overridden by Status::NotFound in
DBImpl::OpenForReadOnlyCheckExistence. This was causing some service
owners to misinterpret the actual error and take appropriate steps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7972

Reviewed By: riversand963

Differential Revision: D26499598

Pulled By: akankshamahajan15

fbshipit-source-id: 05e9fedbe2a2e0e53135760f8ff578a2816d2b8e
2021-02-17 19:35:57 -08:00
Levi Tamasi
dab4fe5bcd Add checkpoint support to BlobDB (#7959)
Summary:
The patch adds checkpoint support to BlobDB. Blob files are hard linked or
copied, depending on whether the checkpoint directory is on the same filesystem
or not, similarly to table files.

TODO: Add support for blob files to `ExportColumnFamily` and to the checksum
verification logic used by backup/restore.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7959

Test Plan: Ran `make check` and the crash test for a while.

Reviewed By: riversand963

Differential Revision: D26434768

Pulled By: ltamasi

fbshipit-source-id: 994be55a8dc08133028250760fca440d2c7c4dc5
2021-02-17 12:42:36 -08:00
Levi Tamasi
0743eba0c4 Add support for the integrated BlobDB to db_bench (#7956)
Summary:
The patch adds the configuration options of the new BlobDB implementation
to `db_bench` and adjusts the help messages of the old (`StackableDB`-based)
BlobDB's options to make it clear which implementation they pertain to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7956

Test Plan: Ran `make check` and `db_bench` with the new options.

Reviewed By: jay-zhuang

Differential Revision: D26384808

Pulled By: ltamasi

fbshipit-source-id: b4405bb2c56cfd3506d4c32e3329c08dfdf69c94
2021-02-17 11:10:18 -08:00
Levi Tamasi
ba8008c870 Mention the new BlobDB in HISTORY.md and remove the "under construction" signs (#7969)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7969

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D26467043

Pulled By: ltamasi

fbshipit-source-id: c69a725669d18af6e911743c998e3a1db75948c0
2021-02-16 16:20:22 -08:00
Akanksha Mahajan
ea8bb82fc7 Add support for IOTracing in blob files (#7958)
Summary:
Add support for IOTracing in blob files

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7958

Test Plan:
Added a new test and manually checked that the trace_file records blob
file activity during reads and writes.

Reviewed By: ltamasi

Differential Revision: D26415950

Pulled By: akankshamahajan15

fbshipit-source-id: 49c2859b3a4f8307e7cb69a92704403a4da46d44
2021-02-16 09:49:10 -08:00
Jay Zhuang
9df78a94f1 Disable flaky error_handler_fs_test that could hang (#7964)
Summary:
The test hangs at 95013df278/db/error_handler_fs_test.cc (L947)
It seems db.mutex_ is locked twice in the test:
cf160b98e1/db/db_impl/db_impl_compaction_flush.cc (L3208)
0a9a05ae12/db/db_impl/db_impl.cc (L469)
As this is only a test issue, disable the test for now until it is fixed.

The hang could be reproduced by:
`gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.CompactionWriteFileScopeError -r 1000`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7964

Reviewed By: zhichao-cao

Differential Revision: D26447325

Pulled By: jay-zhuang

fbshipit-source-id: 72f6a346458e059d10e9cc3347bd6bde040cf89e
2021-02-15 09:45:23 -08:00
Jay Zhuang
00519187a6 Update internal build script (#7957)
Summary:
For internal build.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7957

Test Plan: https://www.internalfb.com/intern/sandcastle/group/nonce/6483925578523975/

Reviewed By: ltamasi

Differential Revision: D26410919

Pulled By: jay-zhuang

fbshipit-source-id: a5f9516c91ea85c384a4208aa73331ecad833d01
2021-02-11 14:55:43 -08:00
Zhichao Cao
d1c510baec Handoff checksum Implementation (#7523)
Summary:
In PR https://github.com/facebook/rocksdb/issues/7419, we introduced the new Append and PositionedAppend APIs for WritableFile in the FileSystem layer, which enable RocksDB to pass data verification information (e.g., a checksum of the data) to the lower layer. In this PR, we use the new APIs in WritableFileWriter, so that files created via WritableFileWriter can pass the checksum to the storage layer. To control which file types apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. Users can set this option to choose which file types (currently supported: kLogFile, kTableFile, kDescriptorFile) use the new Append and PositionedAppend APIs to hand off the verification information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523

Test Plan: added a new unit test; passes `make check` and `make asan_check`.

Reviewed By: pdillinger

Differential Revision: D24313271

Pulled By: zhichao-cao

fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
2021-02-10 22:20:32 -08:00
Peter Dillinger
e4f1e64c30 Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889)
Summary:
Adds support for prefetching data in Ribbon queries,
which especially optimizes batched Ribbon queries for MultiGet
(~222ns/key to ~97ns/key) but also single key queries on cold memory
(~333ns to ~226ns) because many queries span more than one cache line.

This required some refactoring of the query algorithm, and there
does not appear to be a noticeable regression in "hot memory" query
times (perhaps from 48ns to 50ns).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889

Test Plan:
existing unit tests, plus performance validation with
filter_bench:

Each data point is the best of two runs. I saturated the machine
CPUs with other filter_bench runs in the background.

Before:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 125.86
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 48.0111
      Batched, prepared net ns/op: 222.384
      Batched, unprepared net ns/op: 343.908
      Skewed 50% in 1% net ns/op: 252.916
      Skewed 80% in 20% net ns/op: 320.579
      Random filter net ns/op: 332.957

After:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 128.117
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 49.8812
      Batched, prepared net ns/op: 97.1514
      Batched, unprepared net ns/op: 222.025
      Skewed 50% in 1% net ns/op: 197.48
      Skewed 80% in 20% net ns/op: 212.457
      Random filter net ns/op: 226.464

Bloom comparison, for reference:

    $ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 35.3042
    Number of filters: 1993
    Total size (MB): 238.488
    Reported total allocated memory (MB): 262.875
    Reported internal fragmentation: 10.2255%
    Bits/key stored: 10.0029
    Prelim FP rate %: 0.965327
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 9.09931
      Batched, prepared net ns/op: 34.21
      Batched, unprepared net ns/op: 88.8564
      Skewed 50% in 1% net ns/op: 139.75
      Skewed 80% in 20% net ns/op: 181.264
      Random filter net ns/op: 173.88

Reviewed By: jay-zhuang

Differential Revision: D26378710

Pulled By: pdillinger

fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69
2021-02-10 21:04:56 -08:00