Commit Graph

893 Commits

Author SHA1 Message Date
Peter Dillinger
ce391ff84b Clarifying comments for Read() APIs (#8029)
Summary:
I recently discovered the confusing, undocumented semantics of
Read() functions in the FileSystem and Env APIs. I have added
clarification to the best of my reverse-engineered understanding, and
made a note in HISTORY.md for implementors to check their
implementations, as a subtly non-adherent implementation could lead to
RocksDB quietly ignoring some portion of a file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8029

Test Plan: no code changes

Reviewed By: anand1976

Differential Revision: D26831698

Pulled By: pdillinger

fbshipit-source-id: 208f97ff6037bc13bb2ef360b987c2640c79bd03
2021-03-05 14:42:19 -08:00
Levi Tamasi
cb25bc1128 Update compaction statistics to include the amount of data read from blob files (#8022)
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26801199

Pulled By: ltamasi

fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
2021-03-04 00:43:48 -08:00
matthewvon
4126bdc0e1 Feature: add SetBufferSize() so that managed size can be dynamic (#7961)
Summary:
This PR adds SetBufferSize() to the WriteBufferManager object.  This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows.

The buffer_size_ member variable is now atomic to match design of other changeable size_t members within WriteBufferManager.

This change is useful as is.  However, this change is also essential if someone decides they wanted to enable db_write_buffer_size modifications through the DB::SetOptions() API, i.e. no waste taking this as is.

Any format / spacing changes are due to clang-format as required by check-in automation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961

Reviewed By: ajkr

Differential Revision: D26639075

Pulled By: akankshamahajan15

fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033
2021-03-03 14:22:11 -08:00
Levi Tamasi
a46f080cce Break down the amount of data written during flushes/compactions per file type (#8013)
Summary:
The patch breaks down the "bytes written" (as well as the "number of output files")
compaction statistics into two, so the values are logged separately for table files
and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
This will also come in handy for fixing the write amplification statistics, which currently
do not consider the amount of data read from blob files during compaction. (This will
be fixed by an upcoming patch.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26742156

Pulled By: ltamasi

fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
2021-03-02 09:48:00 -08:00
Akanksha Mahajan
f19612970d Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003)
Summary:
The checkpointing logic supports passing file level checksums
to the copy_file_cb callback function which is used by the backup code
for detecting corruption during file copies.
However, this is currently implemented only for table files.

This PR extends the checksum retrieval to blob files as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003

Test Plan: Add new test units

Reviewed By: ltamasi

Differential Revision: D26680701

Pulled By: akankshamahajan15

fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5
2021-03-01 20:07:07 -08:00
Yanqin Jin
cef4a6c49f Compaction filter support for (new) BlobDB (#7974)
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.

The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D26509280

Pulled By: riversand963

fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
2021-02-25 16:32:35 -08:00
Akanksha Mahajan
2772eb7735 Update History.md for VerifyFileChecksums API supporting blob file (#7995)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7995

Reviewed By: ltamasi

Differential Revision: D26625766

Pulled By: akankshamahajan15

fbshipit-source-id: d83c9e77695f4193da979b1ce7103b43bc1dd46c
2021-02-24 10:25:03 -08:00
xinyuliu
b085ee13e0 Append all characters not captured by xsputn() in overflow() function (#7991)
Summary:
In the adapter class `WritableFileStringStreamAdapter`, which wraps WritableFile to be used for std::ostream, previouly only `std::endl` is considered a special case because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` will call `sputc()` and if we further check the internal implementation of `sputc()`, we will see it is
```
int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) {  // put a character
    return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch));
```
As we explicitly disabled buffering, _Pnavail() is always 0. Thus every write, not captured by xsputn, becomes an overflow.

When I run tests on Windows, I found not only `std::endl` will drop into this case, writing an unsigned long long will also call `os.put()` then followed by `sputc()` and eventually call `overflow()`. Therefore, instead of only checking `std::endl`, we should try to append other characters as well unless the appending operation fails.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991

Reviewed By: jay-zhuang

Differential Revision: D26615692

Pulled By: ajkr

fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468
2021-02-23 21:44:48 -08:00
Akanksha Mahajan
cd79a00903 Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951)
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file. The readahead starts at 8KB and doubles on every
additional read upto BlockBasedTable::kMaxAutoReadAheadSize which is
256*1024.
This PR adds a new option BlockBasedTableOptions::max_auto_readahead_size which
replaces BlockBasedTable::kMaxAutoReadAheadSize and the new option can be
configured.
If max_auto_readahead_size is set 0 then no implicit auto prefetching will
be done. If max_auto_readahead_size provided is less than
8KB (which is initial readahead size used by rocksdb in case of
auto-readahead), readahead size will remain same as max_auto_readahead_size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951

Test Plan: Add new unit test case.

Reviewed By: anand1976

Differential Revision: D26568085

Pulled By: akankshamahajan15

fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af
2021-02-23 16:54:08 -08:00
Yanqin Jin
7343eb4a74 Update HISTORY and bump version (#7984)
Summary:
Prepare to cut 6.18.fb branch

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7984

Reviewed By: ajkr

Differential Revision: D26557151

Pulled By: riversand963

fbshipit-source-id: 8c144c807090cdae67e6655e7a17056ce8c50bc0
2021-02-19 19:21:49 -08:00
Andrew Kryczka
d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
mrambacher
4bc9df9459 Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936)
Summary:
Added a "only_mutable_options" flag to the ConfigOptions.  When set, the Configurable methods will only look at/update options that are marked as kMutable.

Fixed DB::SetOptions to allow for the update of any mutable TableFactory options.  Fixes https://github.com/facebook/rocksdb/issues/7385.

Added tests for the new flag.  Updated HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936

Reviewed By: akankshamahajan15

Differential Revision: D26389646

Pulled By: mrambacher

fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
2021-02-19 10:29:02 -08:00
Zhichao Cao
b0fd1cc45a Introduce a new trace file format (v 0.2) for better extension (#7977)
Summary:
The trace file record and payload encode is fixed, which requires complex backward compatibility resolving. This PR introduce a new trace file format, which makes it easier to add new entries to the payload and does not have backward compatible issues. V 0.1 is still supported in this PR. Added the tracing for lower_bound and upper_bound for iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977

Test Plan: make check. tested with old trace file in replay and analyzing.

Reviewed By: anand1976

Differential Revision: D26529948

Pulled By: zhichao-cao

fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76
2021-02-18 23:05:35 -08:00
Akanksha Mahajan
eacb14a10a Update history.md for bug fix of actual error returned in DB::OpenForReadOnly (#7978)
Summary:
Update history.md for bug fix of actual error returned in DB::OpenForReadOnly

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7978

Reviewed By: jay-zhuang

Differential Revision: D26519195

Pulled By: akankshamahajan15

fbshipit-source-id: 39fd2bcc12ab92a492e8254090b742efa377ed51
2021-02-18 11:42:05 -08:00
Jay Zhuang
59ba104e4a Fix txn MultiGet() return un-committed data with snapshot (#7963)
Summary:
TransactionDB uses read callback to filter out un-committed data before
a snapshot. But `MultiGet()` API doesn't use that at all, which causes
returning unwanted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963

Test Plan: Added unittest to reproduce

Reviewed By: anand1976

Differential Revision: D26455851

Pulled By: jay-zhuang

fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
2021-02-18 08:49:00 -08:00
Levi Tamasi
ba8008c870 Mention the new BlobDB in HISTORY.md and remove the "under construction" signs (#7969)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7969

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D26467043

Pulled By: ltamasi

fbshipit-source-id: c69a725669d18af6e911743c998e3a1db75948c0
2021-02-16 16:20:22 -08:00
Zhichao Cao
d1c510baec Handoff checksum Implementation (#7523)
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523

Test Plan: add new unit test, pass make check/ make asan_check

Reviewed By: pdillinger

Differential Revision: D24313271

Pulled By: zhichao-cao

fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
2021-02-10 22:20:32 -08:00
Peter Dillinger
e4f1e64c30 Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889)
Summary:
Adds support for prefetching data in Ribbon queries,
which especially optimizes batched Ribbon queries for MultiGet
(~222ns/key to ~97ns/key) but also single key queries on cold memory
(~333ns to ~226ns) because many queries span more than one cache line.

This required some refactoring of the query algorithm, and there
does not appear to be a noticeable regression in "hot memory" query
times (perhaps from 48ns to 50ns).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889

Test Plan:
existing unit tests, plus performance validation with
filter_bench:

Each data point is the best of two runs. I saturated the machine
CPUs with other filter_bench runs in the background.

Before:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 125.86
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 48.0111
      Batched, prepared net ns/op: 222.384
      Batched, unprepared net ns/op: 343.908
      Skewed 50% in 1% net ns/op: 252.916
      Skewed 80% in 20% net ns/op: 320.579
      Random filter net ns/op: 332.957

After:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 128.117
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 49.8812
      Batched, prepared net ns/op: 97.1514
      Batched, unprepared net ns/op: 222.025
      Skewed 50% in 1% net ns/op: 197.48
      Skewed 80% in 20% net ns/op: 212.457
      Random filter net ns/op: 226.464

Bloom comparison, for reference:

    $ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 35.3042
    Number of filters: 1993
    Total size (MB): 238.488
    Reported total allocated memory (MB): 262.875
    Reported internal fragmentation: 10.2255%
    Bits/key stored: 10.0029
    Prelim FP rate %: 0.965327
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 9.09931
      Batched, prepared net ns/op: 34.21
      Batched, unprepared net ns/op: 88.8564
      Skewed 50% in 1% net ns/op: 139.75
      Skewed 80% in 20% net ns/op: 181.264
      Random filter net ns/op: 173.88

Reviewed By: jay-zhuang

Differential Revision: D26378710

Pulled By: pdillinger

fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69
2021-02-10 21:04:56 -08:00
Andrew Kryczka
c16d5a4fda Makefile support to statically link external plugin code (#7918)
Summary:
Added support for detecting plugins linked in the "plugin/" directory and building them from our Makefile in a standardized way. See "plugin/README.md" for details. An example of a plugin that can be built in this way can be found in https://github.com/ajkr/dedupfs.

There will be more to do in terms of making this process more convenient and adding support for CMake.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7918

Test Plan: my own plugin (https://github.com/ajkr/dedupfs) and also heard this patch worked with ZenFS.

Reviewed By: pdillinger

Differential Revision: D26189969

Pulled By: ajkr

fbshipit-source-id: 6624d4357d0ffbaedb42f0d12a3fcb737c78f758
2021-02-10 08:35:34 -08:00
Jay Zhuang
cf160b98e1 Add full_history_ts_low option to compaction (#7884)
Summary:
The full_history_ts_low is used for user-defined timestamp GC
compaction, which is introduced in https://github.com/facebook/rocksdb/issues/7740, https://github.com/facebook/rocksdb/issues/7657 and https://github.com/facebook/rocksdb/issues/7655.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7884

Reviewed By: ltamasi

Differential Revision: D25982553

Pulled By: jay-zhuang

fbshipit-source-id: 36303d412d65b5d8166b6da24fa21ad85adbabee
2021-02-08 13:45:48 -08:00
Levi Tamasi
974458891c Revert "Turn on memtable bloom filter by default. (#6584)" (#7939)
Summary:
This reverts commit ee79a28963.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939

Reviewed By: siying

Differential Revision: D26298564

Pulled By: ltamasi

fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd
2021-02-06 22:34:30 -08:00
Andrew Kryczka
8d2bbdd04f Allow range deletions in *TransactionDB only when safe (#7929)
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.

Fixes https://github.com/facebook/rocksdb/issues/7913.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929

Test Plan: unit tests covering the cases it's permitted/rejected

Reviewed By: ltamasi

Differential Revision: D26240254

Pulled By: ajkr

fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
2021-02-05 15:57:26 -08:00
sdong
ee79a28963 Turn on memtable bloom filter by default. (#6584)
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584

Test Plan: Run all existing tests.

Reviewed By: pdillinger

Differential Revision: D20626739

fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
2021-02-05 12:59:46 -08:00
Andrew Kryczka
78ee8564ad Integrity protection for live updates to WriteBatch (#7748)
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).

The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.

When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748

Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.

Reviewed By: pdillinger

Differential Revision: D25754492

Pulled By: ajkr

fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
2021-01-29 12:18:58 -08:00
mrambacher
0a9a05ae12 Make builds reproducible (#7866)
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035

Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
      - If the GIT branch is "clean" (no changes), the date of the last git commit
      - If the branch is not clean, the current date
 - Added APIs to access the "build information", rather than accessing the strings directly.

The build_version.cc file is now regenerated whenever the library objects are rebuilt.

Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866

Reviewed By: pdillinger

Differential Revision: D26086565

Pulled By: mrambacher

fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
2021-01-28 17:42:16 -08:00
Zhichao Cao
95013df278 Do not set bg error for compaction in retryable IO Error case (#7899)
Summary:
When retryable IO error occurs during compaction, it is mapped to soft error and set the BG error. However, auto resume is not called to clean the soft error since compaction will reschedule by itself. In this change, When retryable IO error occurs during compaction, BG error is not set. User will be informed the error via EventHelper.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899

Test Plan: tested with error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26094097

Pulled By: zhichao-cao

fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f
2021-01-27 17:58:12 -08:00
mrambacher
12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Levi Tamasi
19076c95aa Update HISTORY.md for PR 7888 (#7890)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7890

Reviewed By: ajkr

Differential Revision: D26005509

Pulled By: ltamasi

fbshipit-source-id: e7eb732180d447900788d0e3a17dfd1c3f1e708a
2021-01-21 14:20:10 -08:00
Andrew Kryczka
e18a4df62a workaround race conditions during PeriodicWorkScheduler registration (#7888)
Summary:
This provides a workaround for two race conditions that will be fixed in
a more sophisticated way later. This PR:

(1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe.
(2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888

Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881

Reviewed By: jay-zhuang

Differential Revision: D25990891

Pulled By: ajkr

fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38
2021-01-21 08:50:38 -08:00
Cheng Chang
b0c43e7081 Update HISTORY.md (#7887)
Summary:
Mention the forward compatibility fix for WAL related version edits.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7887

Reviewed By: ltamasi

Differential Revision: D25982494

Pulled By: cheng-chang

fbshipit-source-id: 4be292aa4bf7fbc8a27c0bef1e7a98ad3ea8e1fa
2021-01-20 14:33:59 -08:00
Cheng Chang
928dea0e32 Update HISTORY.md (#7874)
Summary:
I find that the `track_and_verify_wals_in_manifest` option was only removed from 6.15 branch's HISTORY, but still appears under 6.15 in master branch's HISTORY. It should be moved to 6.16 since that's when the feature should be available.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7874

Reviewed By: jay-zhuang

Differential Revision: D25935971

Pulled By: cheng-chang

fbshipit-source-id: fe8bf1ec111597f9207e109aa3be65f8f919f1fd
2021-01-19 16:10:13 -08:00
Levi Tamasi
ffe4906192 Update version to 6.17 (#7871)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7871

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D25932233

Pulled By: ltamasi

fbshipit-source-id: 8b80b0638a4f34f21a27ba80b3eda7d75410b2e8
2021-01-15 18:53:00 -08:00
anand76
8e7b068ecc Make ldb load column family options from OPTIONS file (#7847)
Summary:
When the --try_load_options is used in conjunction with the
--column_family option, ldb incorrectly sets the ColumnFamilyOptions for
that column family to defaults. This PR fixes that by retaining from the
OPTIONS file and applying command line overrides.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7847

Test Plan: Add a unit test in ldb_cmd_test

Reviewed By: ajkr

Differential Revision: D25874720

Pulled By: anand1976

fbshipit-source-id: 04bcf23b55e5a30b5b6a59b0e5cb4faef3da7429
2021-01-11 20:56:34 -08:00
Cheng Chang
fdbebdf484 Add note for PR 7789 in history (#7855)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7855

Reviewed By: ajkr

Differential Revision: D25872797

Pulled By: cheng-chang

fbshipit-source-id: 82159a13f897aaaad5f3c70c7dfa822e073bc623
2021-01-11 13:34:15 -08:00
Adam Retter
4926b33742 Improvements to Env::GetChildren (#7819)
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html

There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.

Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819

Reviewed By: ajkr

Differential Revision: D25837394

Pulled By: jay-zhuang

fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
2021-01-09 09:44:34 -08:00
Akanksha Mahajan
8ed680bdb0 Add new API to report dummy entries size in cache in WriteBufferManager (#7837)
Summary:
Add new API WriteBufferManager::dummy_entries_in_cache_usage() which reports the dummy entries size stored in cache to account for DataBlocks in WriteBufferManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7837

Test Plan: Updated test ./write_buffer_manager_test

Reviewed By: ajkr

Differential Revision: D25794312

Pulled By: akankshamahajan15

fbshipit-source-id: 197f5e8701e3dc57a7df72dab1735624f90daf4b
2021-01-08 13:26:24 -08:00
Zhichao Cao
48c0843e69 Treat File Scope Write IO Error the same as Retryable IO Error (#7840)
Summary:
In RocksDB, when IO error happens, the flags of IOStatus can be set. If the IOStatus is set as "File Scope IO Error", it indicate that the error is constrained in the file level. Since RocksDB does not continues write data to a file when any IO Error happens, File Scope IO Error can be treated the same as Retryable IO Error. Adding the logic to ErrorHandler::SetBGError to include the file scope IO Error in its error handling logic, which is the same as retryable IO Error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7840

Test Plan: added new unit tests in error_handler_fs_test. make check

Reviewed By: anand1976

Differential Revision: D25820481

Pulled By: zhichao-cao

fbshipit-source-id: 69cabd3d010073e064d6142ce1cabf341b8a6806
2021-01-07 16:31:33 -08:00
Adam Retter
6e0f62f2b6 Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715)
Summary:
Third batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_compaction_filter_test
* db_compaction_test
* db_dynamic_level_test
* db_inplace_update_test
* db_sst_test
* db_tailing_iter_test
* db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715

Reviewed By: jay-zhuang

Differential Revision: D25806896

Pulled By: pdillinger

fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
2021-01-06 14:15:02 -08:00
Andrew Kryczka
225abffd8f Verify file checksum generator name (#7824)
Summary:
Previously we only had a debug assertion to check the right generator was being used for verification. However a user hit a problem in production where their factory was creating the wrong generator for some files, leading to checksum mismatches. It would have been easier to debug if we verified in optimized builds that the generator with the proper name is used. This PR adds such verification.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7824

Reviewed By: zhichao-cao

Differential Revision: D25740254

Pulled By: ajkr

fbshipit-source-id: a6231521747605021bad3231484b5d4f99f4044f
2021-01-04 11:51:50 -08:00
Jay Zhuang
a8aeefd0fd Update release version to 6.16 (#7782)
Summary:
Update release version to 6.8

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7782

Reviewed By: siying

Differential Revision: D25648579

Pulled By: jay-zhuang

fbshipit-source-id: c536d606868b95c5fb2ae8f19c17eb259d67bc51
2020-12-19 12:39:21 -08:00
Peter Dillinger
4d1ac19e3d aggregated-table-properties with GetMapProperty (#7779)
Summary:
So that we can more easily get aggregate live table data such
as total filter, index, and data sizes.

Also adds ldb support for getting properties

Also fixed some missing/inaccurate related comments in db.h

For example:

    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
    rocksdb.aggregated-table-properties.data_size: 102871
    rocksdb.aggregated-table-properties.filter_size: 0
    rocksdb.aggregated-table-properties.index_partitions: 0
    rocksdb.aggregated-table-properties.index_size: 2232
    rocksdb.aggregated-table-properties.num_data_blocks: 100
    rocksdb.aggregated-table-properties.num_deletions: 0
    rocksdb.aggregated-table-properties.num_entries: 15000
    rocksdb.aggregated-table-properties.num_merge_operands: 0
    rocksdb.aggregated-table-properties.num_range_deletions: 0
    rocksdb.aggregated-table-properties.raw_key_size: 288890
    rocksdb.aggregated-table-properties.raw_value_size: 198890
    rocksdb.aggregated-table-properties.top_level_index_size: 0
    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
    rocksdb.aggregated-table-properties-at-level1.data_size: 80909
    rocksdb.aggregated-table-properties-at-level1.filter_size: 0
    rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
    rocksdb.aggregated-table-properties-at-level1.index_size: 1787
    rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
    rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
    rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
    rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
    rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
    rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
    $

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779

Test Plan: Added a test to ldb_test.py

Reviewed By: jay-zhuang

Differential Revision: D25653103

Pulled By: pdillinger

fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 08:00:14 -08:00
Peter Dillinger
239d17a19c Support optimize_filters_for_memory for Ribbon filter (#7774)
Summary:
Primarily this change refactors the optimize_filters_for_memory
code for Bloom filters, based on malloc_usable_size, to also work for
Ribbon filters.

This change also replaces the somewhat slow but general
BuiltinFilterBitsBuilder::ApproximateNumEntries with
implementation-specific versions for Ribbon (new) and Legacy Bloom
(based on a recently deleted version). The reason is to emphasize
speed in ApproximateNumEntries rather than 100% accuracy.

Justification: ApproximateNumEntries (formerly CalculateNumEntry) is
only used by RocksDB for range-partitioned filters, called each time we
start to construct one. (In theory, it should be possible to reuse the
estimate, but the abstractions provided by FilterPolicy don't really
make that workable.) But this is only used as a heuristic estimate for
hitting a desired partitioned filter size because of alignment to data
blocks, which have various numbers of unique keys or prefixes. The two
factors lead us to prioritize reasonable speed over 100% accuracy.

optimize_filters_for_memory adds extra complication, because precisely
calculating num_entries for some allowed number of bytes depends on state
with optimize_filters_for_memory enabled. And the allocator-agnostic
implementation of optimize_filters_for_memory, using malloc_usable_size,
means we would have to actually allocate memory, many times, just to
precisely determine how many entries (keys) could be added and stay below
some size budget, for the current state. (In a draft, I got this
working, and then realized the balance of speed vs. accuracy was all
wrong.)

So related to that, I have made CalculateSpace, an internal-only API
only used for testing, non-authoritative also if
optimize_filters_for_memory is enabled. This simplifies some code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7774

Test Plan:
unit test updated, and for FilterSize test, range of tested
values is greatly expanded (still super fast)

Also tested `db_bench -benchmarks=fillrandom,stats -bloom_bits=10 -num=1000000 -partition_index_and_filters -format_version=5 [-optimize_filters_for_memory] [-use_ribbon_filter]` with temporary debug output of generated filter sizes.

Bloom+optimize_filters_for_memory:

      1 Filter size: 197 (224 in memory)
    134 Filter size: 3525 (3584 in memory)
    107 Filter size: 4037 (4096 in memory)
    Total on disk: 904,506
    Total in memory: 918,752

Ribbon+optimize_filters_for_memory:

      1 Filter size: 3061 (3072 in memory)
    110 Filter size: 3573 (3584 in memory)
     58 Filter size: 4085 (4096 in memory)
    Total on disk: 633,021 (-30.0%)
    Total in memory: 634,880 (-30.9%)

Bloom (no offm):

      1 Filter size: 261 (320 in memory)
      1 Filter size: 3333 (3584 in memory)
    240 Filter size: 3717 (4096 in memory)
    Total on disk: 895,674 (-1% on disk vs. +offm; known tolerable overhead of offm)
    Total in memory: 986,944 (+7.4% vs. +offm)

Ribbon (no offm):

      1 Filter size: 2949 (3072 in memory)
      1 Filter size: 3381 (3584 in memory)
    167 Filter size: 3701 (4096 in memory)
    Total on disk: 624,397 (-30.3% vs. Bloom)
    Total in memory: 690,688 (-30.0% vs. Bloom)

Note that optimize_filters_for_memory is even more effective for Ribbon filter than for cache-local Bloom, because it can close the unused memory gap even tighter than Bloom filter, because of 16 byte increments for Ribbon vs. 64 byte increments for Bloom.

Reviewed By: jay-zhuang

Differential Revision: D25592970

Pulled By: pdillinger

fbshipit-source-id: 606fdaa025bb790d7e9c21601e8ea86e10541912
2020-12-18 14:31:03 -08:00
Burton Li
2021392e25 Do not full scan obsolete files on compaction busy (#7739)
Summary:
When ConcurrentTaskLimiter is enabled and there are too many outstanding compactions, BackgroundCompaction returns Status::Busy(), which shouldn't be treat as compaction failure.
This caused performance issue when outstanding compactions reached the limit.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7739

Reviewed By: cheng-chang

Differential Revision: D25508319

Pulled By: ltamasi

fbshipit-source-id: 3b181b16ada0ca3393cfa3a7412985764e79c719
2020-12-15 13:51:10 -08:00
Peter Dillinger
003e72b201 Use size_t for filter APIs, protect against overflow (#7726)
Summary:
Deprecate CalculateNumEntry and replace with
ApproximateNumEntries (better name) using size_t instead of int and
uint32_t, to minimize confusing casts and bad overflow behavior
(possible though probably not realistic). Bloom sizes are now explicitly
capped at max size supported by implementations: just under 4GiB for
fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This
hardening could help to set up for fuzzing.

Also, since RocksDB only uses this information as an approximation
for trying to hit certain sizes for partitioned filters, it's more important
that the function be reasonably fast than for it to be completely
accurate. It's hard enough to be 100% accurate for Ribbon (currently
reversing CalculateSpace) that adding optimize_filters_for_memory
into the mix is just not worth trying to be 100% accurate for num
entries for bytes.

Also:
- Cleaned up filter_policy.h to remove MSVC warning handling and
potentially unsafe use of exception for "not implemented"
- Correct the number of entries limit beyond which current Ribbon
implementation falls back on Bloom instead.
- Consistently use "num_entries" rather than "num_entry"
- Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially
obsolete from general implementation
BuiltinFilterBitsBuilder::CalculateNumEntries.
- Fix filter_bench to skip some tests that don't make sense when only
one or a small number of filters has been generated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726

Test Plan:
expanded existing unit tests for CalculateSpace /
ApproximateNumEntries. Also manually used filter_bench to verify Legacy and
fv=5 Bloom size caps work (much too expensive for unit test). Note that
the actual bits per key is below requested due to space cap.

    $ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate
    ...
    Total size (MB): 511.992
    Bits/key stored: 16.777
    ...
    $ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000
    ...
    Total size (MB): 4096
    Bits/key stored: 17.1799
    ...
    $

Reviewed By: jay-zhuang

Differential Revision: D25239800

Pulled By: pdillinger

fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4
2020-12-11 22:18:12 -08:00
anand76
8a1488efbf Ensure that MultiGet works properly with compressed cache (#7756)
Summary:
Ensure that when direct IO is enabled and a compressed block cache is
configured, MultiGet inserts compressed data blocks into the compressed
block cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7756

Test Plan: Add unit test to db_basic_test

Reviewed By: cheng-chang

Differential Revision: D25416240

Pulled By: anand1976

fbshipit-source-id: 75d57526370c9c0a45ff72651f3278dbd8a9086f
2020-12-09 17:01:13 -08:00
Cheng Chang
70f2e0916a Write min_log_number_to_keep to MANIFEST during atomic flush under 2 phase commit (#7570)
Summary:
When 2 phase commit is enabled, if there are prepared data in a WAL, the WAL should be kept, the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information is not written to MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570

Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds.

Reviewed By: riversand963

Differential Revision: D24394222

Pulled By: cheng-chang

fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116
2020-12-03 19:22:24 -08:00
Jay Zhuang
7fec715db4 Make CompactRange and GetApproximateSizes work with timestamp (#7684)
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D25015421

Pulled By: jay-zhuang

fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
2020-12-02 13:00:53 -08:00
Jay Zhuang
9e1640403a Exclude timestamp from prefix extractor (#7668)
Summary:
Timestamp should not be included in prefix extractor, as we discussed here: https://github.com/facebook/rocksdb/pull/7589#discussion_r511068586

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7668

Test Plan: added unittest

Reviewed By: riversand963

Differential Revision: D24966265

Pulled By: jay-zhuang

fbshipit-source-id: 0dae618c333d4b7942a40d556535a1795e060aea
2020-12-01 14:07:15 -08:00
Andrew Kryczka
eb65d673fe Fix kPointInTimeRecovery handling of truncated WAL (#7701)
Summary:
WAL may be truncated to an incomplete record due to crash while writing
the last record or corruption. In the former case, no hole will be
produced since no ACK'd data was lost. In the latter case, a hole could
be produced without this PR since we proceeded to recover the next WAL
as if nothing happened. This PR changes the record reading code to
always report a corruption for incomplete records in
`kPointInTimeRecovery` mode, and the upper layer will only ignore them
if the next WAL has consecutive seqnum (i.e., we are guaranteed no
hole).

While this solves the hole problem for the case of incomplete
records, the possibility is still there if the WAL is corrupted by
truncation to an exact record boundary. This PR also regresses how much data
can be recovered when writes are mixed with/without
`WriteOptions::disableWAL`, as then we can not distinguish between a
seqnum gap caused by corruption and a seqnum gap caused by a `disableWAL` write.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7701

Test Plan:
Interestingly there already was a test for this case
(`DBWALTestWithParams.kPointInTimeRecovery`); it just had a typo bug in
the verification that prevented it from noticing holes in recovery.

Reviewed By: anand1976

Differential Revision: D25111765

Pulled By: ajkr

fbshipit-source-id: 5e330b13b1ee2b5be096cea9d0ff6075843e57b6
2020-11-30 18:11:38 -08:00
Cheng Chang
5c585e1908 Ship the track WAL in MANIFEST feature (#7689)
Summary:
Updates the option description and HISTORY.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7689

Test Plan: N/A

Reviewed By: zhichao-cao

Differential Revision: D25056238

Pulled By: cheng-chang

fbshipit-source-id: 6af1ef6f8dcf2173cbc0fccadc0e06cefd92bcae
2020-11-19 14:45:54 -08:00