Commit Graph

1011 Commits

Author SHA1 Message Date
sherriiiliu
e6534900bd Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125)
Summary:
In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_
and bg_flush_scheduled_ to drop to 0. Unschedule is called prior
to cancel any unscheduled flushes/compactions. It is assumed that
anything in the high priority is a flush, and anything in the low
priority pool is a compaction. This assumption, however, is broken when
the high-pri pool is full.
As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_
will remain > 0 and DB can be in hang state.
The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_`
inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB
`mutex_` will make the counts atomic in `Unschedule`.
Related discussion: https://github.com/facebook/rocksdb/issues/7928

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125

Test Plan: Added new test case which hangs without the fix.

Reviewed By: jay-zhuang

Differential Revision: D27390043

Pulled By: ajkr

fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110
2021-03-30 18:35:20 -07:00
Peter Dillinger
ec11c23caa Add thread safety to BackupEngine, explain more (#8115)
Summary:
BackupEngine previously had unclear but strict concurrency
requirements that the API user must follow for safe use. Now we make
that clear, by separating operations into "Read," "Append," and "Write"
operations, and specifying which combinations are safe across threads on
the same BackupEngine object (previously none; now all, using a
read-write lock), and which are safe across different BackupEngine
instances open on the same backup_dir.

The changes to backupable_db.h should be backward compatible. It is
mostly about eliminating copies of what should be the same function and
(unsurprisingly) useful documentation comments were often placed on
only one of the two copies. With the re-organization, we are also
grouping different categories of operations. In the future we might add
BackupEngineReadAppendOnly, but that didn't seem necessary.

To mark API Read operations 'const', I had to mark some implementation
functions 'const' and some fields mutable.

Functional changes:
* Added RWMutex locking around public API functions to implement thread
safety on a single object. To avoid future bugs, this is another
internal class layered on top (removing many "override" in
BackupEngineImpl). It would be possible to allow more concurrency
between operations, rather than mutual exclusion, but IMHO not worth the
work.
* Fixed a race between Open() (Initialize()) and CreateNewBackup() for
different objects on the same backup_dir, where Initialize() could
delete the temporary meta file created during CreateNewBackup().
(This was found by the new test.)

Also cleaned up a couple of "status checked" TODOs, and improved a
checksum mismatch error message to include involved files.

Potential follow-up work:
* CreateNewBackup has an API wart because it doesn't tell you the
BackupID it just created, which makes it of limited use in a multithreaded
setting.
* We could also consider a Refresh() function to catch up to
changes made from another BackupEngine object to the same dir.
* Use a lock file to prevent multiple writer BackupEngines, but this
won't work on remote filesystems not supporting lock files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8115

Test Plan:
new mini-stress test in backup unit tests, run with gcc,
clang, ASC, TSAN, and UBSAN, 100 iterations each.

Reviewed By: ajkr

Differential Revision: D27347589

Pulled By: pdillinger

fbshipit-source-id: 28d82ed2ac672e44085a739ddb19d297dad14b15
2021-03-29 22:41:51 -07:00
Jay Zhuang
a037bb35e9 Compaction should not move data to up level (#8116)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8116

Reviewed By: ajkr, mrambacher

Differential Revision: D27353828

Pulled By: jay-zhuang

fbshipit-source-id: 42703fb01b04d92cc097d7979e64798448852e88
2021-03-29 17:10:42 -07:00
wolfkdy
63748c2204 On ARM platform, use yield op to relax CPU. See issue 7376 (#7438)
Summary:
see https://github.com/facebook/rocksdb/issues/7376.
The `wfe` op on ARM platform is not suitable to relax CPU. Use `yield` op.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7438

Reviewed By: riversand963

Differential Revision: D24063427

Pulled By: jay-zhuang

fbshipit-source-id: b0ebc5590d7555bd21b30f15cd59f84dc006367a
2021-03-26 18:13:24 -07:00
Andrew Kryczka
c20a7cd6c7 Apply sample_for_compression to all block-based tables (#8105)
Summary:
Previously it only applied to block-based tables generated by flush. This restriction
was undocumented and blocked a new use case. Now compression sampling
applies to all block-based tables we generate when it is enabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105

Test Plan: new unit test

Reviewed By: riversand963

Differential Revision: D27317275

Pulled By: ajkr

fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0
2021-03-25 15:00:45 -07:00
Jay Zhuang
45c65d6dcf Use thread-safe strerror_r() to get error message (#8087)
Summary:
`strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from 0deef031cb/folly/String.cpp (L457)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087

Reviewed By: mrambacher

Differential Revision: D27267151

Pulled By: jay-zhuang

fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb
2021-03-24 23:07:27 -07:00
Zhichao Cao
7457c7cd00 Update release version to 6.19 (#8083)
Summary:
Update release version to 6.19

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8083

Test Plan: no code change

Reviewed By: riversand963

Differential Revision: D27222083

Pulled By: zhichao-cao

fbshipit-source-id: 94b49997019347e6e6a9e341837f4f9d3149428c
2021-03-21 18:33:46 -07:00
Zhichao Cao
dd0447ae2c Add new Append API with DataVerificationInfo to Env WritableFile (#8071)
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071

Test Plan: make check, added new unit test.

Reviewed By: anand1976

Differential Revision: D27177043

Pulled By: zhichao-cao

fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
2021-03-19 11:44:13 -07:00
Zhichao Cao
c810947184 Separate handling of WAL Sync io error with SST flush io error (#8049)
Summary:
In previous codebase, if WAL is used, all the retryable IO Error will be treated as hard error. So write is stalled. In this PR, the retryable IO error from WAL sync is separated from SST file flush io error. If WAL Sync is ok and retryable IO Error only happens during SST flush, the error is mapped to soft error. So user can continue insert to Memtable and append to WAL.

Resolve the bug that if WAL sync fails, the memtable status does not roll back due to calling PickMemtable early than calling and checking SyncClosedLog.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049

Test Plan: added new unit test, make check

Reviewed By: anand1976

Differential Revision: D26965529

Pulled By: zhichao-cao

fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410
2021-03-18 14:33:16 -07:00
Peter Dillinger
e7a60d01b2 Revamp WriteController (#8064)
Summary:
WriteController had a number of issues:
* It could introduce a delay of 1ms even if the write rate never exceeded the
configured delayed_write_rate.
* The DB-wide delayed_write_rate could be exceeded in a number of ways
with multiple column families:
  * Wiping all pending delay "debts" when another column family joins
  the delay with GetDelayToken().
  * Resetting last_refill_time_ to (now + sleep amount) means each
  column family can write with delayed_write_rate for large writes.
  * Updating bytes_left_ for a partial refill without updating
  last_refill_time_ would essentially give out random bonuses,
  especially to medium-sized writes.

Now the code is much simpler, with these issues fixed. See comments in
the new code and new (replacement) tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064

Test Plan: new tests, better than old tests

Reviewed By: mrambacher

Differential Revision: D27064936

Pulled By: pdillinger

fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f
2021-03-18 09:47:31 -07:00
Zhichao Cao
08ec5e7321 Add the statistics and info log for Error handler (#8050)
Summary:
Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume sucess; Histogram for auto resume retry count in each recovery call.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050

Test Plan: make check and add test to error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26990565

Pulled By: zhichao-cao

fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
2021-03-17 22:38:13 -07:00
Akanksha Mahajan
27d57a035e Use SST file manager to track blob files as well (#8037)
Summary:
Extend support to track blob files in SST File manager.
 This PR notifies SstFileManager whenever a new blob file is created,
 via OnAddFile and  an obsolete blob file deleted via OnDeleteFile
 and delete file via ScheduleFileDeletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037

Test Plan: Add new unit tests

Reviewed By: ltamasi

Differential Revision: D26891237

Pulled By: akankshamahajan15

fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
2021-03-17 20:44:49 -07:00
Mark Callaghan
326670d265 Add new db_bench --benchmarks options for controlling compaction (#8027)
Summary:
The new options are:
* compact0 - compact L0 into L1 using one thread
* compact1 - compact L1 into L2 using one thread
* flush - flush memtable
* waitforcompaction - wait for compaction to finish

These are useful for reproducible benchmarks to help get the LSM tree shape
into a deterministic state. I wrote about this at:
http://smalldatum.blogspot.com/2021/02/read-only-benchmarks-with-lsm-are.html

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8027

Reviewed By: riversand963

Differential Revision: D27053861

Pulled By: ajkr

fbshipit-source-id: 1646f35584a3db03740fbeb47d91c3f00fb35d6e
2021-03-17 09:12:27 -07:00
Peter Dillinger
01c2ec3fcb Add ROCKSDB_GTEST_BYPASS (#8048)
Summary:
This is for cases that do not meet the Facebook criteria for
SKIP (see new comments). Also made ROCKSDB_GTEST_{SKIP,BYPASS} print the
message because gtest doesn't ever seem to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8048

Test Plan: manual inspection of ./ribbon_test output, CI

Reviewed By: mrambacher

Differential Revision: D26953688

Pulled By: pdillinger

fbshipit-source-id: c914eaffe7d419db6ab90a193d474531e23582e5
2021-03-12 16:02:06 -08:00
Peter Dillinger
589ea6bec2 Add BackupEngine API for backup file details (#8042)
Summary:
This API can be used for things like determining how much space
can be freed up by deleting a particular backup, etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042

Test Plan:
validation of the API added to many existing backup unit
tests

Reviewed By: mrambacher

Differential Revision: D26936577

Pulled By: pdillinger

fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a
2021-03-12 11:03:54 -08:00
Yanqin Jin
82b3888433 Enable backward iterator for keys with user-defined timestamp (#8035)
Summary:
This PR does the following:

- Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet.
- Introduces a new helper API `Comparator::EqualWithoutTimestamp()`.
- Fix a typo in `SetTimestamp()`.
- Add/update unit tests

Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps.

1. Disable timestamp, run:
```
./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Results:
> Baseline
> - seekrandom [AVG    6 runs] : 96115 ops/sec;   53.2 MB/sec
> - seekrandom [MEDIAN 6 runs] : 98075 ops/sec;   54.2 MB/sec
>
> This PR
> - seekrandom [AVG    6 runs] : 95521 ops/sec;   52.8 MB/sec
> - seekrandom [MEDIAN 6 runs] : 96338 ops/sec;   53.3 MB/sec

2. Enable timestamp, run:
```
./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Result:
> Baseline: not supported
>
> This PR
> - seekrandom [AVG    6 runs] : 90514 ops/sec;   50.1 MB/sec
> - seekrandom [MEDIAN 6 runs] : 90834 ops/sec;   50.2 MB/sec

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035

Reviewed By: ltamasi

Differential Revision: D26926668

Pulled By: riversand963

fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5
2021-03-10 11:15:46 -08:00
Peter Dillinger
847ca9f964 Make default share_files_with_checksum=true (#8020)
Summary:
New comment for share_files_with_checksum:
// Only used if share_table_files is set to true. Setting to false is
// DEPRECATED and potentially dangerous because in that case BackupEngine
// can lose data if backing up databases with distinct or divergent
// history, for example if restoring from a backup other than the latest,
// writing to the DB, and creating another backup. Setting to true (default)
// prevents these issues by ensuring that different table files (SSTs) with
// the same number are treated as distinct. See
// share_files_with_checksum_naming and ShareFilesNaming.

I have also removed interim option kFlagMatchInterimNaming, which is no
longer needed and was never needed for correct+compatible operation
(just performance).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020

Test Plan:
tests updated. Backward+forward compatibility verified with
SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup
options, and I manually verified shared_checksum in
/tmp/rocksdb_format_compatible_peterd/bak/current/ after run.

Reviewed By: ajkr

Differential Revision: D26786331

Pulled By: pdillinger

fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130
2021-03-09 16:27:13 -08:00
Peter Dillinger
0028e3398b Make format_version=5 new default (#8017)
Summary:
Haven't seen any production issues with new Bloom filter and
it's now > 1 year old (added in 6.6.0).

Updated check_format_compatible.sh and HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017

Test Plan: tests updated (or prior bugs fixed)

Reviewed By: ajkr

Differential Revision: D26762197

Pulled By: pdillinger

fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2
2021-03-09 12:42:53 -08:00
Peter Dillinger
ce391ff84b Clarifying comments for Read() APIs (#8029)
Summary:
I recently discovered the confusing, undocumented semantics of
Read() functions in the FileSystem and Env APIs. I have added
clarification to the best of my reverse-engineered understanding, and
made a note in HISTORY.md for implementors to check their
implementations, as a subtly non-adherent implementation could lead to
RocksDB quietly ignoring some portion of a file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8029

Test Plan: no code changes

Reviewed By: anand1976

Differential Revision: D26831698

Pulled By: pdillinger

fbshipit-source-id: 208f97ff6037bc13bb2ef360b987c2640c79bd03
2021-03-05 14:42:19 -08:00
Levi Tamasi
cb25bc1128 Update compaction statistics to include the amount of data read from blob files (#8022)
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26801199

Pulled By: ltamasi

fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
2021-03-04 00:43:48 -08:00
matthewvon
4126bdc0e1 Feature: add SetBufferSize() so that managed size can be dynamic (#7961)
Summary:
This PR adds SetBufferSize() to the WriteBufferManager object.  This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows.

The buffer_size_ member variable is now atomic to match design of other changeable size_t members within WriteBufferManager.

This change is useful as is.  However, this change is also essential if someone decides they wanted to enable db_write_buffer_size modifications through the DB::SetOptions() API, i.e. no waste taking this as is.

Any format / spacing changes are due to clang-format as required by check-in automation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961

Reviewed By: ajkr

Differential Revision: D26639075

Pulled By: akankshamahajan15

fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033
2021-03-03 14:22:11 -08:00
Levi Tamasi
a46f080cce Break down the amount of data written during flushes/compactions per file type (#8013)
Summary:
The patch breaks down the "bytes written" (as well as the "number of output files")
compaction statistics into two, so the values are logged separately for table files
and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
This will also come in handy for fixing the write amplification statistics, which currently
do not consider the amount of data read from blob files during compaction. (This will
be fixed by an upcoming patch.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26742156

Pulled By: ltamasi

fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
2021-03-02 09:48:00 -08:00
Akanksha Mahajan
f19612970d Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003)
Summary:
The checkpointing logic supports passing file level checksums
to the copy_file_cb callback function which is used by the backup code
for detecting corruption during file copies.
However, this is currently implemented only for table files.

This PR extends the checksum retrieval to blob files as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003

Test Plan: Add new test units

Reviewed By: ltamasi

Differential Revision: D26680701

Pulled By: akankshamahajan15

fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5
2021-03-01 20:07:07 -08:00
Yanqin Jin
cef4a6c49f Compaction filter support for (new) BlobDB (#7974)
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.

The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D26509280

Pulled By: riversand963

fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
2021-02-25 16:32:35 -08:00
Akanksha Mahajan
2772eb7735 Update History.md for VerifyFileChecksums API supporting blob file (#7995)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7995

Reviewed By: ltamasi

Differential Revision: D26625766

Pulled By: akankshamahajan15

fbshipit-source-id: d83c9e77695f4193da979b1ce7103b43bc1dd46c
2021-02-24 10:25:03 -08:00
xinyuliu
b085ee13e0 Append all characters not captured by xsputn() in overflow() function (#7991)
Summary:
In the adapter class `WritableFileStringStreamAdapter`, which wraps WritableFile to be used for std::ostream, previouly only `std::endl` is considered a special case because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` will call `sputc()` and if we further check the internal implementation of `sputc()`, we will see it is
```
int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) {  // put a character
    return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch));
```
As we explicitly disabled buffering, _Pnavail() is always 0. Thus every write, not captured by xsputn, becomes an overflow.

When I run tests on Windows, I found not only `std::endl` will drop into this case, writing an unsigned long long will also call `os.put()` then followed by `sputc()` and eventually call `overflow()`. Therefore, instead of only checking `std::endl`, we should try to append other characters as well unless the appending operation fails.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991

Reviewed By: jay-zhuang

Differential Revision: D26615692

Pulled By: ajkr

fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468
2021-02-23 21:44:48 -08:00
Akanksha Mahajan
cd79a00903 Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951)
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file. The readahead starts at 8KB and doubles on every
additional read upto BlockBasedTable::kMaxAutoReadAheadSize which is
256*1024.
This PR adds a new option BlockBasedTableOptions::max_auto_readahead_size which
replaces BlockBasedTable::kMaxAutoReadAheadSize and the new option can be
configured.
If max_auto_readahead_size is set 0 then no implicit auto prefetching will
be done. If max_auto_readahead_size provided is less than
8KB (which is initial readahead size used by rocksdb in case of
auto-readahead), readahead size will remain same as max_auto_readahead_size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951

Test Plan: Add new unit test case.

Reviewed By: anand1976

Differential Revision: D26568085

Pulled By: akankshamahajan15

fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af
2021-02-23 16:54:08 -08:00
Yanqin Jin
7343eb4a74 Update HISTORY and bump version (#7984)
Summary:
Prepare to cut 6.18.fb branch

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7984

Reviewed By: ajkr

Differential Revision: D26557151

Pulled By: riversand963

fbshipit-source-id: 8c144c807090cdae67e6655e7a17056ce8c50bc0
2021-02-19 19:21:49 -08:00
Andrew Kryczka
d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
mrambacher
4bc9df9459 Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936)
Summary:
Added a "only_mutable_options" flag to the ConfigOptions.  When set, the Configurable methods will only look at/update options that are marked as kMutable.

Fixed DB::SetOptions to allow for the update of any mutable TableFactory options.  Fixes https://github.com/facebook/rocksdb/issues/7385.

Added tests for the new flag.  Updated HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936

Reviewed By: akankshamahajan15

Differential Revision: D26389646

Pulled By: mrambacher

fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
2021-02-19 10:29:02 -08:00
Zhichao Cao
b0fd1cc45a Introduce a new trace file format (v 0.2) for better extension (#7977)
Summary:
The trace file record and payload encode is fixed, which requires complex backward compatibility resolving. This PR introduce a new trace file format, which makes it easier to add new entries to the payload and does not have backward compatible issues. V 0.1 is still supported in this PR. Added the tracing for lower_bound and upper_bound for iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977

Test Plan: make check. tested with old trace file in replay and analyzing.

Reviewed By: anand1976

Differential Revision: D26529948

Pulled By: zhichao-cao

fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76
2021-02-18 23:05:35 -08:00
Akanksha Mahajan
eacb14a10a Update history.md for bug fix of actual error returned in DB::OpenForReadOnly (#7978)
Summary:
Update history.md for bug fix of actual error returned in DB::OpenForReadOnly

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7978

Reviewed By: jay-zhuang

Differential Revision: D26519195

Pulled By: akankshamahajan15

fbshipit-source-id: 39fd2bcc12ab92a492e8254090b742efa377ed51
2021-02-18 11:42:05 -08:00
Jay Zhuang
59ba104e4a Fix txn MultiGet() return un-committed data with snapshot (#7963)
Summary:
TransactionDB uses read callback to filter out un-committed data before
a snapshot. But `MultiGet()` API doesn't use that at all, which causes
returning unwanted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963

Test Plan: Added unittest to reproduce

Reviewed By: anand1976

Differential Revision: D26455851

Pulled By: jay-zhuang

fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
2021-02-18 08:49:00 -08:00
Levi Tamasi
ba8008c870 Mention the new BlobDB in HISTORY.md and remove the "under construction" signs (#7969)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7969

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D26467043

Pulled By: ltamasi

fbshipit-source-id: c69a725669d18af6e911743c998e3a1db75948c0
2021-02-16 16:20:22 -08:00
Zhichao Cao
d1c510baec Handoff checksum Implementation (#7523)
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523

Test Plan: add new unit test, pass make check/ make asan_check

Reviewed By: pdillinger

Differential Revision: D24313271

Pulled By: zhichao-cao

fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
2021-02-10 22:20:32 -08:00
Peter Dillinger
e4f1e64c30 Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889)
Summary:
Adds support for prefetching data in Ribbon queries,
which especially optimizes batched Ribbon queries for MultiGet
(~222ns/key to ~97ns/key) but also single key queries on cold memory
(~333ns to ~226ns) because many queries span more than one cache line.

This required some refactoring of the query algorithm, and there
does not appear to be a noticeable regression in "hot memory" query
times (perhaps from 48ns to 50ns).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889

Test Plan:
existing unit tests, plus performance validation with
filter_bench:

Each data point is the best of two runs. I saturated the machine
CPUs with other filter_bench runs in the background.

Before:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 125.86
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 48.0111
      Batched, prepared net ns/op: 222.384
      Batched, unprepared net ns/op: 343.908
      Skewed 50% in 1% net ns/op: 252.916
      Skewed 80% in 20% net ns/op: 320.579
      Random filter net ns/op: 332.957

After:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 128.117
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 49.8812
      Batched, prepared net ns/op: 97.1514
      Batched, unprepared net ns/op: 222.025
      Skewed 50% in 1% net ns/op: 197.48
      Skewed 80% in 20% net ns/op: 212.457
      Random filter net ns/op: 226.464

Bloom comparison, for reference:

    $ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 35.3042
    Number of filters: 1993
    Total size (MB): 238.488
    Reported total allocated memory (MB): 262.875
    Reported internal fragmentation: 10.2255%
    Bits/key stored: 10.0029
    Prelim FP rate %: 0.965327
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 9.09931
      Batched, prepared net ns/op: 34.21
      Batched, unprepared net ns/op: 88.8564
      Skewed 50% in 1% net ns/op: 139.75
      Skewed 80% in 20% net ns/op: 181.264
      Random filter net ns/op: 173.88

Reviewed By: jay-zhuang

Differential Revision: D26378710

Pulled By: pdillinger

fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69
2021-02-10 21:04:56 -08:00
Andrew Kryczka
c16d5a4fda Makefile support to statically link external plugin code (#7918)
Summary:
Added support for detecting plugins linked in the "plugin/" directory and building them from our Makefile in a standardized way. See "plugin/README.md" for details. An example of a plugin that can be built in this way can be found in https://github.com/ajkr/dedupfs.

There will be more to do in terms of making this process more convenient and adding support for CMake.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7918

Test Plan: my own plugin (https://github.com/ajkr/dedupfs) and also heard this patch worked with ZenFS.

Reviewed By: pdillinger

Differential Revision: D26189969

Pulled By: ajkr

fbshipit-source-id: 6624d4357d0ffbaedb42f0d12a3fcb737c78f758
2021-02-10 08:35:34 -08:00
Jay Zhuang
cf160b98e1 Add full_history_ts_low option to compaction (#7884)
Summary:
The full_history_ts_low is used for user-defined timestamp GC
compaction, which is introduced in https://github.com/facebook/rocksdb/issues/7740, https://github.com/facebook/rocksdb/issues/7657 and https://github.com/facebook/rocksdb/issues/7655.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7884

Reviewed By: ltamasi

Differential Revision: D25982553

Pulled By: jay-zhuang

fbshipit-source-id: 36303d412d65b5d8166b6da24fa21ad85adbabee
2021-02-08 13:45:48 -08:00
Levi Tamasi
974458891c Revert "Turn on memtable bloom filter by default. (#6584)" (#7939)
Summary:
This reverts commit ee79a28963.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939

Reviewed By: siying

Differential Revision: D26298564

Pulled By: ltamasi

fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd
2021-02-06 22:34:30 -08:00
Andrew Kryczka
8d2bbdd04f Allow range deletions in *TransactionDB only when safe (#7929)
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.

Fixes https://github.com/facebook/rocksdb/issues/7913.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929

Test Plan: unit tests covering the cases it's permitted/rejected

Reviewed By: ltamasi

Differential Revision: D26240254

Pulled By: ajkr

fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
2021-02-05 15:57:26 -08:00
sdong
ee79a28963 Turn on memtable bloom filter by default. (#6584)
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584

Test Plan: Run all existing tests.

Reviewed By: pdillinger

Differential Revision: D20626739

fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
2021-02-05 12:59:46 -08:00
Andrew Kryczka
78ee8564ad Integrity protection for live updates to WriteBatch (#7748)
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).

The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.

When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748

Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.

Reviewed By: pdillinger

Differential Revision: D25754492

Pulled By: ajkr

fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
2021-01-29 12:18:58 -08:00
mrambacher
0a9a05ae12 Make builds reproducible (#7866)
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035

Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
      - If the GIT branch is "clean" (no changes), the date of the last git commit
      - If the branch is not clean, the current date
 - Added APIs to access the "build information", rather than accessing the strings directly.

The build_version.cc file is now regenerated whenever the library objects are rebuilt.

Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866

Reviewed By: pdillinger

Differential Revision: D26086565

Pulled By: mrambacher

fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
2021-01-28 17:42:16 -08:00
Zhichao Cao
95013df278 Do not set bg error for compaction in retryable IO Error case (#7899)
Summary:
When retryable IO error occurs during compaction, it is mapped to soft error and set the BG error. However, auto resume is not called to clean the soft error since compaction will reschedule by itself. In this change, When retryable IO error occurs during compaction, BG error is not set. User will be informed the error via EventHelper.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899

Test Plan: tested with error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26094097

Pulled By: zhichao-cao

fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f
2021-01-27 17:58:12 -08:00
mrambacher
12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Levi Tamasi
19076c95aa Update HISTORY.md for PR 7888 (#7890)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7890

Reviewed By: ajkr

Differential Revision: D26005509

Pulled By: ltamasi

fbshipit-source-id: e7eb732180d447900788d0e3a17dfd1c3f1e708a
2021-01-21 14:20:10 -08:00
Andrew Kryczka
e18a4df62a workaround race conditions during PeriodicWorkScheduler registration (#7888)
Summary:
This provides a workaround for two race conditions that will be fixed in
a more sophisticated way later. This PR:

(1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe.
(2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888

Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881

Reviewed By: jay-zhuang

Differential Revision: D25990891

Pulled By: ajkr

fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38
2021-01-21 08:50:38 -08:00
Cheng Chang
b0c43e7081 Update HISTORY.md (#7887)
Summary:
Mention the forward compatibility fix for WAL related version edits.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7887

Reviewed By: ltamasi

Differential Revision: D25982494

Pulled By: cheng-chang

fbshipit-source-id: 4be292aa4bf7fbc8a27c0bef1e7a98ad3ea8e1fa
2021-01-20 14:33:59 -08:00
Cheng Chang
928dea0e32 Update HISTORY.md (#7874)
Summary:
I find that the `track_and_verify_wals_in_manifest` option was only removed from 6.15 branch's HISTORY, but still appears under 6.15 in master branch's HISTORY. It should be moved to 6.16 since that's when the feature should be available.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7874

Reviewed By: jay-zhuang

Differential Revision: D25935971

Pulled By: cheng-chang

fbshipit-source-id: fe8bf1ec111597f9207e109aa3be65f8f919f1fd
2021-01-19 16:10:13 -08:00
Levi Tamasi
ffe4906192 Update version to 6.17 (#7871)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7871

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D25932233

Pulled By: ltamasi

fbshipit-source-id: 8b80b0638a4f34f21a27ba80b3eda7d75410b2e8
2021-01-15 18:53:00 -08:00
anand76
8e7b068ecc Make ldb load column family options from OPTIONS file (#7847)
Summary:
When the --try_load_options is used in conjunction with the
--column_family option, ldb incorrectly sets the ColumnFamilyOptions for
that column family to defaults. This PR fixes that by retaining from the
OPTIONS file and applying command line overrides.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7847

Test Plan: Add a unit test in ldb_cmd_test

Reviewed By: ajkr

Differential Revision: D25874720

Pulled By: anand1976

fbshipit-source-id: 04bcf23b55e5a30b5b6a59b0e5cb4faef3da7429
2021-01-11 20:56:34 -08:00
Cheng Chang
fdbebdf484 Add note for PR 7789 in history (#7855)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7855

Reviewed By: ajkr

Differential Revision: D25872797

Pulled By: cheng-chang

fbshipit-source-id: 82159a13f897aaaad5f3c70c7dfa822e073bc623
2021-01-11 13:34:15 -08:00
Adam Retter
4926b33742 Improvements to Env::GetChildren (#7819)
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html

There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.

Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819

Reviewed By: ajkr

Differential Revision: D25837394

Pulled By: jay-zhuang

fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
2021-01-09 09:44:34 -08:00
Akanksha Mahajan
8ed680bdb0 Add new API to report dummy entries size in cache in WriteBufferManager (#7837)
Summary:
Add new API WriteBufferManager::dummy_entries_in_cache_usage() which reports the dummy entries size stored in cache to account for DataBlocks in WriteBufferManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7837

Test Plan: Updated test ./write_buffer_manager_test

Reviewed By: ajkr

Differential Revision: D25794312

Pulled By: akankshamahajan15

fbshipit-source-id: 197f5e8701e3dc57a7df72dab1735624f90daf4b
2021-01-08 13:26:24 -08:00
Zhichao Cao
48c0843e69 Treat File Scope Write IO Error the same as Retryable IO Error (#7840)
Summary:
In RocksDB, when IO error happens, the flags of IOStatus can be set. If the IOStatus is set as "File Scope IO Error", it indicate that the error is constrained in the file level. Since RocksDB does not continues write data to a file when any IO Error happens, File Scope IO Error can be treated the same as Retryable IO Error. Adding the logic to ErrorHandler::SetBGError to include the file scope IO Error in its error handling logic, which is the same as retryable IO Error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7840

Test Plan: added new unit tests in error_handler_fs_test. make check

Reviewed By: anand1976

Differential Revision: D25820481

Pulled By: zhichao-cao

fbshipit-source-id: 69cabd3d010073e064d6142ce1cabf341b8a6806
2021-01-07 16:31:33 -08:00
Adam Retter
6e0f62f2b6 Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715)
Summary:
Third batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_compaction_filter_test
* db_compaction_test
* db_dynamic_level_test
* db_inplace_update_test
* db_sst_test
* db_tailing_iter_test
* db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715

Reviewed By: jay-zhuang

Differential Revision: D25806896

Pulled By: pdillinger

fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
2021-01-06 14:15:02 -08:00
Andrew Kryczka
225abffd8f Verify file checksum generator name (#7824)
Summary:
Previously we only had a debug assertion to check the right generator was being used for verification. However a user hit a problem in production where their factory was creating the wrong generator for some files, leading to checksum mismatches. It would have been easier to debug if we verified in optimized builds that the generator with the proper name is used. This PR adds such verification.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7824

Reviewed By: zhichao-cao

Differential Revision: D25740254

Pulled By: ajkr

fbshipit-source-id: a6231521747605021bad3231484b5d4f99f4044f
2021-01-04 11:51:50 -08:00
Jay Zhuang
a8aeefd0fd Update release version to 6.16 (#7782)
Summary:
Update release version to 6.8

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7782

Reviewed By: siying

Differential Revision: D25648579

Pulled By: jay-zhuang

fbshipit-source-id: c536d606868b95c5fb2ae8f19c17eb259d67bc51
2020-12-19 12:39:21 -08:00
Peter Dillinger
4d1ac19e3d aggregated-table-properties with GetMapProperty (#7779)
Summary:
So that we can more easily get aggregate live table data such
as total filter, index, and data sizes.

Also adds ldb support for getting properties

Also fixed some missing/inaccurate related comments in db.h

For example:

    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
    rocksdb.aggregated-table-properties.data_size: 102871
    rocksdb.aggregated-table-properties.filter_size: 0
    rocksdb.aggregated-table-properties.index_partitions: 0
    rocksdb.aggregated-table-properties.index_size: 2232
    rocksdb.aggregated-table-properties.num_data_blocks: 100
    rocksdb.aggregated-table-properties.num_deletions: 0
    rocksdb.aggregated-table-properties.num_entries: 15000
    rocksdb.aggregated-table-properties.num_merge_operands: 0
    rocksdb.aggregated-table-properties.num_range_deletions: 0
    rocksdb.aggregated-table-properties.raw_key_size: 288890
    rocksdb.aggregated-table-properties.raw_value_size: 198890
    rocksdb.aggregated-table-properties.top_level_index_size: 0
    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
    rocksdb.aggregated-table-properties-at-level1.data_size: 80909
    rocksdb.aggregated-table-properties-at-level1.filter_size: 0
    rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
    rocksdb.aggregated-table-properties-at-level1.index_size: 1787
    rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
    rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
    rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
    rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
    rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
    rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
    $

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779

Test Plan: Added a test to ldb_test.py

Reviewed By: jay-zhuang

Differential Revision: D25653103

Pulled By: pdillinger

fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 08:00:14 -08:00
Peter Dillinger
239d17a19c Support optimize_filters_for_memory for Ribbon filter (#7774)
Summary:
Primarily this change refactors the optimize_filters_for_memory
code for Bloom filters, based on malloc_usable_size, to also work for
Ribbon filters.

This change also replaces the somewhat slow but general
BuiltinFilterBitsBuilder::ApproximateNumEntries with
implementation-specific versions for Ribbon (new) and Legacy Bloom
(based on a recently deleted version). The reason is to emphasize
speed in ApproximateNumEntries rather than 100% accuracy.

Justification: ApproximateNumEntries (formerly CalculateNumEntry) is
only used by RocksDB for range-partitioned filters, called each time we
start to construct one. (In theory, it should be possible to reuse the
estimate, but the abstractions provided by FilterPolicy don't really
make that workable.) But this is only used as a heuristic estimate for
hitting a desired partitioned filter size because of alignment to data
blocks, which have various numbers of unique keys or prefixes. The two
factors lead us to prioritize reasonable speed over 100% accuracy.

optimize_filters_for_memory adds extra complication, because precisely
calculating num_entries for some allowed number of bytes depends on state
with optimize_filters_for_memory enabled. And the allocator-agnostic
implementation of optimize_filters_for_memory, using malloc_usable_size,
means we would have to actually allocate memory, many times, just to
precisely determine how many entries (keys) could be added and stay below
some size budget, for the current state. (In a draft, I got this
working, and then realized the balance of speed vs. accuracy was all
wrong.)

So related to that, I have made CalculateSpace, an internal-only API
only used for testing, non-authoritative also if
optimize_filters_for_memory is enabled. This simplifies some code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7774

Test Plan:
unit test updated, and for FilterSize test, range of tested
values is greatly expanded (still super fast)

Also tested `db_bench -benchmarks=fillrandom,stats -bloom_bits=10 -num=1000000 -partition_index_and_filters -format_version=5 [-optimize_filters_for_memory] [-use_ribbon_filter]` with temporary debug output of generated filter sizes.

Bloom+optimize_filters_for_memory:

      1 Filter size: 197 (224 in memory)
    134 Filter size: 3525 (3584 in memory)
    107 Filter size: 4037 (4096 in memory)
    Total on disk: 904,506
    Total in memory: 918,752

Ribbon+optimize_filters_for_memory:

      1 Filter size: 3061 (3072 in memory)
    110 Filter size: 3573 (3584 in memory)
     58 Filter size: 4085 (4096 in memory)
    Total on disk: 633,021 (-30.0%)
    Total in memory: 634,880 (-30.9%)

Bloom (no offm):

      1 Filter size: 261 (320 in memory)
      1 Filter size: 3333 (3584 in memory)
    240 Filter size: 3717 (4096 in memory)
    Total on disk: 895,674 (-1% on disk vs. +offm; known tolerable overhead of offm)
    Total in memory: 986,944 (+7.4% vs. +offm)

Ribbon (no offm):

      1 Filter size: 2949 (3072 in memory)
      1 Filter size: 3381 (3584 in memory)
    167 Filter size: 3701 (4096 in memory)
    Total on disk: 624,397 (-30.3% vs. Bloom)
    Total in memory: 690,688 (-30.0% vs. Bloom)

Note that optimize_filters_for_memory is even more effective for Ribbon filter than for cache-local Bloom, because it can close the unused memory gap even tighter than Bloom filter, because of 16 byte increments for Ribbon vs. 64 byte increments for Bloom.

Reviewed By: jay-zhuang

Differential Revision: D25592970

Pulled By: pdillinger

fbshipit-source-id: 606fdaa025bb790d7e9c21601e8ea86e10541912
2020-12-18 14:31:03 -08:00
Burton Li
2021392e25 Do not full scan obsolete files on compaction busy (#7739)
Summary:
When ConcurrentTaskLimiter is enabled and there are too many outstanding compactions, BackgroundCompaction returns Status::Busy(), which shouldn't be treat as compaction failure.
This caused performance issue when outstanding compactions reached the limit.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7739

Reviewed By: cheng-chang

Differential Revision: D25508319

Pulled By: ltamasi

fbshipit-source-id: 3b181b16ada0ca3393cfa3a7412985764e79c719
2020-12-15 13:51:10 -08:00
Peter Dillinger
003e72b201 Use size_t for filter APIs, protect against overflow (#7726)
Summary:
Deprecate CalculateNumEntry and replace with
ApproximateNumEntries (better name) using size_t instead of int and
uint32_t, to minimize confusing casts and bad overflow behavior
(possible though probably not realistic). Bloom sizes are now explicitly
capped at max size supported by implementations: just under 4GiB for
fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This
hardening could help to set up for fuzzing.

Also, since RocksDB only uses this information as an approximation
for trying to hit certain sizes for partitioned filters, it's more important
that the function be reasonably fast than for it to be completely
accurate. It's hard enough to be 100% accurate for Ribbon (currently
reversing CalculateSpace) that adding optimize_filters_for_memory
into the mix is just not worth trying to be 100% accurate for num
entries for bytes.

Also:
- Cleaned up filter_policy.h to remove MSVC warning handling and
potentially unsafe use of exception for "not implemented"
- Correct the number of entries limit beyond which current Ribbon
implementation falls back on Bloom instead.
- Consistently use "num_entries" rather than "num_entry"
- Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially
obsolete from general implementation
BuiltinFilterBitsBuilder::CalculateNumEntries.
- Fix filter_bench to skip some tests that don't make sense when only
one or a small number of filters has been generated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726

Test Plan:
expanded existing unit tests for CalculateSpace /
ApproximateNumEntries. Also manually used filter_bench to verify Legacy and
fv=5 Bloom size caps work (much too expensive for unit test). Note that
the actual bits per key is below requested due to space cap.

    $ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate
    ...
    Total size (MB): 511.992
    Bits/key stored: 16.777
    ...
    $ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000
    ...
    Total size (MB): 4096
    Bits/key stored: 17.1799
    ...
    $

Reviewed By: jay-zhuang

Differential Revision: D25239800

Pulled By: pdillinger

fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4
2020-12-11 22:18:12 -08:00
anand76
8a1488efbf Ensure that MultiGet works properly with compressed cache (#7756)
Summary:
Ensure that when direct IO is enabled and a compressed block cache is
configured, MultiGet inserts compressed data blocks into the compressed
block cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7756

Test Plan: Add unit test to db_basic_test

Reviewed By: cheng-chang

Differential Revision: D25416240

Pulled By: anand1976

fbshipit-source-id: 75d57526370c9c0a45ff72651f3278dbd8a9086f
2020-12-09 17:01:13 -08:00
Cheng Chang
70f2e0916a Write min_log_number_to_keep to MANIFEST during atomic flush under 2 phase commit (#7570)
Summary:
When 2 phase commit is enabled, if there are prepared data in a WAL, the WAL should be kept, the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information is not written to MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570

Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds.

Reviewed By: riversand963

Differential Revision: D24394222

Pulled By: cheng-chang

fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116
2020-12-03 19:22:24 -08:00
Jay Zhuang
7fec715db4 Make CompactRange and GetApproximateSizes work with timestamp (#7684)
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D25015421

Pulled By: jay-zhuang

fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
2020-12-02 13:00:53 -08:00
Jay Zhuang
9e1640403a Exclude timestamp from prefix extractor (#7668)
Summary:
Timestamp should not be included in prefix extractor, as we discussed here: https://github.com/facebook/rocksdb/pull/7589#discussion_r511068586

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7668

Test Plan: added unittest

Reviewed By: riversand963

Differential Revision: D24966265

Pulled By: jay-zhuang

fbshipit-source-id: 0dae618c333d4b7942a40d556535a1795e060aea
2020-12-01 14:07:15 -08:00
Andrew Kryczka
eb65d673fe Fix kPointInTimeRecovery handling of truncated WAL (#7701)
Summary:
WAL may be truncated to an incomplete record due to crash while writing
the last record or corruption. In the former case, no hole will be
produced since no ACK'd data was lost. In the latter case, a hole could
be produced without this PR since we proceeded to recover the next WAL
as if nothing happened. This PR changes the record reading code to
always report a corruption for incomplete records in
`kPointInTimeRecovery` mode, and the upper layer will only ignore them
if the next WAL has consecutive seqnum (i.e., we are guaranteed no
hole).

While this solves the hole problem for the case of incomplete
records, the possibility is still there if the WAL is corrupted by
truncation to an exact record boundary. This PR also regresses how much data
can be recovered when writes are mixed with/without
`WriteOptions::disableWAL`, as then we can not distinguish between a
seqnum gap caused by corruption and a seqnum gap caused by a `disableWAL` write.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7701

Test Plan:
Interestingly there already was a test for this case
(`DBWALTestWithParams.kPointInTimeRecovery`); it just had a typo bug in
the verification that prevented it from noticing holes in recovery.

Reviewed By: anand1976

Differential Revision: D25111765

Pulled By: ajkr

fbshipit-source-id: 5e330b13b1ee2b5be096cea9d0ff6075843e57b6
2020-11-30 18:11:38 -08:00
Cheng Chang
5c585e1908 Ship the track WAL in MANIFEST feature (#7689)
Summary:
Updates the option description and HISTORY.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7689

Test Plan: N/A

Reviewed By: zhichao-cao

Differential Revision: D25056238

Pulled By: cheng-chang

fbshipit-source-id: 6af1ef6f8dcf2173cbc0fccadc0e06cefd92bcae
2020-11-19 14:45:54 -08:00
Cheng Chang
8a97f35619 Call out a bug in HISTORY (#7690)
Summary:
It's worth mentioning the corner case bug fixed in PR 7621.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7690

Test Plan: N/A

Reviewed By: zhichao-cao

Differential Revision: D25056678

Pulled By: cheng-chang

fbshipit-source-id: 1ab42ec080f3ffe21f5d97acf65ee0af993112ba
2020-11-18 14:54:22 -08:00
Yanqin Jin
84a700819e Fix the logic of setting read_amp_bytes_per_bit from OPTIONS file (#7680)
Summary:
Instead of using `EncodeFixed32` which always serialize a integer to
little endian, we should use the local machine's endianness when
populating a native data structure during options parsing.
Without this fix, `read_amp_bytes_per_bit` may be populated incorrectly
on big-endian machines.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7680

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D24999166

Pulled By: riversand963

fbshipit-source-id: dc603cff6e17f8fa32479ce6df93b93082e6b0c4
2020-11-17 00:44:30 -08:00
Andrew Kryczka
1c5f13f2a5 Fail early when merge_operator not configured (#7667)
Summary:
An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667

Reviewed By: riversand963

Differential Revision: D24933360

Pulled By: ajkr

fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f
2020-11-16 20:39:01 -08:00
Ramkumar Vadivelu
5bd1258381 Update release history to 6.15 (#7673)
Summary:
Update release history to 6.15

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7673

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D24971069

Pulled By: ramvadiv

fbshipit-source-id: 5cb3f5cbc1b19beb580ea8095acdef72cc092905
2020-11-15 12:37:24 -08:00
Peter Dillinger
60af964372 Experimental (production candidate) SST schema for Ribbon filter (#7658)
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)

Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.

### Benchmarking

```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 18.7508
  Random filter net ns/op: 258.246
    Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 58.4523
  Random filter net ns/op: 363.717
    Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```

168.166 / 238.488 = 0.705  -> 29.5% space reduction

130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction)

### Working around a hashing "flaw"

bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate.  The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant.  Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))

As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)

TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.

### Other related changes

* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658

Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.

Reviewed By: jay-zhuang

Differential Revision: D24899349

Pulled By: pdillinger

fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
2020-11-12 20:46:14 -08:00
Yanqin Jin
2400cd69e3 Update HISTORY.md for PR6069 (#7663)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7663

Reviewed By: ajkr

Differential Revision: D24913081

Pulled By: riversand963

fbshipit-source-id: 704f427812f2b4f92e16d6cbc93be64d730d1cf9
2020-11-12 08:38:41 -08:00
Andrew Kryczka
ec346da98c Always apply bottommost_compression_opts when enabled (#7633)
Summary:
Previously, even when `bottommost_compression_opts`'s `enabled` flag was set, it only took effect when
`bottommost_compression` was also set to something other than `kDisableCompressionOption`.
This wasn't documented and, if we kept the old behavior, it'd make
things complicated like the migration instructions in https://github.com/facebook/rocksdb/issues/7619. We can
simplify the API by making `bottommost_compression_opts` always take
effect when its `enabled` flag is set.

Fixes https://github.com/facebook/rocksdb/issues/7631.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7633

Reviewed By: ltamasi

Differential Revision: D24710358

Pulled By: ajkr

fbshipit-source-id: bbbdf9c1b53c63a4239d902cc3f5a11da1874647
2020-11-11 20:32:28 -08:00
mrambacher
c442f6809f Create a Customizable class to load classes and configurations (#6590)
Summary:
The Customizable class is an extension of the Configurable class and allows instances to be created by a name/ID.  Classes that extend customizable can define their Type (e.g. "TableFactory", "Cache") and  a method to instantiate them (TableFactory::CreateFromString).  Customizable objects can be registered with the ObjectRegistry and created dynamically.

Future PRs will make more types of objects extend Customizable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6590

Reviewed By: cheng-chang

Differential Revision: D24841553

Pulled By: zhichao-cao

fbshipit-source-id: d0c2132bd932e971cbfe2c908ca2e5db30c5e155
2020-11-11 15:10:41 -08:00
Jay Zhuang
18aee7db7e Fix a seek issue with prefix extractor and timestamp (#7644)
Summary:
During seek, prefix compare should not include timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7644

Test Plan: added unittest

Reviewed By: riversand963

Differential Revision: D24772066

Pulled By: jay-zhuang

fbshipit-source-id: 3982655a8bf8da256a738e8497b73b3d9bdac92e
2020-11-10 14:53:13 -08:00
Yanqin Jin
b6d8e36741 Compute NeedCompact() after table builder Finish() (#7627)
Summary:
In `BuildTable()`, we call `builder->Finish()` before evaluating `builder->NeedCompact()`.
However, we call `builder->NeedCompact()` before `builder->Finish()` in compaction job. This can be wrong because the table properties collectors may rely on the success of `Finish()` to provide correct result for `NeedCompact()`.

Test plan (on devserver):
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7627

Reviewed By: ajkr

Differential Revision: D24728741

Pulled By: riversand963

fbshipit-source-id: 5a0dce244e14eb1106c4f87021e6bebca82b486e
2020-11-04 10:44:56 -08:00
Yanqin Jin
fde0cd7ced Add API to verify whole sst file checksum (#7578)
Summary:
Existing API `VerifyChecksum()` allows application to verify sst files' block checksums.
Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new
API to verify sst files' file checksums.

```
// Compute table file checksums if applicable and compare with MANIFEST.
// Returns OK if no file has mismatching whole-file checksum.
Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/);
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D24436783

Pulled By: riversand963

fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3
2020-11-03 20:34:56 -08:00
Jay Zhuang
881e0dcc09 Fix MultiGet unable to query timestamp data issue (#7589)
Summary:
The filter query key should not contain timestamp. The timestamp is
stripped for Get(), but not MultiGet().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589

Reviewed By: riversand963

Differential Revision: D24494661

Pulled By: jay-zhuang

fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317
2020-11-03 09:45:41 -08:00
Andrew Kryczka
1adbceb581 Expand effect of dictionary settings in ColumnFamilyOptions::compression_opts (#7619)
Summary:
In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it
for bottom level under the assumption that the vast majority of data is
stored there. At that time, there was no
such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just
hardcoded disabling dictionary compression in flush and compactions to
non-bottommost level. Now, we have users who generate all their files
through flush and are considering using dictionary compression.

To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to
additionally include flushed files and files generated by compaction to
a non-bottommost level. Users can still get the old behavior by moving
their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts`
and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619

Reviewed By: ltamasi

Differential Revision: D24665610

Pulled By: ajkr

fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c
2020-11-02 19:21:11 -08:00
Andrew Kryczka
a388c8cc6b Add recent fixes to HISTORY.md (#7617)
Summary:
The recently reverted behavior changes were released to at least one
place internally, so we should mention the reverts in release notes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7617

Reviewed By: akankshamahajan15

Differential Revision: D24654343

Pulled By: ajkr

fbshipit-source-id: eb64b2797d8508cd95a2dc2698122c1be29ce817
2020-10-30 14:03:35 -07:00
Yanqin Jin
6134ce6444 Perform post-flush updates of memtable list in a callback (#6069)
Summary:
Currently, the following interleaving of events can lead to SuperVersion containing both immutable memtables as well as the resulting L0. This can cause Get to return incorrect result if there are merge operands. This may also affect other operations such as single deletes.

```
  time  main_thr  bg_flush_thr  bg_compact_thr  compact_thr  set_opts_thr
0  |                                                         WriteManifest:0
1  |                                           issue compact
2  |                                 wait
3  |   Merge(counter)
4  |   issue flush
5  |                   wait
6  |                                                         WriteManifest:1
7  |                                 wake up
8  |                                 write manifest
9  |                  wake up
10 |  Get(counter)
11 |                  remove imm
   V
```

The reason behind is that: one bg flush thread's installing new `Version` can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from current super version only after `LogAndApply` returns. After the leader MANIFEST writer signals (releasing mutex) this bg flush thread, it is possible that another thread sees this cf with both memtables (whose data have been flushed to the newest L0) and the L0 before this bg flush thread removes the memtables.

To address this issue, each bg flush thread can pass a callback function to `LogAndApply`. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex.

Test plan (devserver)
```
$make merge_test
$./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush
$make check
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6069

Reviewed By: cheng-chang

Differential Revision: D18790894

Pulled By: riversand963

fbshipit-source-id: e41bd600c0448b4f4b2deb3f7677f95e3076b4ed
2020-10-26 18:23:01 -07:00
Akanksha Mahajan
eef27d0048 Bug fix to remove function calling in assert statement (#7581)
Summary:
Remove function calling in assert statement as assert is a no
op in opt build and that function might not be called. This causes hang
in closing RocksDB when refit level is set.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7581

Test Plan: make check -j64

Reviewed By: riversand963

Differential Revision: D24466420

Pulled By: akankshamahajan15

fbshipit-source-id: 97db4ec5a95ae693c3290e176a3c12a9b1ad2f6d
2020-10-21 20:18:06 -07:00
anand76
00751e4292 Add a host location property to TableProperties (#7479)
Summary:
This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479

Test Plan: Add new unit tests

Reviewed By: pdillinger

Differential Revision: D24340671

Pulled By: anand1976

fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2
2020-10-19 11:38:48 -07:00
Jay Zhuang
c87c3a48af Add a missing bug fix in HISTORY.md (#7549)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7549

Reviewed By: ajkr, zhichao-cao

Differential Revision: D24292032

Pulled By: jay-zhuang

fbshipit-source-id: 0442283386ae20d10410a8d013a431d7cd282b22
2020-10-13 18:00:17 -07:00
Andrew Kryczka
3dc823212d add missing release notes to HISTORY.md (#7545)
Summary:
These notes existed on the release branches where they were backported, but were never added on master branch. Added them now and mentioned what minor release the fix originally appeared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7545

Reviewed By: riversand963

Differential Revision: D24281759

Pulled By: ajkr

fbshipit-source-id: 7422e984b667793d6260dd32a7492afcb2ff1c4b
2020-10-13 12:13:47 -07:00
Andrew Kryczka
75d3b6fdf0 Redesign block cache pinning API (#7520)
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520

Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`

Reviewed By: pdillinger

Differential Revision: D24200034

Pulled By: ajkr

fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
2020-10-11 14:58:24 -07:00
Akanksha Mahajan
9dd25487cc Update release history 6.14 (#7525)
Summary:
Update release history for 6.14

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7525

Test Plan: No code change

Reviewed By: jay-zhuang

Differential Revision: D24224690

Pulled By: akankshamahajan15

fbshipit-source-id: 95441aefde96672fea5a6af5d7e67cdafb1ebdd2
2020-10-09 16:05:22 -07:00
Akanksha Mahajan
38d0a365e3 Add Stats for MultiGet (#7366)
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
    1. Number of index and filter blocks read from file as part of MultiGet
    request per level.
    2. Number of data blocks read from file per level.
    3. Number of SST files loaded from file system per level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366

Reviewed By: anand1976

Differential Revision: D24127040

Pulled By: akankshamahajan15

fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
2020-10-07 13:28:48 -07:00
Jay Zhuang
8891e9a0eb Disallow trivial move if BottommostLevelCompaction is kForce* (#7368)
Summary:
If `BottommostLevelCompaction.kForce*` is set, compaction should avoid
trivial move and always compact the sst to the target size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7368

Reviewed By: ajkr

Differential Revision: D23629525

Pulled By: jay-zhuang

fbshipit-source-id: 79f23c79ecb31587e0593b28cce43131107bbcd0
2020-10-07 13:19:31 -07:00
Peter Dillinger
9082771b86 Add is_full_compaction to CompactionJobStats, cleanup (#7451)
Summary:
This exposes to the listener interface whether a compaction was
full or not. Also cleaned up API comment for CompactionJobInfo::stats,
which is not of a nullable type. And since CompactionJob is always
created with non-null CompactionJobStats, removed conditionals on it
being nullptr and instead assert non-null.

TODO later: update C and Java interfaces

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451

Test Plan: updated existing unit tests to check new field, make check

Reviewed By: ltamasi

Differential Revision: D23977796

Pulled By: pdillinger

fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d
2020-10-01 12:52:58 -07:00
sdong
7508175558 Introduce options.check_flush_compaction_key_order (#7467)
Summary:
Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated.
Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467

Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled.

Reviewed By: riversand963

Differential Revision: D24010683

fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c
2020-10-01 10:10:26 -07:00
Peter Dillinger
ddbc5dad05 Enable force_consistency_checks by default (#7446)
Summary:
This has been running in production on some key workloads, so
we believe it to be safe and extremely low cost. Nevertheless, I've
added code to ensure that "force_consistency_checks" is mentioned in
any corruption reports so that people know how to disable in case of
false positive corruption reports.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7446

Test Plan:
make check, CI, temporary debug print new message with
./version_builder_test

Reviewed By: ajkr

Differential Revision: D23972101

Pulled By: pdillinger

fbshipit-source-id: 9623e400f3752577c0ecf977e6d0915562cf9968
2020-09-30 11:57:32 -07:00
Akanksha Mahajan
9d212d3f0e Provide users with option to opt-in to get corrupt data in logs/messages (#7420)
Summary:
Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data.
By default value is set false to prevent users data to be exposed in the messages.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420

Test Plan:
1. make check -j64
           2. Add a new test case

Reviewed By: ajkr

Differential Revision: D23835028

Pulled By: akankshamahajan15

fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1
2020-09-29 23:17:45 -07:00
Ramkumar Vadivelu
c203e01773 reset refitting_level_ flag to false in error paths (#7403)
Summary:
Reset refitting_level_ flag to false in error paths in DBImpl::ReFitLevel()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7403

Reviewed By: ajkr

Differential Revision: D23909028

Pulled By: ramvadiv

fbshipit-source-id: 521ad9aadc1b734bef9ef9119d1e1ee1fa8126e9
2020-09-28 11:37:00 -07:00
Peter Dillinger
9d8eb77c4d Less I/O for incremental backups, slightly better corruption detection (#7413)
Summary:
Two relatively simple functional changes to incremental backup
behavior, integrated with a minor refactoring to reduce code redundancy and
improve error/log message. There are nuances to the impact of these changes,
but I believe they are fundamentally good and generally safe. Those functional
changes:

* Incremental backups no longer read DB table files that are already saved to a
shared part of the backup directory, unless `share_files_with_checksum` is used
with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file
checksums are needed to determine file naming.
  * Justification: incremental backups should not need to read the whole DB,
especially without rate limiting. (Although other BackupEngine reads are not
rate limited either, other non-trivial reads are generally limited by a
corresponding write, as in copying files.) Also, the fact that this is not
already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110.

* When considering whether a table file is already backed up in a shared part
of backup directory, BackupEngine would already query the sizes of source (DB)
and pre-existing destination (backup) files. BackupEngine now uses these file
sizes to detect corruption, as at least one of (a) old backup, (b) backup in
progress, or (c) current DB is corrupt if there's a size mismatch.
  * Justification: a random related fix that also helps to cover a small hole
in corruption checking uncovered by the other functional change:
  * For `share_table_files` without "checksum" (not recommended), the other
change regresses in detecting fundamentally unsafe use of this option
combination: when you might generate different versions of same SST file
number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this
regression is greatly mitigated by the new file size checking. Nevertheless,
almost no reason to use `share_files_with_checksum=false` should remain, and
comments are updated appropriately.

Also, this change renames internal function `CalculateChecksum` to
`ReadFileAndComputeChecksum` to make the performance impact of this function
clear in code reviews.

It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it
cannot be true for a DB with unique file names (like DBImpl). Nevertheless,
I've tried to keep its functionality intact when `true` to minimize risk for
now, despite having no unit tests for which it is true.

Select impact details (much more in unit tests): For
`share_files_with_checksum`, I am confident there is no regression (vs.
pre-6.12) in detecting DB or backup corruption at backup creation time, mostly
because the old design did not leverage this extra checksum computation for
detecting inconsistencies at backup creation time. (With computed checksums in
names, a recently corrupted file just looked like a different file vs. what was
already backed up.)

Even in the hypothetical case of DB session id collision (~100 bits entropy
collision), file size in name and/or our file size check add an extra layer of
protection against false success in creating an accurate new backup. (Unit test
included.)

`DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking
are still able to catch corruptions that `CreateNewBackup` does not. Note that
when custom file checksum support is added to BackupEngine, that will
essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`.
We could add options for `CreateNewBackup` to cover some of what would be
caught by `VerifyBackup` with checksum checking.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413

Test Plan:
Two new unit tests included, both of which fail without these
changes. Although we don't test the I/O improvement directly, we test it
indirectly in DB corruption detection power that was inadvertently unlocked
with new backup file naming PLUS computing current content checksums (now
removed). (I don't think that case of DB corruption detection justifies reading
the whole DB on incremental backup.)

Reviewed By: zhichao-cao

Differential Revision: D23818480

Pulled By: pdillinger

fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac
2020-09-21 16:19:24 -07:00
Peter Dillinger
52691703fc Update HISTORY.md for #7346 (#7417)
Summary:
Copied from Andrew's entry for 6.12.3. Inserted here
retroactive to 6.12

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7417

Test Plan: no code change

Reviewed By: jay-zhuang

Differential Revision: D23815980

Pulled By: pdillinger

fbshipit-source-id: 3c8a052cdb61be1215d311556c9487f9ea5c8cb0
2020-09-21 09:47:36 -07:00
Peter Dillinger
b475a83f9d Postponing custom checksum support in BackupEngine (#7411)
Summary:
This change reverts BackupEngine to 6.12 state to accommodate a
higher-priority fix that does not easily merge with this custom checksum
support. We intend to reinstate this support soon, by merging a revert
of this change.

For backupable_db_test, I've removed the tests depending on this
feature.

I've also removed relevant HISTORY.md entry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411

Test Plan: unit tests

Reviewed By: ajkr

Differential Revision: D23793835

Pulled By: pdillinger

fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052
2020-09-18 15:27:03 -07:00
Zhichao Cao
c268628c25 Map retryable IO error during Flush without WAL to soft error and no switch memtable during resume (#7310)
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310

Test Plan: adding new unit test, pass make check.

Reviewed By: anand1976

Differential Revision: D23710892

Pulled By: zhichao-cao

fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
2020-09-17 20:25:45 -07:00
anand76
b9750c7c3c Update HISTORY.md with IO fencing error code (#7402)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7402

Reviewed By: pdillinger

Differential Revision: D23761689

Pulled By: anand1976

fbshipit-source-id: 59e10f0aaa80f6c0f5a46dc99467138c4cee0511
2020-09-17 11:31:24 -07:00
Peter Dillinger
93719fc953 Restore file size in backup table file names (and other cleanup) (#7400)
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.

This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.

We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).

We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.

Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`

Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.

Actual file name patterns and priorities, as regexes:

    kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
      [0-9]+_[0-9]+_[0-9]+[.]sst
    kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
      [0-9]+_[0-9a-fA-F-]+[.]sst
    kUseDbSessionId AND NOT kFlagIncludeFileSize ->
      [0-9]+_s[0-9A-Z]{20}[.]sst
    kUseDbSessionId AND kFlagIncludeFileSize (default) ->
      [0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst

We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.

This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400

Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.

Reviewed By: ajkr

Differential Revision: D23759587

Pulled By: pdillinger

fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
2020-09-17 10:24:22 -07:00
Peter Dillinger
7780a360eb Fix HISTORY.md and check_format_compatible.sh for 6.13 branch (#7401)
Summary:
Make "unreleased" section for HISTORY.md with things misplaced
into 6.12 and 6.13

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7401

Test Plan: see how it goes, and `git diff origin/6.13.fb HISTORY.md`

Reviewed By: jay-zhuang

Differential Revision: D23759740

Pulled By: pdillinger

fbshipit-source-id: fc441916c7ff2bbb8d5384137653b340d4c47674
2020-09-17 09:00:13 -07:00
mrambacher
67bd5401e9 Changes to EncryptedEnv public API (#7279)
Summary:
Cleaned up the public API to use the EncryptedEnv.  This change will allow providers to be developed and added to the system easier in the future.  It will also allow better integration in the future with the OPTIONS file.

- The internal classes were moved out of the public API into an internal "env_encryption_ctr.h" header.  Short-cut constructors were added to provide the original API functionality.
- The APIs to the constructors were changed to take shared_ptr, rather than raw pointers or references to allow better memory management and alternative implementations.
- CreateFromString methods were added to allow future expansion to other provider and cipher implementations through a standard API.

Additionally, there was a code duplication in the NewXXXFile methods.  This common code was moved under a templatized function.

A first-pass at structuring the code was made to potentially allow multiple EncryptionProviders in a single EncryptedEnv.  The idea was that different providers may use different cipher keys or different versions/algorithms.  The EncryptedEnv should have some means of picking different providers based on information.  The groundwork was started for this (the use of the provider_ member variable was localized) but the work has not been completed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7279

Reviewed By: jay-zhuang

Differential Revision: D23709440

Pulled By: zhichao-cao

fbshipit-source-id: 0e845fff0e03a52603eb9672b4ade32d063ff2f2
2020-09-15 17:14:10 -07:00
mrambacher
7d472accdc Bring the Configurable options together (#5753)
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753

Reviewed By: ajkr

Differential Revision: D23385030

Pulled By: zhichao-cao

fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
2020-09-14 17:01:01 -07:00
Peter Dillinger
ecc8ffe17b Update master to version 6.13 (#7378)
Summary:
for release fork

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7378

Test Plan: make check + CI

Reviewed By: jay-zhuang

Differential Revision: D23669163

Pulled By: pdillinger

fbshipit-source-id: 14cbf95b32717c28418c71cc8e10f06733bbc49f
2020-09-12 13:18:09 -07:00
Yanqin Jin
205e577694 Cancel tombstone skipping during bottommost compaction (#7356)
Summary:
During bottommost compaction, RocksDB cannot simply drop a tombstone if
this tombstone is not in the earliest snapshot. The current behavior is: RocksDB
skips other internal keys (of the same user key) in the same snapshot range. In
the meantime, RocksDB should check for the `shutting_down` flag. Otherwise, it
is possible for a bottommost compaction that has already started running to take
a long time to finish, even if the application has tried to cancel all background jobs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7356

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D23663241

Pulled By: riversand963

fbshipit-source-id: 25f8e9b51bc3bfa3353cdf87557800f9d90ee0b5
2020-09-11 17:45:43 -07:00
Peter Dillinger
92639b93a6 Fix checkpoint file deletion race with avoid_unnecessary_blocking_io (#7369)
Summary:
https://github.com/facebook/rocksdb/issues/3341 guaranteed that upon return of `GetSortedWalFiles` after
`DisableFileDeletions`, all pending purges of previously obsolete WAL
files will have finished. However, the addition of
avoid_unnecessary_blocking_io in https://github.com/facebook/rocksdb/issues/5043 opened a hole in the code making
that assurance, which can lead to files to be copied for checkpoint or
backup going missing before being copied, with that option enabled.

This change patches the hole.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7369

Test Plan:
apparent fix to backups in crash test observed. Will work
on a unit test for another commit

Reviewed By: ajkr

Differential Revision: D23620258

Pulled By: pdillinger

fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614
2020-09-10 22:35:25 -07:00
Yanqin Jin
8307d4400c Update HISTORY.md for PR7329 (#7355)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7355

Reviewed By: pdillinger

Differential Revision: D23566635

Pulled By: riversand963

fbshipit-source-id: f8d846bcff637e7617b764b7bfb9a948ea18d195
2020-09-08 11:10:25 -07:00
Andrew Kryczka
5746767387 add ldb unsafe_remove_sst_file subcommand (#7335)
Summary:
This is adapted from https://github.com/facebook/rocksdb/issues/6678 but takes a different approach, avoiding opening a read-write DB and avoiding the `DeleteFile()` API.

First, this PR refactors how options variables are initialized in `ldb` so it can be reused in a subcommand that doesn't open a DB:

- Separated remaining option initialization logic out of `OpenDB()`. The new `PrepareOptions()` function initializes the full options state.
- Fixed an old TODO about applying the subcommand CF option overrides to the proper `ColumnFamilyOptions` object.

Second, this PR adds the `ldb unsafe_remove_sst_file` subcommand. It uses the `VersionSet`-level APIs to remove the file with the specified number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7335

Test Plan: played with interactive python and this file removal command. Verified openability/correct results in case of multiple column families, multiple levels, etc.

Reviewed By: pdillinger

Differential Revision: D23454575

Pulled By: ajkr

fbshipit-source-id: 039b7a8cbfc42fd123dcb25821eef51d61148afe
2020-09-03 16:54:51 -07:00
Andrew Kryczka
40e97b02be add warning on DeleteFile() API (#7337)
Summary:
Since we can't land https://github.com/facebook/rocksdb/issues/7336 until the next major release, added a strong warning against the `DeleteFile()` API in the meantime.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7337

Reviewed By: pdillinger

Differential Revision: D23459728

Pulled By: ajkr

fbshipit-source-id: 326cb9b18190386080c35c761a8736d8a877dafb
2020-09-03 16:42:01 -07:00
Andrew Kryczka
af54c4092a fix SstFileWriter with dictionary compression (#7323)
Summary:
In block-based table builder, the cut-over from buffered to unbuffered
mode involves sampling the buffered blocks and generating a dictionary.
There was a bug where `SstFileWriter` passed zero as the `target_file_size`
causing the cutover to happen immediately, so there were no samples
available for generating the dictionary.

This PR changes the meaning of `target_file_size == 0` to mean buffer
the whole file before cutting over. It also adds dictionary compression
support to `sst_dump --command=recompress` for easy evaluation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7323

Reviewed By: cheng-chang

Differential Revision: D23412158

Pulled By: ajkr

fbshipit-source-id: 3b232050e70ef3c2ee85a4b5f6fadb139c569873
2020-09-03 15:49:57 -07:00
Hiep
d0c1a01c1b Avoid converting MERGES to PUTS when allow_ingest_behind is true (#7166)
Summary:
- Closes https://github.com/facebook/rocksdb/issues/6490
- Currently MERGEs are converted to PUTs at bottom or compaction has reached the beginning of the key, this can wrongly cover a PUT future base case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166

Test Plan:
- Automated: `make all check`
- Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then run ingesting external files to make sure the base case is probably compacted with existing Merges.

Reviewed By: cheng-chang

Differential Revision: D23325425

Pulled By: ajkr

fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68
2020-09-03 14:39:58 -07:00
Andrew Kryczka
177f8bd063 Bound L0->Lbase fanout in dynamic leveled compaction (#7325)
Summary:
L0 score is based on size target and number of files. The size target
used is `max_bytes_for_level_base`. However, the base level's size can
dynamically expand in write burst mode. In fact, it can expand so much
that L0->Lbase becomes the highest fanout in target sizes. This doesn't
make sense from an efficiency perspective, so this PR bounds the
L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based
on file count remains unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325

Test Plan:
contrived benchmark that exhibits the problem:

```
$ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000
```

Results:

- "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark
- "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction

Branch | Burst W-Amp | Total W-Amp | fillrandom (MB/s)
-- | -- | -- | --
master | 20.2 | 21.5 | 4.7
dynamic-l0-score | 12.6 | 14.1 | 7.2

Reviewed By: siying

Differential Revision: D23412935

Pulled By: ajkr

fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e
2020-09-01 19:34:01 -07:00
Akanksha Mahajan
963314ffd6 Add unit test for max_write_buffer_size_to_maintain (#7311)
Summary:
Add a unit test case to check memory usage when
max_write_buffer_size_to_maintain is set if flushed immutable memtables are
trimmed timely or not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7311

Test Plan: Compared the results with before bug fix.

Reviewed By: ltamasi

Differential Revision: D23321702

Pulled By: akankshamahajan15

fbshipit-source-id: da04ee21137d641a07fd499a9e2749eb036fcb1e
2020-08-28 17:38:05 -07:00
Jay Zhuang
c2485f2d81 Add buffer prefetch support for non directIO usecase (#7312)
Summary:
A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312

Reviewed By: anand1976

Differential Revision: D23329847

Pulled By: jay-zhuang

fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527
2020-08-27 18:16:53 -07:00
Peter Dillinger
9aad24da55 Real fix for race in backup custom checksum checking (#7309)
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.

This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.

Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309

Test Plan: updated unit test, before and after

Reviewed By: ajkr

Differential Revision: D23311994

Pulled By: pdillinger

fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
2020-08-26 10:39:20 -07:00
sdong
722814e357 Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297)
Summary:
Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297

Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it.

Reviewed By: anand1976

Differential Revision: D23257950

fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
2020-08-25 19:01:05 -07:00
Zhichao Cao
d51f88c9e4 Pass SST file checksum information through OnTableFileCreated (#7108)
Summary:
When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108

Test Plan: make check, listener_test.

Reviewed By: ajkr

Differential Revision: D22470240

Pulled By: zhichao-cao

fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
2020-08-25 10:46:11 -07:00
Connor1996
416943bf28 Eliminates a no-op compaction upon snapshot release when disabling auto compactions (#7267)
Summary:
After releasing a snapshot, it checks whether it is suitable to trigger bottom compactions.
When disabling auto compactions, it may still schedule compaction when releasing a snapshot. Whereas no compaction job will be actually handled, so the state of LSM is not changed and compaction will be triggered again and again every time releasing a snapshot.

Too frequent compactions lead to high CPU usage and high db_mutex lock contention which affects foreground write duration finally.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7267

Test Plan:
- make check
- manual test

Reviewed By: akankshamahajan15

Differential Revision: D23252880

Pulled By: ajkr

fbshipit-source-id: 4431e071a35d9912a2a3592875db27bae521434b
2020-08-24 22:06:45 -07:00
Akanksha Mahajan
3844612625 Bug Fix for memtables not trimmed down. (#7296)
Summary:
When a memtable is trimmed in MemTableListVersion, the memtable
is only added to delete list if it is
the last reference. However it is not the last reference as it is held
by the super version. But the super version would not be switched if the
delete list is empty. So the memtable is never destroyed and memory
usage increases beyond write_buffer_size +
max_write_buffer_size_to_maintain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296

Test Plan:
1.  ./db_bench -benchmarks=randomtransaction
-optimistic_transaction_db=1 -statistics -stats_interval_seconds=1
-duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000
--transaction_set_snapshot

Reviewed By: ltamasi

Differential Revision: D23267395

Pulled By: akankshamahajan15

fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650
2020-08-21 13:29:05 -07:00
Peter Dillinger
a1b5484811 Work around a backup bug with DB custom checksums (#7294)
Summary:
On a read-write DB configured with
DBOptions::file_checksum_gen_factory, BackupEngine::CreateNewBackup can
fail intermittently, with non-OK status. This is due to a race between
GetLiveFiles and GetLiveFilesChecksumInfo in creating backups.

For patching 6.12 release (as this commit is intended for, except this is a
forward-merged version), we can simply treat files for which we falsely failed
to get checksum info as legacy files lacking checksum info.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7294

Test Plan: unit test reproducer included

Reviewed By: ajkr

Differential Revision: D23253489

Pulled By: pdillinger

fbshipit-source-id: 9e4945dad120b776ad3e753be10b962f61f28e14
2020-08-21 08:16:04 -07:00
Andrew Kryczka
5d5ff82408 Disable recycle_log_file_num with kTolerateCorruptedTailRecords (#7271)
Summary:
The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271

Reviewed By: riversand963

Differential Revision: D23169923

Pulled By: ajkr

fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50
2020-08-17 18:21:10 -07:00
Yanqin Jin
92593d511a Add a new EntryType for deletion with timestamp (#7195)
Summary:
Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22914704

Pulled By: riversand963

fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39
2020-08-17 16:26:06 -07:00
sdong
1760637539 CompactRange() refit level should confirm destination level is not empty (#7261)
Summary:
There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261

Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds.

Reviewed By: ajkr

Differential Revision: D23142269

fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91
2020-08-17 14:21:53 -07:00
Jay Zhuang
69760b4d05 Introduce a global StatsDumpScheduler for stats dumping (#7223)
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223

Reviewed By: riversand963

Differential Revision: D23056737

Pulled By: jay-zhuang

fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
2020-08-14 20:12:44 -07:00
Andrew Kryczka
a1aa3f8385 Disable manual compaction during ReFitLevel() (#7250)
Summary:
Manual compaction with `CompactRangeOptions::change_levels` set could
refit to a level targeted by another manual compaction. If
force_consistency_checks were disabled, it could be possible for
overlapping files to be written at that target level.

This PR prevents the possibility by calling `DisableManualCompaction()`
prior to `ReFitLevel()`. It also improves the manual compaction disabling
mechanism to wait for pending manual compactions to complete before
returning, and support disabling from multiple threads.

Fixes https://github.com/facebook/rocksdb/issues/6432.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250

Test Plan:
crash test command that repro'd the bug reliably:

```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
```

Reviewed By: ltamasi

Differential Revision: D23090800

Pulled By: ajkr

fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
2020-08-14 11:29:52 -07:00
Zitan Chen
b578ca2e4d BackupEngine supports custom file checksums (#7085)
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.

Tests are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22390756

Pulled By: gg814

fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
2020-08-12 13:31:09 -07:00
sdong
41c328fe57 Fix a perf regression that caused every key to go through upper bound check (#7209)
Summary:
https://github.com/facebook/rocksdb/pull/5289 introduces a performance regression that caused an upper bound check within every BlockBasedTableIterator::Next(). This is unnecessary if we've checked the boundary key for current block and it is within upper bound.

Fix the bug. Also rename the boolean to a enum so that the code is slightly better readable. The original regression was probably to fix a bug that the block upper bound check status is not reset after a new block is created. Fix it bug so that the regression can be avoided without hitting the bug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7209

Test Plan: Run all existing tests. Will run atomic black box crash test for a while.

Reviewed By: anand1976

Differential Revision: D22859246

fbshipit-source-id: cbdad1f5e656c55fd8b71726d5a4f6cb53ff9140
2020-08-04 11:30:09 -07:00
Yanqin Jin
a38f04ac26 Update HISTORY and version for 6.12 release (#7194)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7194

Reviewed By: gg814

Differential Revision: D22810654

Pulled By: riversand963

fbshipit-source-id: 01f13089fa2b7e31b827da3e30c90e5c62c41380
2020-07-29 10:13:21 -07:00
Tomas Kolda
cd4592c220 SST Partitioner interface that allows to split SST files (#6957)
Summary:
SST Partitioner interface that allows to split SST files during compactions.

It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957

Reviewed By: ajkr

Differential Revision: D22461239

fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
2020-07-24 13:44:49 -07:00
Cheng Chang
7af1fab443 Update HISTORY (#7158)
Summary:
Mention the MultiRead bug in HISTORY.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7158

Test Plan: N/A

Reviewed By: siying

Differential Revision: D22670565

Pulled By: cheng-chang

fbshipit-source-id: 16abf0192957be66511f6a08e00157bfd37b189f
2020-07-23 08:47:13 -07:00
Jay Zhuang
b0c5ecd6b3 Make max_subcompactions dynamically changeable (#7159)
Summary:
Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159

Reviewed By: siying

Differential Revision: D22671238

Pulled By: jay-zhuang

fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42
2020-07-22 18:32:52 -07:00
mrambacher
d44cbc5314 Add hash of key/value checks when paranoid_file_checks=true (#7134)
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file.  If the values do not match, the SST file is rejected.

Code put in place for the check for both flush and compaction jobs.  Corresponding test added to corruption_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134

Reviewed By: cheng-chang

Differential Revision: D22646149

fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
2020-07-22 11:04:40 -07:00
Haosen Wen
dbc51adbac Use steady_clock instead of system_clock in FileOperationInfo::TimePoint (#7153)
Summary:
Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem.
`steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153

Test Plan: make check.

Reviewed By: riversand963

Differential Revision: D22654136

Pulled By: roghnin

fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85
2020-07-22 08:55:02 -07:00
Zitan Chen
b923dc720b BackupEngine computes table checksums only once if db session ids are available (#7110)
Summary:
BackupEngine requires computing table checksums twice when backing up table files to the `shared_checksum` directory.

The repeated computation can be avoided by utilizing the db session id stored as a part of the table properties.

Filenames of table files in the `shared_checksum` directory depend on the following conditions:
1. the naming scheme is `kOptionalChecksumAndDbSessionId`,
2. `db_session_id` is not empty,
3. checksum is available in the DB manifest.

If 1,2,3 are satisfied, then the filenames will be of the form `<file_number>_<checksum>_<db_session_id>.sst`.
If 1,2 are satisfied, then the filenames will be of the form `<file_number>_<db_session_id>.sst`.
In all other cases, the filenames are of the form `<file_number>_<checksum>_<size>.sst`.

Additionally, if `kOptionalChecksumAndDbSessionId` is used (and not falling back to `kChecksumAndFileSize`), the `<checksum>` appeared in the filenames is hexadecimally encoded, instead of being plain `uint32_t` value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7110

Test Plan: backupable_db_test and manual tests.

Reviewed By: ajkr

Differential Revision: D22508992

Pulled By: gg814

fbshipit-source-id: 5669f0ea9ad5a097f69f6d87aca4abba15032389
2020-07-21 10:35:40 -07:00
Andrew Kryczka
9a83fd21e6 stagger first DumpMallocStats after opening DB (#7145)
Summary:
Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145

Reviewed By: riversand963

Differential Revision: D22593031

Pulled By: ajkr

fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6
2020-07-17 16:13:26 -07:00
Zhichao Cao
a10f12eda1 Auto resume the DB from Retryable IO Error (#6765)
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765

Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check

Reviewed By: anand1976

Differential Revision: D21916789

Pulled By: zhichao-cao

fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
2020-07-15 11:03:58 -07:00
Yanqin Jin
27735dea9a Report corrupted keys during compaction (#7124)
Summary:
Currently, RocksDB lets compaction to go through even in case of
corrupted keys, the number of which is reported in CompactionJobStats.
However, RocksDB does not check this value. We should let compaction run
in a stricter mode.

Temporarily disable two tests that allow corrupted keys in compaction.
With this PR, the two tests will assert(false) and terminate. Still need
to investigate what is the recommended google-test way of doing it.
Death test (EXPECT_DEATH) in gtest has warnings now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22530722

Pulled By: riversand963

fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6
2020-07-14 17:18:17 -07:00
Andrew Kryczka
82611ee25a save key comparisons in BlockIter::BinarySeek (#7068)
Summary:
This is a followup to https://github.com/facebook/rocksdb/issues/6646. In that PR, for simplicity I just appended a comparison against the 0th restart key in case `BinarySeek()`'s binary search landed at index 0. As a result there were `2/(N+1) + log_2(N)` key comparisons. This PR does it differently. Now we expand the binary search range by one so it also covers the case where target is at or before the restart key at index 0. As a result, it involves `log_2(N+1)` key comparisons.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7068

Test Plan:
ran readrandom with mostly default settings and counted key comparisons
using `PerfContext`.

before: `user_key_comparison_count = 28881965`
after: `user_key_comparison_count = 27823245`

setup command:

```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```

benchmark command:

```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```

Reviewed By: anand1976

Differential Revision: D22357032

Pulled By: ajkr

fbshipit-source-id: 8b01e9c1c2a4e9d02fc9dfe16c1cc0327f8bdf24
2020-07-09 12:27:20 -07:00
Akanksha Mahajan
54f171fe90 Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7096)
Summary:
When format_version is high enough to support user-key and
there are index entries for same user key that spans multiple data
blocks then it changes from user-key mode to internal-key mode. But the
flush policy is not reset to point to Block Builder of internal-keys.
After this switch, no entries are added to user key index partition
result, thus it never triggers flushing the block.

Fix: 1. After adding the entry in sub_builder_index_, if there is a switch
from user-key to internal-key, then flush policy is updated to point to
Block Builder of internal-keys index partition.
2. Set sub_builder_index_->seperator_is_key_plus_seq_ = true if
seperator_is_key_plus_seq_  is set to true so that subsequent partitions
can also use internal key mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7096

Test Plan: make check -j64

Reviewed By: ajkr

Differential Revision: D22416598

Pulled By: akankshamahajan15

fbshipit-source-id: 01fc2dc07ea1b32f8fb803995ebe6e9a3fbe67ac
2020-07-08 21:03:04 -07:00
wenh
226d1f9c73 extend listener callback functions to more file I/O operations (#7055)
Summary:
Currently, `EventListener` in listner.h only have callback functions for file read and write. One may favor extended callback functions for more file I/O operations like flush, sync and close. This PR tries to add those interface and have them called when appropriate throughout the code base.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7055

Test Plan:
Write an experimental listener with those new callback functions with log output in them; run experiments and check logs to see those functions are actually called.
Default test suits `make check` should also be included.

Reviewed By: riversand963

Differential Revision: D22380624

Pulled By: roghnin

fbshipit-source-id: 4121491d45c2c2aae8c255e7998090559a241c6a
2020-07-07 18:21:18 -07:00
Andrew Kryczka
dd29ad4223 Separate internal and user key comparators in BlockIter (#6944)
Summary:
Replace `BlockIter::comparator_` and `IndexBlockIter::user_comparator_wrapper_` with a concrete `UserComparatorWrapper` and `InternalKeyComparator`. The motivation for this change was the inconvenience of not knowing the concrete type of `BlockIter::comparator_`, which prevented calling specialized internal key comparison functions to optimize comparison of keys with global seqno applied.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6944

Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.

```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```

benchmark run command:

```
$ TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=$SEEK_NEXT -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=0 -threads=1 -reads=200000000 -mmap_read=1 -verify_checksum=false
```

results: perf improved marginally for ingestion_db and did not change significantly for normal_db:

SEEK_NEXT | DB | code | ops/sec | % change
-- | -- | -- | -- | --
0 | normal_db | master | 350880 |  
0 | normal_db | PR6944 | 351040 | 0.0
0 | ingestion_db | master | 343255 |  
0 | ingestion_db | PR6944 | 349424 | 1.8
10 | normal_db | master | 218711 |  
10 | normal_db | PR6944 | 217892 | -0.4
10 | ingestion_db | master | 220334 |  
10 | ingestion_db | PR6944 | 226437 | 2.8

Reviewed By: pdillinger

Differential Revision: D21924676

Pulled By: ajkr

fbshipit-source-id: ea4288a2eefa8112eb6c651a671c1de18c12e538
2020-07-07 17:26:16 -07:00
Zitan Chen
373d5ac485 BackupEngine verifies table file checksums on creating new backups (#7015)
Summary:
When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory.

After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process.

No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function.

In addition, If `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files.

Corresponding tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22165732

Pulled By: gg814

fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec
2020-07-02 18:15:12 -07:00
Peter Dillinger
a680a7ea37 Un-revert #7049, revert #7022 (#7071)
Summary:
Even though local bisection gave me a clear signal (and still does) that reverting https://github.com/facebook/rocksdb/issues/7049 would fix the failures in MultiThreadedDBTest, https://github.com/facebook/rocksdb/issues/7022 seems to be the root cause. Reverting https://github.com/facebook/rocksdb/issues/7022 and keeping https://github.com/facebook/rocksdb/issues/7049 seems to fix the issue in local reproducer also. (Had these landed in opposite order, bisection would have found the root cause.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7071

Reviewed By: akankshamahajan15

Differential Revision: D22362857

Pulled By: pdillinger

fbshipit-source-id: ed63df3d74e9d4ce1604de8fe43b216166c7a3f0
2020-07-02 13:30:41 -07:00
Akanksha Mahajan
5edfe3a3d8 Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7022)
Summary:
When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block.

Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022

Test Plan:
1. make check -j64
           2. Added one unit test case

Reviewed By: ajkr

Differential Revision: D22197734

Pulled By: akankshamahajan15

fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e
2020-07-01 14:58:08 -07:00
Zitan Chen
6a243b3ade Generalize BackupEngine naming option for share_files_with_checksum SSTs and revert BackupEngine::VerifyBackup to check only file sizes by default (#7032)
Summary:
`bool BackupableDBOptions::new_naming_for_backup_files` is updated to `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming`, where `BackupTableNameOption` is an `enum` type with two enumerators `kChecksumAndFileSize` and `kChecksumAndFileSize`. This opens up possibilities of extenting the current naming scheme for backup table files. By default, `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is set to `kChecksumAndDbSessionId`.

Revert `BackupEngine::VerifyBackup` to only check file sizes by default.

Also fix the construction of the `SstFileDumper` in `GetFileDbIdentities` by setting a proper `Env` of the `Options` passed in the constructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7032

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22237763

Pulled By: gg814

fbshipit-source-id: 466902a4e731babd64e30f0e82ca1aa82962e52e
2020-06-30 18:47:16 -07:00
Burton Li
5be2cb6948 Compaction filter support for BlobDB (#6850)
Summary:
Added compaction filter support for BlobDB non-TTL values. Same as vanilla RocksDB, user compaction filter applies to all k/v pairs of the compaction for non-TTL values. It honors `min_blob_size`, which potentially results value transitions between inlined data and stored-in-blob data when size of value is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6850

Reviewed By: siying

Differential Revision: D22263487

Pulled By: ltamasi

fbshipit-source-id: 8fc03f8cde2a5c831e63b436b3dbf1b7f90939e8
2020-06-29 17:32:14 -07:00
Zitan Chen
1569dc48f5 BackupEngine::VerifyBackup verifies checksum by default (#7014)
Summary:
A parameter `verify_with_checksum` is added to `BackupEngine::VerifyBackup`, which is true by default. So now `BackupEngine::VerifyBackup` verifies backup files with checksum AND file size by default. When `verify_with_checksum` is false, `BackupEngine::VerifyBackup` only compares file sizes to verify backup files.

Also add a test for the case when corruption does not change the file size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7014

Test Plan: Passed backupable_db_test

Reviewed By: zhichao-cao

Differential Revision: D22165590

Pulled By: gg814

fbshipit-source-id: 606a7450714e868bceb38598c89fd356c6004f4f
2020-06-26 11:42:12 -07:00
Zitan Chen
95fbb62c44 Update HISTORY.md to include the Public API Change for DB::OpenForReadonly introduced earlier (#7023)
Summary:
`DB::OpenForReadOnly()` now returns `Status::NotFound` when the specified DB directory does not exist. Previously the error returned depended on the underlying `Env`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7023

Reviewed By: ajkr

Differential Revision: D22207845

Pulled By: gg814

fbshipit-source-id: f35830811a0e67efb0ee82eda3a9739bc526baba
2020-06-25 06:14:29 -07:00