Commit Graph

2164 Commits

Author SHA1 Message Date
mrambacher
3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
Peter Dillinger
589ea6bec2 Add BackupEngine API for backup file details (#8042)
Summary:
This API can be used for things like determining how much space
can be freed up by deleting a particular backup, etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042

Test Plan:
validation of the API added to many existing backup unit
tests

Reviewed By: mrambacher

Differential Revision: D26936577

Pulled By: pdillinger

fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a
2021-03-12 11:03:54 -08:00
Yanqin Jin
82b3888433 Enable backward iterator for keys with user-defined timestamp (#8035)
Summary:
This PR does the following:

- Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet.
- Introduces a new helper API `Comparator::EqualWithoutTimestamp()`.
- Fix a typo in `SetTimestamp()`.
- Add/update unit tests

Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps.

1. Disable timestamp, run:
```
./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Results:
> Baseline
> - seekrandom [AVG    6 runs] : 96115 ops/sec;   53.2 MB/sec
> - seekrandom [MEDIAN 6 runs] : 98075 ops/sec;   54.2 MB/sec
>
> This PR
> - seekrandom [AVG    6 runs] : 95521 ops/sec;   52.8 MB/sec
> - seekrandom [MEDIAN 6 runs] : 96338 ops/sec;   53.3 MB/sec

2. Enable timestamp, run:
```
./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Result:
> Baseline: not supported
>
> This PR
> - seekrandom [AVG    6 runs] : 90514 ops/sec;   50.1 MB/sec
> - seekrandom [MEDIAN 6 runs] : 90834 ops/sec;   50.2 MB/sec

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035

Reviewed By: ltamasi

Differential Revision: D26926668

Pulled By: riversand963

fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5
2021-03-10 11:15:46 -08:00
Hans Holmberg
670567db09 Add support for custom file systems to ldb and sst_dump (#8010)
Summary:
This PR adds support for custom file systems to ldb and sst_dump by adding command line options for specifying --fs_uri and --backup_fs uri (for ldb backup/restore commands). fs_uri is already supported in db_bench and db_stress, and there is already support in ldb and db stress for specifying customized envs.

The PR also fixes what looks like a bug in the ldb backup/restore commands. As it is right now, backups can only be made from and to the same environment/file system which does not seem to be the intended behavior. This PR makes it possible to do/restore backups between different envs/file systems.

Example:
`./ldb backup --fs_uri=zenfs://dev:nvme2n1 --backup_fs_uri=posix:// --backup_dir=/tmp/my_rocksdb_backup  --db=rocksdbtest/dbbench
`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8010

Reviewed By: jay-zhuang

Differential Revision: D26904654

Pulled By: ajkr

fbshipit-source-id: 9b695ed8b944fcc6b27c4daaa9f52e87ee2c1fb4
2021-03-09 20:49:15 -08:00
Ed rodriguez
7381dad1b1 make:Fix c header prototypes (#7994)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7994

Reviewed By: jay-zhuang

Differential Revision: D26904603

Pulled By: ajkr

fbshipit-source-id: 0af92a51de895b40c7faaa4f0870b3f63279fe21
2021-03-09 20:44:23 -08:00
Peter Dillinger
4b18c46d10 Refactor: add LineFileReader and Status::MustCheck (#8026)
Summary:
Removed confusing, awkward, and undocumented internal API
ReadOneLine and replaced with very simple LineFileReader.

In refactoring backupable_db.cc, this has the side benefit of
removing the arbitrary cap on the size of backup metadata files.

Also added Status::MustCheck to make it easy to mark a Status as
"must check." Using this, I can ensure that after
LineFileReader::ReadLine returns false the caller checks GetStatus().

Also removed some excessive conditional compilation in status.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8026

Test Plan: added unit test, and running tests with ASSERT_STATUS_CHECKED

Reviewed By: mrambacher

Differential Revision: D26831687

Pulled By: pdillinger

fbshipit-source-id: ef749c265a7a26bb13cd44f6f0f97db2955f6f0f
2021-03-09 20:12:38 -08:00
Peter Dillinger
847ca9f964 Make default share_files_with_checksum=true (#8020)
Summary:
New comment for share_files_with_checksum:
// Only used if share_table_files is set to true. Setting to false is
// DEPRECATED and potentially dangerous because in that case BackupEngine
// can lose data if backing up databases with distinct or divergent
// history, for example if restoring from a backup other than the latest,
// writing to the DB, and creating another backup. Setting to true (default)
// prevents these issues by ensuring that different table files (SSTs) with
// the same number are treated as distinct. See
// share_files_with_checksum_naming and ShareFilesNaming.

I have also removed interim option kFlagMatchInterimNaming, which is no
longer needed and was never needed for correct+compatible operation
(just performance).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020

Test Plan:
tests updated. Backward+forward compatibility verified with
SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup
options, and I manually verified shared_checksum in
/tmp/rocksdb_format_compatible_peterd/bak/current/ after run.

Reviewed By: ajkr

Differential Revision: D26786331

Pulled By: pdillinger

fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130
2021-03-09 16:27:13 -08:00
Peter Dillinger
0028e3398b Make format_version=5 new default (#8017)
Summary:
Haven't seen any production issues with new Bloom filter and
it's now > 1 year old (added in 6.6.0).

Updated check_format_compatible.sh and HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017

Test Plan: tests updated (or prior bugs fixed)

Reviewed By: ajkr

Differential Revision: D26762197

Pulled By: pdillinger

fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2
2021-03-09 12:42:53 -08:00
kshair
d2e9eab1ea Fix mis-spelling (#8001)
Summary:
concurrnet -> concurrent

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8001

Reviewed By: ajkr

Differential Revision: D26659381

Pulled By: riversand963

fbshipit-source-id: 890d102d1cf836ed3b183da66d3d56a3158017d0
2021-03-09 01:19:18 -08:00
Peter Dillinger
ce391ff84b Clarifying comments for Read() APIs (#8029)
Summary:
I recently discovered the confusing, undocumented semantics of
Read() functions in the FileSystem and Env APIs. I have added
clarification to the best of my reverse-engineered understanding, and
made a note in HISTORY.md for implementors to check their
implementations, as a subtly non-adherent implementation could lead to
RocksDB quietly ignoring some portion of a file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8029

Test Plan: no code changes

Reviewed By: anand1976

Differential Revision: D26831698

Pulled By: pdillinger

fbshipit-source-id: 208f97ff6037bc13bb2ef360b987c2640c79bd03
2021-03-05 14:42:19 -08:00
Levi Tamasi
cb25bc1128 Update compaction statistics to include the amount of data read from blob files (#8022)
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26801199

Pulled By: ltamasi

fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
2021-03-04 00:43:48 -08:00
matthewvon
4126bdc0e1 Feature: add SetBufferSize() so that managed size can be dynamic (#7961)
Summary:
This PR adds SetBufferSize() to the WriteBufferManager object.  This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows.

The buffer_size_ member variable is now atomic to match design of other changeable size_t members within WriteBufferManager.

This change is useful as is.  However, this change is also essential if someone decides they wanted to enable db_write_buffer_size modifications through the DB::SetOptions() API, i.e. no waste taking this as is.

Any format / spacing changes are due to clang-format as required by check-in automation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961

Reviewed By: ajkr

Differential Revision: D26639075

Pulled By: akankshamahajan15

fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033
2021-03-03 14:22:11 -08:00
Yanqin Jin
cef4a6c49f Compaction filter support for (new) BlobDB (#7974)
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.

The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D26509280

Pulled By: riversand963

fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
2021-02-25 16:32:35 -08:00
Akanksha Mahajan
cd79a00903 Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951)
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file. The readahead starts at 8KB and doubles on every
additional read upto BlockBasedTable::kMaxAutoReadAheadSize which is
256*1024.
This PR adds a new option BlockBasedTableOptions::max_auto_readahead_size which
replaces BlockBasedTable::kMaxAutoReadAheadSize and the new option can be
configured.
If max_auto_readahead_size is set 0 then no implicit auto prefetching will
be done. If max_auto_readahead_size provided is less than
8KB (which is initial readahead size used by rocksdb in case of
auto-readahead), readahead size will remain same as max_auto_readahead_size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951

Test Plan: Add new unit test case.

Reviewed By: anand1976

Differential Revision: D26568085

Pulled By: akankshamahajan15

fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af
2021-02-23 16:54:08 -08:00
Yanqin Jin
7343eb4a74 Update HISTORY and bump version (#7984)
Summary:
Prepare to cut 6.18.fb branch

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7984

Reviewed By: ajkr

Differential Revision: D26557151

Pulled By: riversand963

fbshipit-source-id: 8c144c807090cdae67e6655e7a17056ce8c50bc0
2021-02-19 19:21:49 -08:00
Andrew Kryczka
d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
mrambacher
4bc9df9459 Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936)
Summary:
Added a "only_mutable_options" flag to the ConfigOptions.  When set, the Configurable methods will only look at/update options that are marked as kMutable.

Fixed DB::SetOptions to allow for the update of any mutable TableFactory options.  Fixes https://github.com/facebook/rocksdb/issues/7385.

Added tests for the new flag.  Updated HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936

Reviewed By: akankshamahajan15

Differential Revision: D26389646

Pulled By: mrambacher

fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
2021-02-19 10:29:02 -08:00
Zaiyang Li
69877ac4f2 c:h export rocksdb_transactiondb_open_column_families (#7967)
Summary:
Hi, I noticed a bug in rocksdb C API, where a function is not exported and created a fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7967

Reviewed By: jay-zhuang

Differential Revision: D26505722

Pulled By: ajkr

fbshipit-source-id: 05d676dbd59ec87fe32322cda9e39e405b07178d
2021-02-18 15:51:54 -08:00
Levi Tamasi
dab4fe5bcd Add checkpoint support to BlobDB (#7959)
Summary:
The patch adds checkpoint support to BlobDB. Blob files are hard linked or
copied, depending on whether the checkpoint directory is on the same filesystem
or not, similarly to table files.

TODO: Add support for blob files to `ExportColumnFamily` and to the checksum
verification logic used by backup/restore.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7959

Test Plan: Ran `make check` and the crash test for a while.

Reviewed By: riversand963

Differential Revision: D26434768

Pulled By: ltamasi

fbshipit-source-id: 994be55a8dc08133028250760fca440d2c7c4dc5
2021-02-17 12:42:36 -08:00
Levi Tamasi
ba8008c870 Mention the new BlobDB in HISTORY.md and remove the "under construction" signs (#7969)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7969

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D26467043

Pulled By: ltamasi

fbshipit-source-id: c69a725669d18af6e911743c998e3a1db75948c0
2021-02-16 16:20:22 -08:00
Zhichao Cao
d1c510baec Handoff checksum Implementation (#7523)
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523

Test Plan: add new unit test, pass make check/ make asan_check

Reviewed By: pdillinger

Differential Revision: D24313271

Pulled By: zhichao-cao

fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
2021-02-10 22:20:32 -08:00
Jay Zhuang
cf160b98e1 Add full_history_ts_low option to compaction (#7884)
Summary:
The full_history_ts_low is used for user-defined timestamp GC
compaction, which is introduced in https://github.com/facebook/rocksdb/issues/7740, https://github.com/facebook/rocksdb/issues/7657 and https://github.com/facebook/rocksdb/issues/7655.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7884

Reviewed By: ltamasi

Differential Revision: D25982553

Pulled By: jay-zhuang

fbshipit-source-id: 36303d412d65b5d8166b6da24fa21ad85adbabee
2021-02-08 13:45:48 -08:00
Levi Tamasi
974458891c Revert "Turn on memtable bloom filter by default. (#6584)" (#7939)
Summary:
This reverts commit ee79a28963.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939

Reviewed By: siying

Differential Revision: D26298564

Pulled By: ltamasi

fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd
2021-02-06 22:34:30 -08:00
Andrew Kryczka
8d2bbdd04f Allow range deletions in *TransactionDB only when safe (#7929)
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.

Fixes https://github.com/facebook/rocksdb/issues/7913.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929

Test Plan: unit tests covering the cases it's permitted/rejected

Reviewed By: ltamasi

Differential Revision: D26240254

Pulled By: ajkr

fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
2021-02-05 15:57:26 -08:00
sdong
ee79a28963 Turn on memtable bloom filter by default. (#6584)
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584

Test Plan: Run all existing tests.

Reviewed By: pdillinger

Differential Revision: D20626739

fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
2021-02-05 12:59:46 -08:00
Stanislav Tkach
3feee6db17 Add get/set deadline and io_timeout C functions (read options) (#7914)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7914

Reviewed By: jay-zhuang

Differential Revision: D26184409

Pulled By: ajkr

fbshipit-source-id: 8e30faac5223ec80c22e2b617af67775322065d8
2021-02-04 17:00:58 -08:00
Andrew Kryczka
78ee8564ad Integrity protection for live updates to WriteBatch (#7748)
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).

The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.

When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748

Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.

Reviewed By: pdillinger

Differential Revision: D25754492

Pulled By: ajkr

fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
2021-01-29 12:18:58 -08:00
mrambacher
0a9a05ae12 Make builds reproducible (#7866)
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035

Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
      - If the GIT branch is "clean" (no changes), the date of the last git commit
      - If the branch is not clean, the current date
 - Added APIs to access the "build information", rather than accessing the strings directly.

The build_version.cc file is now regenerated whenever the library objects are rebuilt.

Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866

Reviewed By: pdillinger

Differential Revision: D26086565

Pulled By: mrambacher

fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
2021-01-28 17:42:16 -08:00
mrambacher
12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Otto Kekäläinen
acc9679cda Fix various spelling errors still found in code (#7785)
Summary:
dont -> don't
refered -> referred

Merging this would allow to decrease the size of the downstream patch at https://salsa.debian.org/mariadb-team/mariadb-10.5/-/blob/master/debian/patches/fix-spelling.patch

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7785

Reviewed By: akankshamahajan15

Differential Revision: D25761408

Pulled By: jay-zhuang

fbshipit-source-id: 290406ef2a3b05a3daeedbe3b20a00798ef581e7
2021-01-15 20:07:39 -08:00
Levi Tamasi
ffe4906192 Update version to 6.17 (#7871)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7871

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D25932233

Pulled By: ltamasi

fbshipit-source-id: 8b80b0638a4f34f21a27ba80b3eda7d75410b2e8
2021-01-15 18:53:00 -08:00
anand76
8e7b068ecc Make ldb load column family options from OPTIONS file (#7847)
Summary:
When the --try_load_options is used in conjunction with the
--column_family option, ldb incorrectly sets the ColumnFamilyOptions for
that column family to defaults. This PR fixes that by retaining from the
OPTIONS file and applying command line overrides.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7847

Test Plan: Add a unit test in ldb_cmd_test

Reviewed By: ajkr

Differential Revision: D25874720

Pulled By: anand1976

fbshipit-source-id: 04bcf23b55e5a30b5b6a59b0e5cb4faef3da7429
2021-01-11 20:56:34 -08:00
Adam Retter
4926b33742 Improvements to Env::GetChildren (#7819)
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html

There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.

Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819

Reviewed By: ajkr

Differential Revision: D25837394

Pulled By: jay-zhuang

fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
2021-01-09 09:44:34 -08:00
Akanksha Mahajan
8ed680bdb0 Add new API to report dummy entries size in cache in WriteBufferManager (#7837)
Summary:
Add new API WriteBufferManager::dummy_entries_in_cache_usage() which reports the dummy entries size stored in cache to account for DataBlocks in WriteBufferManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7837

Test Plan: Updated test ./write_buffer_manager_test

Reviewed By: ajkr

Differential Revision: D25794312

Pulled By: akankshamahajan15

fbshipit-source-id: 197f5e8701e3dc57a7df72dab1735624f90daf4b
2021-01-08 13:26:24 -08:00
Adam Retter
6e0f62f2b6 Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715)
Summary:
Third batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_compaction_filter_test
* db_compaction_test
* db_dynamic_level_test
* db_inplace_update_test
* db_sst_test
* db_tailing_iter_test
* db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715

Reviewed By: jay-zhuang

Differential Revision: D25806896

Pulled By: pdillinger

fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
2021-01-06 14:15:02 -08:00
mrambacher
e628f59e87 Create a CustomEnv class; Add WinFileSystem; Make LegacyFileSystemWrapper private (#7703)
Summary:
This PR does the following:
-> Creates a WinFileSystem class.  This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class.  A CustomEnv is an Env that takes a FileSystem as constructor argument.  I believe there will only ever be two implementations of this class (PosixEnv and WinEnv).  There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.

With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv).  These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem.  These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv).  These classes take in an Env and a FileSystem.  The core env functions are re-directed to the wrapped env.  The file system calls are redirected to the input file system
- Legacy Wrapped Env classes.  These classes take in an Env input (but no FileSystem).  The core env functions are re-directed to the wrapped env.  A "Legacy File System" is created using this env and the file system calls directed to the env itself.

With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created.  Any other use of the PosixEnv is via another wrapped env.  This cleans up some of the issues with the env construction and destruction.

Additionally, there were places in the code that required had an Env when they required a FileSystem.  Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem().  These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703

Reviewed By: zhichao-cao

Differential Revision: D25762190

Pulled By: anand1976

fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
2021-01-06 10:49:32 -08:00
Dylan Wen
159ea4702c Fix typos in comments (#7790)
Summary:
Hi there,

This PR fixes some typos in comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7790

Reviewed By: ajkr

Differential Revision: D25684213

Pulled By: zhichao-cao

fbshipit-source-id: b77026018cbdd59c9db25aa73edeb359d9962f3e
2021-01-04 11:39:14 -08:00
mrambacher
81367a4616 Eliminate the creation of ImmutableDBOptions in WBWI::GetFromBatch (#6851)
Summary:
1. Made `WriteBatchWithIndexInternal` into a class that stores the `DB*` or `DBOptions*`.

2. Changed the `GetFromBatch` method to be non-static and use an instance of the class.  Added `MergeKey` methods to perform the merge itself and return any status.

This change unifies the multiple calls to the `MergeHelper` under a single wrapped API.

Closes https://github.com/facebook/rocksdb/issues/6683

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6851

Reviewed By: ajkr

Differential Revision: D21706574

Pulled By: pdillinger

fbshipit-source-id: 6860bd64d62669aaa591846e914eed3b674e68b1
2021-01-04 09:05:46 -08:00
jbosh
edb0b1fb7f rocksdb_transaction_get_for_update now exports (#6293)
Summary:
Added missing ROCKSDB_LIBRARY_API decorator to rocksdb_transaction_get_for_update.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6293

Reviewed By: jay-zhuang

Differential Revision: D25234298

Pulled By: ajkr

fbshipit-source-id: 8a4817adaec1f445f338c8d8c59d3392925b5721
2020-12-30 15:42:59 -08:00
Sergei Petrunia
daab7603f6 Range Locking: Implementation of range locking (#7506)
Summary:
Range Locking - an implementation based on the locktree library

- Add a RangeTreeLockManager and RangeTreeLockTracker which implement
  range locking using the locktree library.
- Point locks are handled as locks on single-point ranges.
- Add a unit test: range_locking_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7506

Reviewed By: akankshamahajan15

Differential Revision: D25320703

Pulled By: cheng-chang

fbshipit-source-id: f86347384b42ba2b0257d67eca0f45f806b69da7
2020-12-22 19:12:36 -08:00
Jay Zhuang
a8aeefd0fd Update release version to 6.16 (#7782)
Summary:
Update release version to 6.8

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7782

Reviewed By: siying

Differential Revision: D25648579

Pulled By: jay-zhuang

fbshipit-source-id: c536d606868b95c5fb2ae8f19c17eb259d67bc51
2020-12-19 12:39:21 -08:00
Peter Dillinger
4d1ac19e3d aggregated-table-properties with GetMapProperty (#7779)
Summary:
So that we can more easily get aggregate live table data such
as total filter, index, and data sizes.

Also adds ldb support for getting properties

Also fixed some missing/inaccurate related comments in db.h

For example:

    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
    rocksdb.aggregated-table-properties.data_size: 102871
    rocksdb.aggregated-table-properties.filter_size: 0
    rocksdb.aggregated-table-properties.index_partitions: 0
    rocksdb.aggregated-table-properties.index_size: 2232
    rocksdb.aggregated-table-properties.num_data_blocks: 100
    rocksdb.aggregated-table-properties.num_deletions: 0
    rocksdb.aggregated-table-properties.num_entries: 15000
    rocksdb.aggregated-table-properties.num_merge_operands: 0
    rocksdb.aggregated-table-properties.num_range_deletions: 0
    rocksdb.aggregated-table-properties.raw_key_size: 288890
    rocksdb.aggregated-table-properties.raw_value_size: 198890
    rocksdb.aggregated-table-properties.top_level_index_size: 0
    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
    rocksdb.aggregated-table-properties-at-level1.data_size: 80909
    rocksdb.aggregated-table-properties-at-level1.filter_size: 0
    rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
    rocksdb.aggregated-table-properties-at-level1.index_size: 1787
    rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
    rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
    rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
    rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
    rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
    rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
    $

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779

Test Plan: Added a test to ldb_test.py

Reviewed By: jay-zhuang

Differential Revision: D25653103

Pulled By: pdillinger

fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 08:00:14 -08:00
sdong
75e4af14e0 Update code comment for options.ttl (#7775)
Summary:
The behavior of options.ttl has been updated long ago but we didn't update the code comments.
Also update the periodic compaction's comment.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7775

Test Plan: See it can still build through CI.

Reviewed By: ajkr

Differential Revision: D25592015

fbshipit-source-id: b1db18b6787e7048ce6aedcbc3bb44493c9fc49b
2020-12-18 15:29:03 -08:00
Peter Dillinger
003e72b201 Use size_t for filter APIs, protect against overflow (#7726)
Summary:
Deprecate CalculateNumEntry and replace with
ApproximateNumEntries (better name) using size_t instead of int and
uint32_t, to minimize confusing casts and bad overflow behavior
(possible though probably not realistic). Bloom sizes are now explicitly
capped at max size supported by implementations: just under 4GiB for
fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This
hardening could help to set up for fuzzing.

Also, since RocksDB only uses this information as an approximation
for trying to hit certain sizes for partitioned filters, it's more important
that the function be reasonably fast than for it to be completely
accurate. It's hard enough to be 100% accurate for Ribbon (currently
reversing CalculateSpace) that adding optimize_filters_for_memory
into the mix is just not worth trying to be 100% accurate for num
entries for bytes.

Also:
- Cleaned up filter_policy.h to remove MSVC warning handling and
potentially unsafe use of exception for "not implemented"
- Correct the number of entries limit beyond which current Ribbon
implementation falls back on Bloom instead.
- Consistently use "num_entries" rather than "num_entry"
- Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially
obsolete from general implementation
BuiltinFilterBitsBuilder::CalculateNumEntries.
- Fix filter_bench to skip some tests that don't make sense when only
one or a small number of filters has been generated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726

Test Plan:
expanded existing unit tests for CalculateSpace /
ApproximateNumEntries. Also manually used filter_bench to verify Legacy and
fv=5 Bloom size caps work (much too expensive for unit test). Note that
the actual bits per key is below requested due to space cap.

    $ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate
    ...
    Total size (MB): 511.992
    Bits/key stored: 16.777
    ...
    $ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000
    ...
    Total size (MB): 4096
    Bits/key stored: 17.1799
    ...
    $

Reviewed By: jay-zhuang

Differential Revision: D25239800

Pulled By: pdillinger

fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4
2020-12-11 22:18:12 -08:00
Sergei Petrunia
d8bd9fc7b3 Range Locking: Allow different LockManagers, add Range Lock definitions (#7443)
Summary:
This PR has two commits:
1.  Modify the code to allow different Lock Managers (of any kind) to be used.  It is implied that a LockManager uses its own custom LockTracker.
2.  Add definitions for Range Locking (class Endpoint and GetRangeLock() function.

cheng-chang, is this what you've had in mind (should the PR have both item 1 and item 2?)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7443

Reviewed By: zhichao-cao

Differential Revision: D24123172

Pulled By: cheng-chang

fbshipit-source-id: c6548ad6d4cc3c25f68d13b29147bc6fdf357185
2020-12-07 20:18:07 -08:00
Zhichao Cao
29e8f6a698 Add kManifestWriteNoWAL to BackgroundErrorReason to handle Flush IO Error when WAL is disabled (#7693)
Summary:
In the current code base, all the manifest writes with IO error will be set with reason: BackgroundErrorReason::kManifestWrite, which will be mapped to the kHardError if the IO Error is retryable. However, if the system does not use the WAL, all the retryable IO error should be mapped to kSoftError. Create this PR to handle is special case by adding kManifestWriteNoWAL to BackgroundErrorReason.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7693

Test Plan: make check, add new testing cases to error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D25066204

Pulled By: zhichao-cao

fbshipit-source-id: d59553896c2eac3fb37c05238544d2b265379462
2020-12-02 18:24:01 -08:00
Jay Zhuang
7fec715db4 Make CompactRange and GetApproximateSizes work with timestamp (#7684)
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D25015421

Pulled By: jay-zhuang

fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
2020-12-02 13:00:53 -08:00
Steve Yen
cc431ece37 Fix merge operator docs typo (#7716)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7716

Reviewed By: pdillinger

Differential Revision: D25214340

Pulled By: zhichao-cao

fbshipit-source-id: 143a8e7d076917e60bbe6993d60ec55f33e2ab56
2020-11-30 12:09:39 -08:00
anand76
7c19d43883 Fix initialization order of DBOptions and kHostnameForDbHostId (#7702)
Summary:
Fix initialization order of DBOptions and kHostnameForDbHostId by making the initialization of the latter static rather than dynamic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7702

Reviewed By: ajkr

Differential Revision: D25111633

Pulled By: anand1976

fbshipit-source-id: 7afad834a66e40bcd8694a43b40d378695212224
2020-11-19 22:39:40 -08:00
Cheng Chang
5c585e1908 Ship the track WAL in MANIFEST feature (#7689)
Summary:
Updates the option description and HISTORY.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7689

Test Plan: N/A

Reviewed By: zhichao-cao

Differential Revision: D25056238

Pulled By: cheng-chang

fbshipit-source-id: 6af1ef6f8dcf2173cbc0fccadc0e06cefd92bcae
2020-11-19 14:45:54 -08:00
Ramkumar Vadivelu
5bd1258381 Update release history to 6.15 (#7673)
Summary:
Update release history to 6.15

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7673

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D24971069

Pulled By: ramvadiv

fbshipit-source-id: 5cb3f5cbc1b19beb580ea8095acdef72cc092905
2020-11-15 12:37:24 -08:00
Peter Dillinger
60af964372 Experimental (production candidate) SST schema for Ribbon filter (#7658)
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)

Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.

### Benchmarking

```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 18.7508
  Random filter net ns/op: 258.246
    Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 58.4523
  Random filter net ns/op: 363.717
    Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```

168.166 / 238.488 = 0.705  -> 29.5% space reduction

130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction)

### Working around a hashing "flaw"

bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate.  The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant.  Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))

As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)

TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.

### Other related changes

* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658

Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.

Reviewed By: jay-zhuang

Differential Revision: D24899349

Pulled By: pdillinger

fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
2020-11-12 20:46:14 -08:00
Levi Tamasi
bbbb5a280d Add options for integrated blob GC (#7661)
Summary:
This patch simply adds a couple of options that will enable users to
configure garbage collection when using the integrated BlobDB
implementation. The actual GC logic will be added in a separate step.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7661

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24906544

Pulled By: ltamasi

fbshipit-source-id: ee0e056a712a4b4475cd90de8b27d969bd61b7e1
2020-11-12 18:58:44 -08:00
mrambacher
c442f6809f Create a Customizable class to load classes and configurations (#6590)
Summary:
The Customizable class is an extension of the Configurable class and allows instances to be created by a name/ID.  Classes that extend customizable can define their Type (e.g. "TableFactory", "Cache") and  a method to instantiate them (TableFactory::CreateFromString).  Customizable objects can be registered with the ObjectRegistry and created dynamically.

Future PRs will make more types of objects extend Customizable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6590

Reviewed By: cheng-chang

Differential Revision: D24841553

Pulled By: zhichao-cao

fbshipit-source-id: d0c2132bd932e971cbfe2c908ca2e5db30c5e155
2020-11-11 15:10:41 -08:00
Yanqin Jin
bcba372352 Report if unpinnable value encountered during backward iteration (#7618)
Summary:
There is an undocumented behavior about a certain combination of options and operations.
- inplace_update_support = true, and
- call `SeekForPrev()`, `SeekToLast()`, and/or `Prev()` on unflushed data.

We should stop the backward iteration and report an error of `Status::NotSupported`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7618

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D24769619

Pulled By: riversand963

fbshipit-source-id: 81d199fa55ed4739ab10e719cc345a992238ccbb
2020-11-10 17:17:39 -08:00
Yanqin Jin
fde0cd7ced Add API to verify whole sst file checksum (#7578)
Summary:
Existing API `VerifyChecksum()` allows application to verify sst files' block checksums.
Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new
API to verify sst files' file checksums.

```
// Compute table file checksums if applicable and compare with MANIFEST.
// Returns OK if no file has mismatching whole-file checksum.
Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/);
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D24436783

Pulled By: riversand963

fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3
2020-11-03 20:34:56 -08:00
jsteemann
2404f8b9ec slightly improve jemalloc allocator API header (#7592)
Summary:
Fix a few typos and avoid a potential nullptr dereference.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7592

Reviewed By: zhichao-cao

Differential Revision: D24582111

Pulled By: riversand963

fbshipit-source-id: 51e9260e8cad1fcdedd310c889f0faeec6efd937
2020-10-28 13:47:12 -07:00
Andrew Kryczka
6c2c0635c9 Require only one Logger::Logv() implementation (#7605)
Summary:
A user who extended `Logger` recently pointed out it is unusual to
require they implement the two-argument `Logv()` overload when they've
already implemented the three-argument `Logv()` overload. I agree with
that and think we can fix it by only calling the two-argument overload
from the default implementation of the three-argument overload. Then
when the three-argument overload is overridden, RocksDB would not
rely on the two-argument overload. Only `Logger::LogHeader()` needed
adjustment to achieve this.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7605

Reviewed By: riversand963

Differential Revision: D24584749

Pulled By: ajkr

fbshipit-source-id: 9aabe040ac761c4c0dbebc4be046967403ecaf21
2020-10-28 10:00:51 -07:00
mrambacher
f35f7f2704 Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:

1.  The MockFileSystem class was split out from the MockEnv.  This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one).  The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.

2.  Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set.  To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).

3.  Updated the test framework to have a ROCKSDB_GTEST_SKIP macro.  This can be used to flag tests that are skipped.  Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).

I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV,  both, and neither under both MacOS and RedHat.  A few tests were disabled/skipped for the MEM/ENCRYPTED cases.  The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem.  (I will also push a change to disable those tests soon).  There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.

Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.

Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale.  I do not know how to do that, so if someone could write that job, it would be appreciated :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566

Reviewed By: zhichao-cao

Differential Revision: D24408980

Pulled By: jay-zhuang

fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 10:33:09 -07:00
jmn
b1cdb8cc86 add StartTrace and EndTrace to stackable_db (#7585)
Summary:
In addition to trace block cache access, we want to support trace queries on MySQL. To achieve that StartTrace and EndTrace need to be added to the stackable_db.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7585

Reviewed By: zhichao-cao

Differential Revision: D24482306

Pulled By: nmjnmjnmj

fbshipit-source-id: de641b4837c64cd33b44b5bebaeae5d1527c8c31
2020-10-22 17:31:54 -07:00
Zhichao Cao
d8ec0a760a Make FileType Public and Replace kLogFile with kWalFile (#7580)
Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D24485420

Pulled By: zhichao-cao

fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
2020-10-22 17:06:20 -07:00
mrambacher
1eda625eab Revert Statuses returned from pre-Configurable options functions (#7563)
Summary:
Further refinement of the earlier PR.  Now the Status is NotFound with a subcode of PathNotFound. Also the existing functions for options parsing/loading are reverted to return InvalidArgument no matter in which way the user-provided arguments are deemed invalid.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7563

Reviewed By: zhichao-cao

Differential Revision: D24422491

Pulled By: ajkr

fbshipit-source-id: ba6b237cd0584d3f925c5ba0d349aeb8c250af67
2020-10-20 11:53:28 -07:00
anand76
00751e4292 Add a host location property to TableProperties (#7479)
Summary:
This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479

Test Plan: Add new unit tests

Reviewed By: pdillinger

Differential Revision: D24340671

Pulled By: anand1976

fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2
2020-10-19 11:38:48 -07:00
Cheng Chang
0ea7db768e Abstract out LockManager interface (#7532)
Summary:
In order to be able to introduce more locking protocols, we need to abstract out the locking subsystem in TransactionDB into a set of interfaces.

PR https://github.com/facebook/rocksdb/pull/7013 introduces interface `LockTracker`. This PR is a follow up to take the first step to abstract out a `LockManager` interface.

Further modifications to the interface may be needed when introducing the first implementation of range lock. But the idea here is to put the range lock implementation based on range tree under the `utilities/transactions/lock/range/range_tree`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7532

Test Plan: point_lock_manager_test

Reviewed By: ajkr

Differential Revision: D24238731

Pulled By: cheng-chang

fbshipit-source-id: 2a9458cd8b3fb008d9529dbc4d3b28c24631f463
2020-10-19 10:14:42 -07:00
Stanislav Tkach
ed90e2a450 Add getters to the C API for env, universal compaction options and fifo compaction options (#7501)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7501

Reviewed By: ltamasi

Differential Revision: D24344109

Pulled By: pdillinger

fbshipit-source-id: d9a2b1b1cc8c8d8a96f13b8ae6814380caa10c96
2020-10-16 11:04:01 -07:00
Stanislav Tkach
1a83f5a8ac Expose BackupableDBOptions in the C API (#7550)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7550

Reviewed By: jay-zhuang

Differential Revision: D24315343

Pulled By: ajkr

fbshipit-source-id: fc7855b630a50c00dcb940241942295932732f39
2020-10-14 17:51:47 -07:00
Akanksha Mahajan
7b65666cf1 Update IOTrace operations in stackable_db.h (#7514)
Summary:
Update IOTrace operations in stackabledb.h and also trace few
other IO operations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7514

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D24151202

Pulled By: akankshamahajan15

fbshipit-source-id: 112cd3d2041f8c6398b7b0ba1a783b8c93224d4a
2020-10-14 10:16:15 -07:00
Andrew Kryczka
75d3b6fdf0 Redesign block cache pinning API (#7520)
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520

Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`

Reviewed By: pdillinger

Differential Revision: D24200034

Pulled By: ajkr

fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
2020-10-11 14:58:24 -07:00
Cheng Chang
12b78e40bd Track WAL in MANIFEST: add option track_and_verify_wals_in_manifest (#7275)
Summary:
This option determines whether WALs will be tracked in MANIFEST and verified on recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7275

Test Plan:
db_options_test
options_test

Reviewed By: pdillinger

Differential Revision: D23181418

Pulled By: cheng-chang

fbshipit-source-id: 5dd1cdc166f3dfc1c93c094df4a2f7734e3b4547
2020-10-09 16:42:19 -07:00
Akanksha Mahajan
9dd25487cc Update release history 6.14 (#7525)
Summary:
Update release history for 6.14

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7525

Test Plan: No code change

Reviewed By: jay-zhuang

Differential Revision: D24224690

Pulled By: akankshamahajan15

fbshipit-source-id: 95441aefde96672fea5a6af5d7e67cdafb1ebdd2
2020-10-09 16:05:22 -07:00
Akanksha Mahajan
38d0a365e3 Add Stats for MultiGet (#7366)
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
    1. Number of index and filter blocks read from file as part of MultiGet
    request per level.
    2. Number of data blocks read from file per level.
    3. Number of SST files loaded from file system per level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366

Reviewed By: anand1976

Differential Revision: D24127040

Pulled By: akankshamahajan15

fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
2020-10-07 13:28:48 -07:00
Peter Dillinger
9082771b86 Add is_full_compaction to CompactionJobStats, cleanup (#7451)
Summary:
This exposes to the listener interface whether a compaction was
full or not. Also cleaned up API comment for CompactionJobInfo::stats,
which is not of a nullable type. And since CompactionJob is always
created with non-null CompactionJobStats, removed conditionals on it
being nullptr and instead assert non-null.

TODO later: update C and Java interfaces

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451

Test Plan: updated existing unit tests to check new field, make check

Reviewed By: ltamasi

Differential Revision: D23977796

Pulled By: pdillinger

fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d
2020-10-01 12:52:58 -07:00
sdong
7508175558 Introduce options.check_flush_compaction_key_order (#7467)
Summary:
Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated.
Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467

Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled.

Reviewed By: riversand963

Differential Revision: D24010683

fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c
2020-10-01 10:10:26 -07:00
Peter Dillinger
ddbc5dad05 Enable force_consistency_checks by default (#7446)
Summary:
This has been running in production on some key workloads, so
we believe it to be safe and extremely low cost. Nevertheless, I've
added code to ensure that "force_consistency_checks" is mentioned in
any corruption reports so that people know how to disable in case of
false positive corruption reports.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7446

Test Plan:
make check, CI, temporary debug print new message with
./version_builder_test

Reviewed By: ajkr

Differential Revision: D23972101

Pulled By: pdillinger

fbshipit-source-id: 9623e400f3752577c0ecf977e6d0915562cf9968
2020-09-30 11:57:32 -07:00
Akanksha Mahajan
9d212d3f0e Provide users with option to opt-in to get corrupt data in logs/messages (#7420)
Summary:
Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data.
By default value is set false to prevent users data to be exposed in the messages.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420

Test Plan:
1. make check -j64
           2. Add a new test case

Reviewed By: ajkr

Differential Revision: D23835028

Pulled By: akankshamahajan15

fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1
2020-09-29 23:17:45 -07:00
Akanksha Mahajan
9a63bbd391 Add few unit test cases in ASSERT_STATUS_CHECKED build (#7427)
Summary:
Fix few test cases and add them in ASSERT_STATUS_CHECKED build.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7427

Test Plan:
1.  ASSERT_STATUS_CHECKED=1 make -j48 check,
                 2. travis build for ASSERT_STATUS_CHECKED,
                 3. Without ASSERT_STATUS_CHECKED:  make check -j64, CircleCI build and travis build

Reviewed By: pdillinger

Differential Revision: D23909983

Pulled By: akankshamahajan15

fbshipit-source-id: 42d7e4aea972acb9fcddb7ca73fcb82f93272434
2020-09-24 21:48:57 -07:00
Zhichao Cao
0ce9b3a22d Add AppendWithVerify and PositionedAppendWithVerify to Env and FileSystem (#7419)
Summary:
Add new AppendWithVerify and PositionedAppendWithVerify APIs to Env and FileSystem to bring the data verification information (data checksum information) from upper layer (e.g., WritableFileWriter) to the storage layer. This PR only include the API definition, no functional codes are added to unblock other developers which depend on these APIs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7419

Test Plan: make -j32

Reviewed By: pdillinger

Differential Revision: D23883196

Pulled By: zhichao-cao

fbshipit-source-id: 94676c26bc56144cc32e3661f84f21eccd790411
2020-09-23 19:02:26 -07:00
Peter Dillinger
31d1cea4f3 Add and fix clang -Wshift-sign-overflow (#7431)
Summary:
This option is apparently used by some teams within Facebook
(internal ref T75998621)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7431

Test Plan: USE_CLANG=1 make check before (fails) and after

Reviewed By: jay-zhuang

Differential Revision: D23876584

Pulled By: pdillinger

fbshipit-source-id: abb8b67a1f1aac75327944d266e284b2b6727191
2020-09-23 15:26:17 -07:00
Peter Dillinger
9d8eb77c4d Less I/O for incremental backups, slightly better corruption detection (#7413)
Summary:
Two relatively simple functional changes to incremental backup
behavior, integrated with a minor refactoring to reduce code redundancy and
improve error/log message. There are nuances to the impact of these changes,
but I believe they are fundamentally good and generally safe. Those functional
changes:

* Incremental backups no longer read DB table files that are already saved to a
shared part of the backup directory, unless `share_files_with_checksum` is used
with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file
checksums are needed to determine file naming.
  * Justification: incremental backups should not need to read the whole DB,
especially without rate limiting. (Although other BackupEngine reads are not
rate limited either, other non-trivial reads are generally limited by a
corresponding write, as in copying files.) Also, the fact that this is not
already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110.

* When considering whether a table file is already backed up in a shared part
of backup directory, BackupEngine would already query the sizes of source (DB)
and pre-existing destination (backup) files. BackupEngine now uses these file
sizes to detect corruption, as at least one of (a) old backup, (b) backup in
progress, or (c) current DB is corrupt if there's a size mismatch.
  * Justification: a random related fix that also helps to cover a small hole
in corruption checking uncovered by the other functional change:
  * For `share_table_files` without "checksum" (not recommended), the other
change regresses in detecting fundamentally unsafe use of this option
combination: when you might generate different versions of same SST file
number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this
regression is greatly mitigated by the new file size checking. Nevertheless,
almost no reason to use `share_files_with_checksum=false` should remain, and
comments are updated appropriately.

Also, this change renames internal function `CalculateChecksum` to
`ReadFileAndComputeChecksum` to make the performance impact of this function
clear in code reviews.

It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it
cannot be true for a DB with unique file names (like DBImpl). Nevertheless,
I've tried to keep its functionality intact when `true` to minimize risk for
now, despite having no unit tests for which it is true.

Select impact details (much more in unit tests): For
`share_files_with_checksum`, I am confident there is no regression (vs.
pre-6.12) in detecting DB or backup corruption at backup creation time, mostly
because the old design did not leverage this extra checksum computation for
detecting inconsistencies at backup creation time. (With computed checksums in
names, a recently corrupted file just looked like a different file vs. what was
already backed up.)

Even in the hypothetical case of DB session id collision (~100 bits entropy
collision), file size in name and/or our file size check add an extra layer of
protection against false success in creating an accurate new backup. (Unit test
included.)

`DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking
are still able to catch corruptions that `CreateNewBackup` does not. Note that
when custom file checksum support is added to BackupEngine, that will
essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`.
We could add options for `CreateNewBackup` to cover some of what would be
caught by `VerifyBackup` with checksum checking.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413

Test Plan:
Two new unit tests included, both of which fail without these
changes. Although we don't test the I/O improvement directly, we test it
indirectly in DB corruption detection power that was inadvertently unlocked
with new backup file naming PLUS computing current content checksums (now
removed). (I don't think that case of DB corruption detection justifies reading
the whole DB on incremental backup.)

Reviewed By: zhichao-cao

Differential Revision: D23818480

Pulled By: pdillinger

fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac
2020-09-21 16:19:24 -07:00
Peter Dillinger
b475a83f9d Postponing custom checksum support in BackupEngine (#7411)
Summary:
This change reverts BackupEngine to 6.12 state to accommodate a
higher-priority fix that does not easily merge with this custom checksum
support. We intend to reinstate this support soon, by merging a revert
of this change.

For backupable_db_test, I've removed the tests depending on this
feature.

I've also removed relevant HISTORY.md entry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411

Test Plan: unit tests

Reviewed By: ajkr

Differential Revision: D23793835

Pulled By: pdillinger

fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052
2020-09-18 15:27:03 -07:00
Zhichao Cao
c268628c25 Map retryable IO error during Flush without WAL to soft error and no switch memtable during resume (#7310)
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310

Test Plan: adding new unit test, pass make check.

Reviewed By: anand1976

Differential Revision: D23710892

Pulled By: zhichao-cao

fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
2020-09-17 20:25:45 -07:00
Adam Retter
3ac07a12fe RocksJava - Add errorIfLogFileExists parameter to RocksDB.openReadOnly (#7046)
Summary:
Expose from C++ API to Java API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7046

Reviewed By: riversand963

Differential Revision: D23726297

Pulled By: pdillinger

fbshipit-source-id: fc66bf626ce6fe9797e7d021ac849eacab91bf6d
2020-09-17 15:41:25 -07:00
Peter Dillinger
93719fc953 Restore file size in backup table file names (and other cleanup) (#7400)
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.

This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.

We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).

We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.

Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`

Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.

Actual file name patterns and priorities, as regexes:

    kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
      [0-9]+_[0-9]+_[0-9]+[.]sst
    kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
      [0-9]+_[0-9a-fA-F-]+[.]sst
    kUseDbSessionId AND NOT kFlagIncludeFileSize ->
      [0-9]+_s[0-9A-Z]{20}[.]sst
    kUseDbSessionId AND kFlagIncludeFileSize (default) ->
      [0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst

We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.

This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400

Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.

Reviewed By: ajkr

Differential Revision: D23759587

Pulled By: pdillinger

fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
2020-09-17 10:24:22 -07:00
mrambacher
a08d6f18f0 Add more tests to ASSERT_STATUS_CHECKED (#7367)
Summary:
db_options_test
options_file_test
auto_roll_logger_test
options_util_test
persistent_cache_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7367

Reviewed By: jay-zhuang

Differential Revision: D23712520

Pulled By: zhichao-cao

fbshipit-source-id: 99b331e357f5d6a6aabee89d1bd933002cbb3908
2020-09-16 15:48:07 -07:00
mrambacher
67bd5401e9 Changes to EncryptedEnv public API (#7279)
Summary:
Cleaned up the public API to use the EncryptedEnv.  This change will allow providers to be developed and added to the system easier in the future.  It will also allow better integration in the future with the OPTIONS file.

- The internal classes were moved out of the public API into an internal "env_encryption_ctr.h" header.  Short-cut constructors were added to provide the original API functionality.
- The APIs to the constructors were changed to take shared_ptr, rather than raw pointers or references to allow better memory management and alternative implementations.
- CreateFromString methods were added to allow future expansion to other provider and cipher implementations through a standard API.

Additionally, there was a code duplication in the NewXXXFile methods.  This common code was moved under a templatized function.

A first-pass at structuring the code was made to potentially allow multiple EncryptionProviders in a single EncryptedEnv.  The idea was that different providers may use different cipher keys or different versions/algorithms.  The EncryptedEnv should have some means of picking different providers based on information.  The groundwork was started for this (the use of the provider_ member variable was localized) but the work has not been completed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7279

Reviewed By: jay-zhuang

Differential Revision: D23709440

Pulled By: zhichao-cao

fbshipit-source-id: 0e845fff0e03a52603eb9672b4ade32d063ff2f2
2020-09-15 17:14:10 -07:00
mrambacher
7d472accdc Bring the Configurable options together (#5753)
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753

Reviewed By: ajkr

Differential Revision: D23385030

Pulled By: zhichao-cao

fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
2020-09-14 17:01:01 -07:00
anand76
18a3227b12 Add a new IOStatus subcode to indicate that writes are fenced off (#7374)
Summary:
In a distributed file system, directory ownership is enforced by fencing
off the previous owner once they've been preempted by a new owner. This
PR adds a IOStatus subcode for ```StatusCode::IOError``` to indicate this.
Once this error is returned for a file write, the DB is put in read-only
mode and not allowed to resume in read-write mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7374

Test Plan: Add new unit tests in ```error_handler_fs_test```

Reviewed By: riversand963

Differential Revision: D23687777

Pulled By: anand1976

fbshipit-source-id: bef948642089dc0af399057864d9a8ca339e8b2f
2020-09-14 16:04:47 -07:00
Peter Dillinger
ecc8ffe17b Update master to version 6.13 (#7378)
Summary:
for release fork

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7378

Test Plan: make check + CI

Reviewed By: jay-zhuang

Differential Revision: D23669163

Pulled By: pdillinger

fbshipit-source-id: 14cbf95b32717c28418c71cc8e10f06733bbc49f
2020-09-12 13:18:09 -07:00
Levi Tamasi
5ce246c716 Expose the start of the expiration range for TTL blob files through LiveFileMetaData (#7365)
Summary:
The patch adds support for exposing the start of the expiration range
for TTL blob files through the `GetLiveFilesMetaData` API. This can be
used for monitoring purposes, i.e. to make sure TTL blob files are
deleted in a timely manner. The patch also fixes a couple of uninitialized
variable issues.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7365

Test Plan: `make check`

Reviewed By: pdillinger

Differential Revision: D23605465

Pulled By: ltamasi

fbshipit-source-id: 97a9612bf5f4b058423debdd3f28f576bb23a70f
2020-09-10 11:33:33 -07:00
Stanislav Tkach
5c39d8df69 Add getters to the C API for flush, write, cache and compact options (#7321)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7321

Reviewed By: ajkr

Differential Revision: D23590160

fbshipit-source-id: 35d106e732ac37f674222759cdb1dbb31e005ca7
2020-09-09 11:45:27 -07:00
Yanqin Jin
ab202e8d72 Add a new stats level to exclude tickers (#7329)
Summary:
Currently, application may pass a statistics object to db but later
wants to reduce stats tracking overhead by setting stats level to
kExceptHistogramOrTimers (the current lowest level). Tickers will still
be incremented, causing up to 1% CPU. We can add a new lowest stats
level `kExceptTickers` to disable ticker incrementing as well, thus
reducing CPU cycles spent on tickers.

Test Plan (devserver):
```
make check
make clean
DEBUG_LEVEL=0 make db_bench
./db_bench -perf_level=1 -stats_level=0 -statistics -benchmarks=fillseq,readrandom -duration=120
```

Measure CPU util (%) before and after change:
CPU util by rocksdb::RecordTick: 1.1 vs (<0.1)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7329

Reviewed By: pdillinger

Differential Revision: D23434014

Pulled By: riversand963

fbshipit-source-id: 72ff0f02a192ac476d4b0044b9f37fd4a22ff0d4
2020-09-04 23:25:03 -07:00
Andrew Kryczka
5746767387 add ldb unsafe_remove_sst_file subcommand (#7335)
Summary:
This is adapted from https://github.com/facebook/rocksdb/issues/6678 but takes a different approach, avoiding opening a read-write DB and avoiding the `DeleteFile()` API.

First, this PR refactors how options variables are initialized in `ldb` so it can be reused in a subcommand that doesn't open a DB:

- Separated remaining option initialization logic out of `OpenDB()`. The new `PrepareOptions()` function initializes the full options state.
- Fixed an old TODO about applying the subcommand CF option overrides to the proper `ColumnFamilyOptions` object.

Second, this PR adds the `ldb unsafe_remove_sst_file` subcommand. It uses the `VersionSet`-level APIs to remove the file with the specified number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7335

Test Plan: played with interactive python and this file removal command. Verified openability/correct results in case of multiple column families, multiple levels, etc.

Reviewed By: pdillinger

Differential Revision: D23454575

Pulled By: ajkr

fbshipit-source-id: 039b7a8cbfc42fd123dcb25821eef51d61148afe
2020-09-03 16:54:51 -07:00
Andrew Kryczka
40e97b02be add warning on DeleteFile() API (#7337)
Summary:
Since we can't land https://github.com/facebook/rocksdb/issues/7336 until the next major release, added a strong warning against the `DeleteFile()` API in the meantime.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7337

Reviewed By: pdillinger

Differential Revision: D23459728

Pulled By: ajkr

fbshipit-source-id: 326cb9b18190386080c35c761a8736d8a877dafb
2020-09-03 16:42:01 -07:00
Eduardo Barreto Alexandre
5b1ccdc191 Expose rocksdb_open_column_families_with_ttl C function (#7314)
Summary:
This PR creates `rocksdb_open_column_families_with_ttl` which allows C API users to open a DBWithTLL with column families.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7314

Reviewed By: cheng-chang

Differential Revision: D23430287

Pulled By: ajkr

fbshipit-source-id: 307aa21d170d1402653263a91f6f832ef76afba0
2020-09-03 14:39:58 -07:00
Jay Zhuang
c2485f2d81 Add buffer prefetch support for non directIO usecase (#7312)
Summary:
A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312

Reviewed By: anand1976

Differential Revision: D23329847

Pulled By: jay-zhuang

fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527
2020-08-27 18:16:53 -07:00
Peter Dillinger
9aad24da55 Real fix for race in backup custom checksum checking (#7309)
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.

This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.

Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309

Test Plan: updated unit test, before and after

Reviewed By: ajkr

Differential Revision: D23311994

Pulled By: pdillinger

fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
2020-08-26 10:39:20 -07:00
Zhichao Cao
d51f88c9e4 Pass SST file checksum information through OnTableFileCreated (#7108)
Summary:
When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108

Test Plan: make check, listener_test.

Reviewed By: ajkr

Differential Revision: D22470240

Pulled By: zhichao-cao

fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
2020-08-25 10:46:11 -07:00
rockeet
e653af7164 DBWithTTL::Open() param ttls: vector<int32_t> to const vector<int32_t>& (#7196)
Summary:
fix DBWithTTL::Open() param ttls: vector<int32_t> to const vector<int32_t>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7196

Reviewed By: akankshamahajan15

Differential Revision: D23277772

Pulled By: ajkr

fbshipit-source-id: bf69834b5c2062c7e166dab21fbfd40416c7872d
2020-08-24 16:24:16 -07:00
mrambacher
e9befdebbf Add EnvTestWithParam::OptionsTest to the ASSERT_STATUS_CHECKED passes (#7283)
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests.  The DB functionality required for this test now passes the check.

When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283

Reviewed By: akankshamahajan15

Differential Revision: D23251497

Pulled By: ajkr

fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
2020-08-20 19:18:35 -07:00
Stanislav Tkach
b288f0131b Add getters for the read options to the C API (#7289)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7289

Reviewed By: akankshamahajan15

Differential Revision: D23252520

Pulled By: ajkr

fbshipit-source-id: 85cea485a6dcaa1c67c32a83eb49a1b623966609
2020-08-20 16:36:19 -07:00
Levi Tamasi
b9bb59d49d Add initial set of options for integrated blob write path (#7280)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7280

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23195192

Pulled By: ltamasi

fbshipit-source-id: 743b382de391963e62ba86119e9fbd0233ea3b3a
2020-08-18 18:32:37 -07:00
Andrew Kryczka
e6e2f3699c fix doc about kTolerateCorruptedTailRecords recovery (#7270)
Summary:
- Made it clear only one record in the tail is allowed to have a problem
- Added detail about the valid use case instead of calling it legacy behavior

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7270

Reviewed By: riversand963

Differential Revision: D23169075

Pulled By: ajkr

fbshipit-source-id: 2a4b45aa8641f17efa104523fbad765012a98fb0
2020-08-18 09:52:25 -07:00
Yanqin Jin
92593d511a Add a new EntryType for deletion with timestamp (#7195)
Summary:
Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22914704

Pulled By: riversand963

fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39
2020-08-17 16:26:06 -07:00
Daniel Smith
510c66fb75 Typo: s/entires/entries/ (#7248)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7248

Reviewed By: ajkr

Differential Revision: D23108368

Pulled By: jay-zhuang

fbshipit-source-id: 41765334b0bae26bd74a4d9250b9b98ea7ac6c3f
2020-08-14 13:37:23 -07:00
Zitan Chen
b578ca2e4d BackupEngine supports custom file checksums (#7085)
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.

Tests are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22390756

Pulled By: gg814

fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
2020-08-12 13:31:09 -07:00
anand76
832b056a30 Enable IO timeouts for iterators (#7161)
Summary:
Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for
Iterators.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161

Test Plan: New unit tests in db_basic_test

Reviewed By: riversand963

Differential Revision: D22687352

Pulled By: anand1976

fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec
2020-08-07 12:01:08 -07:00
Akanksha Mahajan
493f425e77 Add support to start and end IOTracing through DB APIs (#7203)
Summary:
1. Add support to start io tracing through DB::StartIOTrace(Env*, const TraceOptions&, std::unique_ptr<TraceWriter>&&) and end tracing through DB::EndIOTrace(). This doesn't trace DB::Open.

User side code:

//Open DB
DB::Open(options, dbname, &db);

/* Start tracing */
db->StartIOTrace(env, trace_opt, std::move(trace_writer));

/* Perform Operations */

/*End tracing*/
db->EndIOTrace();

2. Fix the build errors for Windows.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7203

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D22901947

Pulled By: akankshamahajan15

fbshipit-source-id: e59c0b785a802168e6f1aa028d99c224a35cb30c
2020-08-04 18:41:45 -07:00
Levi Tamasi
8cb278d11a Move CompressionType to its own header file (#7162)
Summary:
The patch moves `CompressionType` to its own header file and makes sure
all other public headers include this new header directly, as opposed to
relying on transitive includes or forward declarations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7162

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D22676545

Pulled By: ltamasi

fbshipit-source-id: 01d7a232377a229cbbc373d0ec1bf01dc0b0ce02
2020-08-03 15:49:31 -07:00
Adam Retter
18efd760c5 Add defaults to ReadOptions doc (#7215)
Summary:
Very small improvements to document the defaults.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7215

Reviewed By: zhichao-cao

Differential Revision: D22902286

fbshipit-source-id: a754d172a0d8e4c03754f6f1771d4a693d60a770
2020-08-03 14:34:49 -07:00
Aaron Kabcenell
56ed601df3 Compaction Read/Write Stats by Compaction Type (#7165)
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165

Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:

/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1

Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.

Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking

Reviewed By: ajkr

Differential Revision: D22693050

Pulled By: akabcenell

fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
2020-07-29 13:39:29 -07:00
codingsh
50f206ad84 feat: export SetBackgroundThreads(n, Env::BOTTOM); (#7191)
Summary:
- https://github.com/rust-rocksdb/rust-rocksdb/pull/448

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7191

Reviewed By: riversand963

Differential Revision: D22809066

Pulled By: ajkr

fbshipit-source-id: 036939f9a28cacc3f677c318d1aed97fe5f4f85e
2020-07-29 12:24:13 -07:00
Yanqin Jin
a38f04ac26 Update HISTORY and version for 6.12 release (#7194)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7194

Reviewed By: gg814

Differential Revision: D22810654

Pulled By: riversand963

fbshipit-source-id: 01f13089fa2b7e31b827da3e30c90e5c62c41380
2020-07-29 10:13:21 -07:00
mrambacher
d9d190742c Make env*_test work with ASSERT_STATUS_CHECKED (#7176)
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.

One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176

Reviewed By: cheng-chang

Differential Revision: D22799278

Pulled By: ajkr

fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
2020-07-28 22:59:48 -07:00
codingsh
83ea266b43 export stats_persist_period_sec (#7168)
Summary:
fixed
 - https://github.com/rust-rocksdb/rust-rocksdb/issues/447
 -  https://github.com/rust-rocksdb/rust-rocksdb/pull/448

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7168

Reviewed By: cheng-chang

Differential Revision: D22736013

Pulled By: ajkr

fbshipit-source-id: fdd784aa75d26a367b9108b05ffdd94a2ae117d3
2020-07-28 13:05:34 -07:00
Yanqin Jin
b0279d3869 Header file should not be executable (#7182)
Summary:
As title.
Undo file mode change in https://github.com/facebook/rocksdb/issues/6759 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7182

Reviewed By: ajkr

Differential Revision: D22786166

Pulled By: riversand963

fbshipit-source-id: 696903069acda42f26bbbf1f2875f5a08b761b42
2020-07-28 09:39:13 -07:00
Tomas Kolda
cd4592c220 SST Partitioner interface that allows to split SST files (#6957)
Summary:
SST Partitioner interface that allows to split SST files during compactions.

It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957

Reviewed By: ajkr

Differential Revision: D22461239

fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
2020-07-24 13:44:49 -07:00
Jay Zhuang
b0c5ecd6b3 Make max_subcompactions dynamically changeable (#7159)
Summary:
Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159

Reviewed By: siying

Differential Revision: D22671238

Pulled By: jay-zhuang

fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42
2020-07-22 18:32:52 -07:00
mrambacher
d44cbc5314 Add hash of key/value checks when paranoid_file_checks=true (#7134)
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file.  If the values do not match, the SST file is rejected.

Code put in place for the check for both flush and compaction jobs.  Corresponding test added to corruption_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134

Reviewed By: cheng-chang

Differential Revision: D22646149

fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
2020-07-22 11:04:40 -07:00
Haosen Wen
dbc51adbac Use steady_clock instead of system_clock in FileOperationInfo::TimePoint (#7153)
Summary:
Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem.
`steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153

Test Plan: make check.

Reviewed By: riversand963

Differential Revision: D22654136

Pulled By: roghnin

fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85
2020-07-22 08:55:02 -07:00
Zitan Chen
b923dc720b BackupEngine computes table checksums only once if db session ids are available (#7110)
Summary:
BackupEngine requires computing table checksums twice when backing up table files to the `shared_checksum` directory.

The repeated computation can be avoided by utilizing the db session id stored as a part of the table properties.

Filenames of table files in the `shared_checksum` directory depend on the following conditions:
1. the naming scheme is `kOptionalChecksumAndDbSessionId`,
2. `db_session_id` is not empty,
3. checksum is available in the DB manifest.

If 1,2,3 are satisfied, then the filenames will be of the form `<file_number>_<checksum>_<db_session_id>.sst`.
If 1,2 are satisfied, then the filenames will be of the form `<file_number>_<db_session_id>.sst`.
In all other cases, the filenames are of the form `<file_number>_<checksum>_<size>.sst`.

Additionally, if `kOptionalChecksumAndDbSessionId` is used (and not falling back to `kChecksumAndFileSize`), the `<checksum>` appeared in the filenames is hexadecimally encoded, instead of being plain `uint32_t` value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7110

Test Plan: backupable_db_test and manual tests.

Reviewed By: ajkr

Differential Revision: D22508992

Pulled By: gg814

fbshipit-source-id: 5669f0ea9ad5a097f69f6d87aca4abba15032389
2020-07-21 10:35:40 -07:00
Zhichao Cao
a10f12eda1 Auto resume the DB from Retryable IO Error (#6765)
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765

Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check

Reviewed By: anand1976

Differential Revision: D21916789

Pulled By: zhichao-cao

fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
2020-07-15 11:03:58 -07:00
Akanksha Mahajan
d93bd3ce25 Add FileSystem wrapper classes for IO tracing. (#7002)
Summary:
1. Add the wrapper classes FileSystemTracingWrapper, FSSequentialFileTracingWrapper, FSRandomAccessFileTracingWrapper, FSWritableFileTracingWrapper, FSRandomRWFileTracingWrapper that forward the calls to underlying storage system and then pass the file operation information to IOTracer. IOTracer dumps the record in binary format for tracing.
2. Add the wrapper classes FileSystemPtr, FSSequentialFilePtr, FSRandomAccessFilePtr, FSWritableFilePtr and FSRandomRWFilePtr that overload operator-> and return ptr to underlying storage system or Tracing wrapper class based on enabling/disabling of IO tracing. These classes are added to bypass Tracing Wrapper classes when we disable tracing.
3. Add enums in trace.h that distinguish which options need to be added for different file operations(Read, close, write etc) as part of tracing record.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7002

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D22127897

Pulled By: akankshamahajan15

fbshipit-source-id: 74cff58ce5661c9a3832dfaa52483f3b2d8565e0
2020-07-13 16:36:55 -07:00
Stanislav Tkach
393e486e3e Add getters for options to the C API (#7094)
Summary:
Along with https://github.com/facebook/rocksdb/issues/6925 and https://github.com/facebook/rocksdb/issues/6998, this should add getters for all Options fields except several ones with non-trivial interface (for example rocksdb_options_set_min_level_to_compress).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7094

Reviewed By: riversand963

Differential Revision: D22479800

Pulled By: pdillinger

fbshipit-source-id: d14f305e12cfe268d07e0fe229d55cef299c792a
2020-07-10 14:30:04 -07:00
rafael-aero
712458fc34 Add RestoreDBFromLatestBackup to C API, add new C# package (#7092)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7092

Reviewed By: riversand963

Differential Revision: D22412323

Pulled By: ajkr

fbshipit-source-id: 3fc1c63bb19a8cd2c0ae620800c28f199a7f494b
2020-07-08 11:56:41 -07:00
wenh
226d1f9c73 extend listener callback functions to more file I/O operations (#7055)
Summary:
Currently, `EventListener` in listner.h only have callback functions for file read and write. One may favor extended callback functions for more file I/O operations like flush, sync and close. This PR tries to add those interface and have them called when appropriate throughout the code base.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7055

Test Plan:
Write an experimental listener with those new callback functions with log output in them; run experiments and check logs to see those functions are actually called.
Default test suits `make check` should also be included.

Reviewed By: riversand963

Differential Revision: D22380624

Pulled By: roghnin

fbshipit-source-id: 4121491d45c2c2aae8c255e7998090559a241c6a
2020-07-07 18:21:18 -07:00
Zitan Chen
373d5ac485 BackupEngine verifies table file checksums on creating new backups (#7015)
Summary:
When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory.

After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process.

No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function.

In addition, If `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files.

Corresponding tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22165732

Pulled By: gg814

fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec
2020-07-02 18:15:12 -07:00
Zitan Chen
6a243b3ade Generalize BackupEngine naming option for share_files_with_checksum SSTs and revert BackupEngine::VerifyBackup to check only file sizes by default (#7032)
Summary:
`bool BackupableDBOptions::new_naming_for_backup_files` is updated to `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming`, where `BackupTableNameOption` is an `enum` type with two enumerators `kChecksumAndFileSize` and `kChecksumAndFileSize`. This opens up possibilities of extenting the current naming scheme for backup table files. By default, `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is set to `kChecksumAndDbSessionId`.

Revert `BackupEngine::VerifyBackup` to only check file sizes by default.

Also fix the construction of the `SstFileDumper` in `GetFileDbIdentities` by setting a proper `Env` of the `Options` passed in the constructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7032

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22237763

Pulled By: gg814

fbshipit-source-id: 466902a4e731babd64e30f0e82ca1aa82962e52e
2020-06-30 18:47:16 -07:00
Burton Li
5be2cb6948 Compaction filter support for BlobDB (#6850)
Summary:
Added compaction filter support for BlobDB non-TTL values. Same as vanilla RocksDB, user compaction filter applies to all k/v pairs of the compaction for non-TTL values. It honors `min_blob_size`, which potentially results value transitions between inlined data and stored-in-blob data when size of value is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6850

Reviewed By: siying

Differential Revision: D22263487

Pulled By: ltamasi

fbshipit-source-id: 8fc03f8cde2a5c831e63b436b3dbf1b7f90939e8
2020-06-29 17:32:14 -07:00
Anand Ananthabhotla
9a5886bd8c Extend Get/MultiGet deadline support to table open (#6982)
Summary:
Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.

The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.

Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982

Test Plan: Add new unit tests in db_basic_test

Reviewed By: riversand963

Differential Revision: D22219515

Pulled By: anand1976

fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
2020-06-29 14:53:17 -07:00
Stanislav Tkach
1b85d57cf5 Expose KeyMayExist in the C API (#7021)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7021

Reviewed By: ajkr

Differential Revision: D22246297

Pulled By: pdillinger

fbshipit-source-id: 81dfd0a49e4d5ce0c9f00772c17cca425757ea24
2020-06-29 12:21:53 -07:00
Zitan Chen
1569dc48f5 BackupEngine::VerifyBackup verifies checksum by default (#7014)
Summary:
A parameter `verify_with_checksum` is added to `BackupEngine::VerifyBackup`, which is true by default. So now `BackupEngine::VerifyBackup` verifies backup files with checksum AND file size by default. When `verify_with_checksum` is false, `BackupEngine::VerifyBackup` only compares file sizes to verify backup files.

Also add a test for the case when corruption does not change the file size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7014

Test Plan: Passed backupable_db_test

Reviewed By: zhichao-cao

Differential Revision: D22165590

Pulled By: gg814

fbshipit-source-id: 606a7450714e868bceb38598c89fd356c6004f4f
2020-06-26 11:42:12 -07:00
Stanislav Tkach
70b5d95dc7 Add (more) getters for options to the C API (#6998)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6998

Reviewed By: ajkr

Differential Revision: D22211700

Pulled By: pdillinger

fbshipit-source-id: 1141c20527dee5e13205059bf8e83927063c4c1e
2020-06-25 13:53:33 -07:00
Zitan Chen
be41c61f22 Add a new option for BackupEngine to store table files under shared_checksum using DB session id in the backup filenames (#6997)
Summary:
`BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`.

Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.

Three new test cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997

Test Plan: Passed make check.

Reviewed By: ajkr

Differential Revision: D22098895

Pulled By: gg814

fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76
2020-06-24 19:31:25 -07:00
Yanqin Jin
e66199d848 First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.

Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949

Reviewed By: anand1976

Differential Revision: D22026020

Pulled By: riversand963

fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-24 19:07:08 -07:00
Peter Dillinger
5b2bbacb6f Minimize memory internal fragmentation for Bloom filters (#6427)
Summary:
New experimental option BBTO::optimize_filters_for_memory builds
filters that maximize their use of "usable size" from malloc_usable_size,
which is also used to compute block cache charges.

Rather than always "rounding up," we track state in the
BloomFilterPolicy object to mix essentially "rounding down" and
"rounding up" so that the average FP rate of all generated filters is
the same as without the option. (YMMV as heavily accessed filters might
be unluckily lower accuracy.)

Thus, the option near-minimizes what the block cache considers as
"memory used" for a given target Bloom filter false positive rate and
Bloom filter implementation. There are no forward or backward
compatibility issues with this change, though it only works on the
format_version=5 Bloom filter.

With Jemalloc, we see about 10% reduction in memory footprint (and block
cache charge) for Bloom filters, but 1-2% increase in storage footprint,
due to encoding efficiency losses (FP rate is non-linear with bits/key).

Why not weighted random round up/down rather than state tracking? By
only requiring malloc_usable_size, we don't actually know what the next
larger and next smaller usable sizes for the allocator are. We pick a
requested size, accept and use whatever usable size it has, and use the
difference to inform our next choice. This allows us to narrow in on the
right balance without tracking/predicting usable sizes.

Why not weight history of generated filter false positive rates by
number of keys? This could lead to excess skew in small filters after
generating a large filter.

Results from filter_bench with jemalloc (irrelevant details omitted):

    (normal keys/filter, but high variance)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.6278
    Number of filters: 5516
    Total size (MB): 200.046
    Reported total allocated memory (MB): 220.597
    Reported internal fragmentation: 10.2732%
    Bits/key stored: 10.0097
    Average FP rate %: 0.965228
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.5104
    Number of filters: 5464
    Total size (MB): 200.015
    Reported total allocated memory (MB): 200.322
    Reported internal fragmentation: 0.153709%
    Bits/key stored: 10.1011
    Average FP rate %: 0.966313

    (very few keys / filter, optimization not as effective due to ~59 byte
     internal fragmentation in blocked Bloom filter representation)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.5649
    Number of filters: 162950
    Total size (MB): 200.001
    Reported total allocated memory (MB): 224.624
    Reported internal fragmentation: 12.3117%
    Bits/key stored: 10.2951
    Average FP rate %: 0.821534
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 31.8057
    Number of filters: 159849
    Total size (MB): 200
    Reported total allocated memory (MB): 208.846
    Reported internal fragmentation: 4.42297%
    Bits/key stored: 10.4948
    Average FP rate %: 0.811006

    (high keys/filter)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.7017
    Number of filters: 164
    Total size (MB): 200.352
    Reported total allocated memory (MB): 221.5
    Reported internal fragmentation: 10.5552%
    Bits/key stored: 10.0003
    Average FP rate %: 0.969358
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.7131
    Number of filters: 160
    Total size (MB): 200.928
    Reported total allocated memory (MB): 200.938
    Reported internal fragmentation: 0.00448054%
    Bits/key stored: 10.1852
    Average FP rate %: 0.963387

And from db_bench (block cache) with jemalloc:

    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17063835
    $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17430747
    $ #^ 2.1% additional filter storage
    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8440400
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 21087528
    rocksdb.bloom.filter.useful COUNT : 4963889
    rocksdb.bloom.filter.full.positive COUNT : 1214081
    rocksdb.bloom.filter.full.true.positive COUNT : 1161999
    $ #^ 1.04 % observed FP rate
    $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8448592
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 18220328
    rocksdb.bloom.filter.useful COUNT : 5360933
    rocksdb.bloom.filter.full.positive COUNT : 1321315
    rocksdb.bloom.filter.full.true.positive COUNT : 1262999
    $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters

(Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427

Test Plan: unit test added, 'make check' with gcc, clang and valgrind

Reviewed By: siying

Differential Revision: D22124374

Pulled By: pdillinger

fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e
2020-06-22 13:32:07 -07:00
Matthew Von-Maszewski
1092f19d95 Make EncryptEnv inheritable (#6830)
Summary:
EncryptEnv class is both declared and defined within env_encryption.cc.  This makes it really tough to derive new classes from that base.

This branch moves declaration of the class to rocksdb/env_encryption.h.  The change facilitates making new encryption modules (such as an upcoming openssl AES CTR pull request) possible / easy.

The only coding change was to add the EncryptEnv object to env_basic_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6830

Reviewed By: riversand963

Differential Revision: D21706593

Pulled By: ajkr

fbshipit-source-id: 64d2da95a1569ceeb9b1549c3bec5404cf4c89f0
2020-06-22 13:27:16 -07:00
Zitan Chen
94d04529de Store DB identity and DB session ID in SST files (#6983)
Summary:
`db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.

The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.

In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.

A table property test is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983

Test Plan: make check and some manual tests.

Reviewed By: zhichao-cao

Differential Revision: D22048826

Pulled By: gg814

fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
2020-06-17 10:57:40 -07:00
Andrew Kryczka
742b452863 update minor version for 6.11 release (#6994)
Summary:
The 6.11.fb branch is already cut so I will also backport this PR to
that branch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6994

Reviewed By: riversand963

Differential Revision: D22084532

Pulled By: ajkr

fbshipit-source-id: 0b025f738cc31c65c673cbf89302359e88a34d19
2020-06-16 21:46:05 -07:00
Zitan Chen
88db97b06d Add a DB Session ID (#6959)
Summary:
Added DB::GetDbSessionId by using the same format and machinery as DB::GetDbIdentity.
The DB Session ID is generated (and therefore, updated) each time a DB object is opened. It is written to the LOG file right after the line of “DB SUMMARY”.
A test for the uniqueness, for different openings and during the same opening, is also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6959

Test Plan: Passed make check

Reviewed By: zhichao-cao

Differential Revision: D21951721

Pulled By: gg814

fbshipit-source-id: 958a48a612db49a39998ea703cded45987d3fa8b
2020-06-15 10:47:02 -07:00
Zhen Li
9c24a5cb4d Fix persistent cache on windows (#6932)
Summary:
Persistent cache feature caused rocks db crash on windows. I posted a issue for it, https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for persistent cache. Looking repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix is adding "NewId()" function in "persistent_cache" and using it to generate prefix for persistent cache. In this PR, i also re-enable related test cases defined in "db_test2" and "persistent_cache_test" for windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932

Test Plan:
1. run related test cases in "db_test2" and "persistent_cache_test" on windows and see it passed.
2. manually run db_bench.exe with "read_cache_path" and verified.

Reviewed By: riversand963

Differential Revision: D21911608

Pulled By: cheng-chang

fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2
2020-06-13 13:28:31 -07:00
Cheng Chang
f7613e2a9e Make it able to lower cpu priority to specific level in threadpool (#6969)
Summary:
`Env::LowerThreadPoolCPUPriority` takes a new parameter `CpuPriority` to be able to lower to a specific priority such as `CpuPriority::kIdle`, previously, the priority is always lowered to `CpuPriority::kLow`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6969

Test Plan: unit test `EnvPosixTest::LowerThreadPoolCpuPriority` added to `env_test.cc`.

Reviewed By: siying

Differential Revision: D22011169

Pulled By: cheng-chang

fbshipit-source-id: 568878c24a924912e35cef00c552d4a63431cdf4
2020-06-13 13:25:20 -07:00
Zhichao Cao
b3585a11b4 Ingest SST files with checksum information (#6891)
Summary:
Application can ingest SST files with file checksum information, such that during ingestion, DB is able to check data integrity and identify of the SST file. The PR introduces generate_and_verify_file_checksum to IngestExternalFileOption to control if the ingested checksum information should be verified with the generated checksum.

    1. If generate_and_verify_file_checksum options is *FALSE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enables the SST file checksum and the checksum function name matches the checksum function name in DB, we trust the ingested checksum, store it in Manifest. If the checksum function name does not match, we treat that as an error and fail the IngestExternalFile() call.
    2. If generate_and_verify_file_checksum options is *TRUE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enable the SST file checksum, we will use the checksum generator from DB to calculate the checksum for each ingested SST files after they are copied or moved. Then, compare the checksum results with the ingested checksum information: _A)_ if the checksum function name does not match, _verification always report true_ and we store the DB generated checksum information in Manifest. _B)_ if the checksum function name mach, and checksum match, ingestion continues and stores the checksum information in the Manifest. Otherwise, terminate file ingestion and report file corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6891

Test Plan: added unit test, pass make asan_check

Reviewed By: pdillinger

Differential Revision: D21935988

Pulled By: zhichao-cao

fbshipit-source-id: 7b55f486632db467e76d72602218d0658aa7f6ed
2020-06-11 14:27:36 -07:00
Peter Dillinger
edf74d1cb1 Add --version and --help to ldb and sst_dump (#6951)
Summary:
as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6951

Test Plan: tests included + manual

Reviewed By: ajkr

Differential Revision: D21918540

Pulled By: pdillinger

fbshipit-source-id: 79d4991f2a831214fc7e477a839ec19dbbace6c5
2020-06-09 10:04:01 -07:00
Yanqin Jin
3020df9df5 Remove unnecessary inclusion of version_edit.h in env (#6952)
Summary:
In db_options.c, we should avoid including header files in the `db` directory to avoid introducing unnecessary dependency. The reason why `version_edit.h` has been included in `db_options.cc` is because we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`.

Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952

Reviewed By: zhichao-cao

Differential Revision: D21925341

Pulled By: riversand963

fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2
2020-06-07 21:56:55 -07:00
Akanksha Mahajan
2677bd5967 Add logs and stats in DeleteScheduler (#6927)
Summary:
Add logs and stats for files marked as trash and files deleted immediately in DeleteScheduler
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6927

Test Plan: make check -j64

Reviewed By: riversand963

Differential Revision: D21869068

Pulled By: akankshamahajan15

fbshipit-source-id: e9f673c4fa8049ce648b23c75d742f2f9c6c57a1
2020-06-05 09:43:04 -07:00
Peter Dillinger
c7432cc3c0 Fix more defects reported by Coverity Scan (#6935)
Summary:
Mostly uninitialized values: some probably written before use, but some seem like bugs. Also, destructor needs to be virtual, and possible use-after-free in test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6935

Test Plan: make check

Reviewed By: siying

Differential Revision: D21885484

Pulled By: pdillinger

fbshipit-source-id: e2e7cb0a0cf196f2b55edd16f0634e81f6cc8e08
2020-06-04 15:35:08 -07:00
Zitan Chen
02df00d97b API change: DB::OpenForReadOnly will not write to the file system unless create_if_missing is true (#6900)
Summary:
DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true.

This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly.

Two tests for this updated behavior of DB::OpenForReadOnly are also added.

Other minor changes:
1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly;
2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing];
3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900

Test Plan: passed make check; also manually tested a few ldb subcommands

Reviewed By: pdillinger

Differential Revision: D21822188

Pulled By: gg814

fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e
2020-06-03 18:57:49 -07:00
Stanislav Tkach
b7c825d5cf Add (some) getters for options to the C API (#6925)
Summary:
Additionally I have extended the incomplete test added in the https://github.com/facebook/rocksdb/issues/6880.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6925

Reviewed By: ajkr

Differential Revision: D21869788

Pulled By: pdillinger

fbshipit-source-id: e9db80f259c57ca1bdcbc2c66cb938cb1ac26e48
2020-06-03 17:08:50 -07:00
Anatoly Zhmur
22e5c513c2 Add zstd_max_train_bytes to c interop (#6796)
Summary:
Added setting of zstd_max_train_bytes compression option parameter to c interop.

rocksdb_options_set_bottommost_compression_options was using bool parameter and thus not exported, updated it to unsigned char and added to c.h as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6796

Reviewed By: cheng-chang

Differential Revision: D21611471

Pulled By: ajkr

fbshipit-source-id: caaaf153de934837ad9af283c7f8c025ff0b0cf5
2020-06-03 12:27:12 -07:00
Stanislav Tkach
38f988d3b4 Expose rocksdb_options_copy function to the C API (#6880)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6880

Reviewed By: ajkr

Differential Revision: D21842752

Pulled By: pdillinger

fbshipit-source-id: eda326f551ddd9cb397681544b9e9799ea614e52
2020-06-02 13:48:51 -07:00
Peter Dillinger
14eca6bf04 For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.

It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.

So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.

Also includes miscellaneous comment improvements / clarifications.

Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784

Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...

    [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
    db/db_test.cc:1562: Failure
    Expected: (size) <= (11 * 100), actual: 9478 vs 1100

Other tests updated to reflect consistent accounting of metadata.

Reviewed By: siying

Differential Revision: D21334706

Pulled By: pdillinger

fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
2020-06-02 12:30:23 -07:00
Akanksha Mahajan
bcefc59e9f Allow MultiGet users to limit cumulative value size (#6826)
Summary:
1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key.
2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826

Test Plan:
1. make check -j64
	   2. Add a new unit test case

Reviewed By: anand1976

Differential Revision: D21471483

Pulled By: akankshamahajan15

fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
2020-05-27 13:07:14 -07:00
Peter Dillinger
803a517b48 Misc things for ASSERT_STATUS_CHECKED, also gcc 4.8.5 (#6871)
Summary:
* Print stack trace on status checked failure
* Make folly_synchronization_distributed_mutex_test a parallel test
* Disable ldb_test.py and rocksdb_dump_test.sh with
  ASSERT_STATUS_CHECKED (broken)
* Fix shadow warning in random_access_file_reader.h reported by gcc
  4.8.5 (ROCKSDB_NO_FBCODE), also https://github.com/facebook/rocksdb/issues/6866
* Work around compiler bug on max_align_t for gcc < 4.9
* Remove an apparently wrong comment in status.h
* Use check_some in Travis config (for proper diagnostic output)
* Fix ignored Status in loop in options_helper.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6871

Test Plan: manual, CI

Reviewed By: ajkr

Differential Revision: D21706619

Pulled By: pdillinger

fbshipit-source-id: daf6364173d6689904eb394461a69a11f5bee2cb
2020-05-23 06:53:37 -07:00
Peter Dillinger
c7aedf1b48 Clean up some code related to file checksums (#6861)
Summary:
* Add missing unit test for schema stability of FileChecksumGenCrc32c
  (previously was only comparing to itself)
* A lot of clarifying comments
* Add some assertions for preconditions
* Rename WritableFileWriter::CalculateFileChecksum -> UpdateFileChecksum
* Simplify FileChecksumGenCrc32c with shared functions
* Implement EndianSwapValue to replace unused EndianTransform

And incidentally since I had trouble with 'make check-format' GitHub action disagreeing with local run,
* Output full diagnostic information when 'make check-format' fails in CI
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6861

Test Plan: new unit test passes before & after other changes

Reviewed By: zhichao-cao

Differential Revision: D21667115

Pulled By: pdillinger

fbshipit-source-id: 6a99970f87605aa024fa540c78cd519ff322c3e6
2020-05-21 08:12:51 -07:00
Zhichao Cao
545e14b53b Generate file checksum in SstFileWriter (#6859)
Summary:
If Option.file_checksum_gen_factory is set, rocksdb generates the file checksum during flush and compaction based on the checksum generator created by the factory and store the checksum and function name in vstorage and Manifest.

This PR enable file checksum generation in SstFileWrite and store the checksum and checksum function name in the  ExternalSstFileInfo, such that application can use them for other purpose, for example, ingest the file checksum with files in IngestExternalFile().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6859

Test Plan: add unit test and pass make asan_check.

Reviewed By: ajkr

Differential Revision: D21656247

Pulled By: zhichao-cao

fbshipit-source-id: 78a3570c76031d8832e3d2de3d6c79cdf2b675d0
2020-05-20 11:55:31 -07:00
Cheng Chang
b9d65f5aa6 Trigger compaction in CompactOnDeletionCollector based on deletion ratio (#6806)
Summary:
In level compaction, if the total size (even if compensated after taking account of the deletions) of a level hasn't exceeded the limit, but there are lots of deletion entries in some SST files of the level, these files should also be good candidates for compaction. Otherwise, queries for the deleted keys might be slow because they need to go over all the tombstones.

This PR adds an option `deletion_ratio` to the factory of `CompactOnDeletionCollector` to configure it to trigger compaction when the ratio of tombstones >= `deletion_ratio`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6806

Test Plan:
Added new unit test in `compact_on_deletion_collector_test.cc`.
make compact_on_deletion_collector_test && ./compact_on_deletion_collector_test

Reviewed By: ajkr

Differential Revision: D21511981

Pulled By: cheng-chang

fbshipit-source-id: 65a9d0150e8c9c00337787686475252e4535a3e1
2020-05-18 08:42:05 -07:00
Stanislav Tkach
70aaa9ceeb Expose CancellAllBackgroundWork to C api (#6832)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6832

Reviewed By: cheng-chang

Differential Revision: D21498186

Pulled By: riversand963

fbshipit-source-id: 66bb0d7c06af2bf0df3c6a09b61bca2fb81f2dd3
2020-05-12 14:50:52 -07:00
sdong
a50ea71c00 Improve ldb consistency checks (#6802)
Summary:
When using ldb, users cannot turn on force consistency check in most commands, while they cannot use checksonsistnecy with --try_load_options. The change fixes both by:
1. checkconsistency now calls OpenDB() so that it gets all the options loading and sanitized options logic
2. use options.check_consistency_checks = true by default, and add a --disable_consistency_checks to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802

Test Plan: Add a new unit test. Some manual tests with corrupted DBs.

Reviewed By: pdillinger

Differential Revision: D21388051

fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
2020-05-08 14:17:47 -07:00
Andrew Kryczka
1c84660457 prototype status check enforcement (#6798)
Summary:
Tried making Status object enforce that it is checked in some way. In cases it is not checked, `PermitUncheckedError()` must be called explicitly.

Added a way to run tests (`ASSERT_STATUS_CHECKED=1 make -j48 check`) on a
whitelist. The effort appears significant to get each test to pass with
this assertion, so I only fixed up enough to get one test (`options_test`)
working and added it to the whitelist.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6798

Reviewed By: pdillinger

Differential Revision: D21377404

Pulled By: ajkr

fbshipit-source-id: 73236f9c8df38f01cf24ecac4a6d1661b72d077e
2020-05-08 12:40:43 -07:00
anand76
f286fb344b Include options.h in table.h (#6823)
Summary:
https://github.com/facebook/rocksdb/issues/6389 replaced the #include of options.h in table.h with forward declarations, which is causing some build failures in RocksDB users in 6.10. Remove the forward declarations and #include options.h as recommended by the style guide - https://google.github.io/styleguide/cppguide.html#Forward_Declarations
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6823

Reviewed By: riversand963

Differential Revision: D21464078

Pulled By: anand1976

fbshipit-source-id: 6033ee2544d279690f57bb0db91bc83816cee11d
2020-05-07 15:55:29 -07:00
Andrew Kryczka
e9ba4ba348 validate range tombstone covers positive range (#6788)
Summary:
We found some files containing nothing but negative range tombstones,
and unsurprisingly their metadata specified a negative range, which made
things crash. Time to add a bit of user input validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6788

Reviewed By: zhichao-cao

Differential Revision: D21343719

Pulled By: ajkr

fbshipit-source-id: f1c16e4c3e9fa150958c8c866176632a3206fb74
2020-05-07 11:55:30 -07:00
anand76
c1e1185b7a Update release version to 6.10 (#6797)
Summary:
Update HISTORY.md and version.h to 6.10.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6797

Reviewed By: zhichao-cao

Differential Revision: D21371390

Pulled By: anand1976

fbshipit-source-id: 6017bca24fc5d12076d1ddaec7783c9b85712d42
2020-05-06 16:42:37 -07:00
Cheng Chang
ef0c3eda27 Make users explicitly be aware of prepare before commit (#6775)
Summary:
In current commit protocol of pessimistic transaction, if the transaction is not prepared before commit, the commit protocol implicitly assumes that the user wants to commit without prepare.

This PR adds TransactionOptions::skip_prepare, the default value is `true` because if set to `false`, all existing users who commit without prepare need to update their code to set skip_prepare to true. Although this does not force the user to explicitly express their intention of skip_prepare, it at least lets the user be aware of the assumption of being able to commit without prepare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6775

Test Plan: added a new unit test TransactionTest::CommitWithoutPrepare

Reviewed By: lth

Differential Revision: D21313270

Pulled By: cheng-chang

fbshipit-source-id: 3d95b7c9b2d6cdddc09bdd66c561bc4fae8c3251
2020-04-30 16:24:20 -07:00
sdong
6277e28039 Flag CompressionOptions::parallel_threads to be experimental (#6781)
Summary:
The feature of CompressionOptions::parallel_threads is still not yet mature. Mention it to be experimental in the comments for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6781

Reviewed By: pdillinger

Differential Revision: D21330678

fbshipit-source-id: d7dd7d099fb002a5c6a5d8da689ce5ee08a9eb13
2020-04-30 15:22:06 -07:00
anand76
ab13d43e1d Pass a timeout to FileSystem for random reads (#6751)
Summary:
Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.

For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR.

Tests:
Update existing unit tests to verify the correct timeout value is being passed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751

Reviewed By: riversand963

Differential Revision: D21285631

Pulled By: anand1976

fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
2020-04-30 14:50:39 -07:00
mrambacher
618bf638aa Add Functions to OptionTypeInfo (#6422)
Summary:
Added functions for parsing, serializing, and comparing elements to OptionTypeInfo.  These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map.  Using these functions, every type can be handled via the map rather than special cased.

By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422

Test Plan: pass make check

Reviewed By: siying

Differential Revision: D21269005

Pulled By: zhichao-cao

fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077
2020-04-28 18:04:26 -07:00
Peter Dillinger
b810e62b39 Clarifying comments in db.h (#6768)
Summary:
And fix a confusingly worded log message
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6768

Reviewed By: anand1976

Differential Revision: D21284527

Pulled By: pdillinger

fbshipit-source-id: f03c1422c229a901c3a65e524740452349626164
2020-04-28 15:26:03 -07:00
Albert Hse-Lin Chen
cc8d16efd6 Fixed minor typo in comment for MergeOperator::FullMergeV2() (#6759)
Summary:
Fixed minor typo in comment for FullMergeV2().
Last operand up to snapshot should be +4 instead of +3.

Signed-off-by: Albert Hse-Lin Chen <hselin@kalista.io>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6759

Reviewed By: cheng-chang

Differential Revision: D21260295

Pulled By: zhichao-cao

fbshipit-source-id: cc942306f246c8606538feb30bfdf6df9fb6c54e
2020-04-27 14:44:47 -07:00
Peter Dillinger
249eff0f30 Stats for redundant insertions into block cache (#6681)
Summary:
Since read threads do not coordinate on loading data into block
cache, two threads between Lookup and Insert can end up loading and
inserting the same data. This is particularly concerning with
cache_index_and_filter_blocks since those are hot and more likely to
be race targets if ejected from (or not pre-populated in) the cache.

Particularly with moves toward disaggregated / network storage, the cost
of redundant retrieval might be high, and we should at least have some
hard statistics from which we can estimate impact.

Example with full filter thrashing "cliff":

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
    ...
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 14181
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 476
    rocksdb.block.cache.data.add COUNT : 12749
    rocksdb.block.cache.data.add.redundant COUNT : 18
    rocksdb.block.cache.filter.add COUNT : 1003
    rocksdb.block.cache.filter.add.redundant COUNT : 217
    rocksdb.block.cache.index.add COUNT : 429
    rocksdb.block.cache.index.add.redundant COUNT : 241
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 1182223
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 302728
    rocksdb.block.cache.data.add COUNT : 31425
    rocksdb.block.cache.data.add.redundant COUNT : 12
    rocksdb.block.cache.filter.add COUNT : 795455
    rocksdb.block.cache.filter.add.redundant COUNT : 130238
    rocksdb.block.cache.index.add COUNT : 355343
    rocksdb.block.cache.index.add.redundant COUNT : 172478
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681

Test Plan: Some manual testing (above) and unit test covering key metrics is included

Reviewed By: ltamasi

Differential Revision: D21134113

Pulled By: pdillinger

fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
2020-04-27 13:20:27 -07:00
mrambacher
4cbc19d2a1 Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389)
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings.  There is a mishmash of parameters in use here with more needed in the future.  This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389

Reviewed By: siying

Differential Revision: D21163707

Pulled By: zhichao-cao

fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
2020-04-21 17:38:17 -07:00
anand76
c1ccd6b6af Implement deadline support for MultiGet (#6710)
Summary:
Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This
implementation enforces the deadline in DBImpl, which is fairly high
level. Its best effort and may not check the deadline after every key
lookup, but may do so after a batch of keys.

In subsequent stages, we will extend this to passing a timeout down to the FileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710

Test Plan: Add new unit tests

Reviewed By: riversand963

Differential Revision: D21149158

Pulled By: anand1976

fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e
2020-04-21 14:51:51 -07:00
Peter Dillinger
31da5e34c1 C++20 compatibility (#6697)
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:

* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697

Test Plan: make check and CI

Reviewed By: cheng-chang

Differential Revision: D21020318

Pulled By: pdillinger

fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
2020-04-20 13:24:25 -07:00
Yanqin Jin
243852ec15 Add IsDirectory() to Env and FS (#6711)
Summary:
IsDirectory() is a common API to check whether a path is a regular file or
directory.
POSIX: call stat() and use S_ISDIR(st_mode)
Windows: PathIsDirectoryA() and PathIsDirectoryW()
HDFS: FileSystem.IsDirectory()
Java: File.IsDirectory()
...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6711

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D21053520

Pulled By: riversand963

fbshipit-source-id: 680aadfd8ce982b63689190cf31b3145d5a89e27
2020-04-17 14:39:18 -07:00
Mike Kolupaev
e45673dece Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:40:44 -07:00
anand76
3d6d7bcf17 Log CompactOnDeletionCollectorFactory parameters on DB open (#6686)
Summary:
Log it in the info log to help in troubleshooting. It is logged as follows -
```
2020/04/10-10:51:39.886662 7ffff7fef340                   Options.table_properties_collectors: CompactOnDeletionCollector (Sliding window size = 100 Deletion trigger = 90);
```

Tests:
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6686

Reviewed By: ltamasi

Differential Revision: D21002442

Pulled By: anand1976

fbshipit-source-id: 7adf0dbae7f1febcb00ce61fea5097118ede5c6a
2020-04-13 19:58:04 -07:00
Zhichao Cao
38dfa406ff Add NewFileChecksumGenCrc32cFactory to file checksum (#6688)
Summary:
Add NewFileChecksumGenCrc32cFactory to file checksum public interface such that applications can use the build in crc32 checksum factory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6688

Test Plan: pass make asan_check

Reviewed By: riversand963

Differential Revision: D21006859

Pulled By: zhichao-cao

fbshipit-source-id: ea8a45196a8b77c310728ab05f6cc0f49f3baef0
2020-04-13 19:13:41 -07:00
Andrew Kryczka
f08630b914 explicitly mark backup interfaces non-extensible (#6654)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6654

Reviewed By: cheng-chang

Differential Revision: D20878094

Pulled By: ajkr

fbshipit-source-id: 94d2561bdb6ffb7fe3773ca07d475337600a5b57
2020-04-10 10:51:09 -07:00
Huisheng Liu
9e89ffb776 make iterator return versions between timestamp bounds (#6544)
Summary:
(Based on Yanqin's idea) Add a new field in readoptions as lower timestamp bound for iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serves as a bounded time window. The iterator returns all versions of a record falling in the window.

SeekRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit e860f8840):
seekrandom   : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found)
This PR:
seekrandom   : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found)

db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544

Reviewed By: ltamasi

Differential Revision: D20844069

Pulled By: riversand963

fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74
2020-04-10 09:51:58 -07:00
Kirill Abrosimov
3ff603171d added new functions to c-api (#5630)
Summary:
Few functions from options added to C-api
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5630

Reviewed By: anand1976

Differential Revision: D20896731

Pulled By: ltamasi

fbshipit-source-id: e4215a58b3c2429ec44e3f0d0381cbf86700fb14
2020-04-07 14:45:39 -07:00
Yi Wu
2b02ea25e2 Add counter in perf_context to time cipher time (#6596)
Summary:
Add `encrypt_data_time` and `decrypt_data_time` perf_context counters to time encryption/decryption time when `EnvEncryption` is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6596

Test Plan: CI

Reviewed By: anand1976

Differential Revision: D20678617

fbshipit-source-id: 7b57536143aa38509cde011f704de33382169e07
2020-04-01 16:59:35 -07:00
Ziyue Yang
03a781a90c Add pipelined & parallel compression optimization (#6262)
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262

Reviewed By: ajkr

Differential Revision: D20651306

fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
2020-04-01 16:40:18 -07:00
Levi Tamasi
e6f86cfb36 Revert the recent cache deleter change (#6620)
Summary:
Revert "Use function objects as deleters in the block cache (https://github.com/facebook/rocksdb/issues/6545)"

    This reverts commit 6301dbe7a7.

    Revert "Call out the cache deleter related interface change in HISTORY.md (https://github.com/facebook/rocksdb/issues/6606)"

    This reverts commit 3a35542f86.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6620

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D20773311

Pulled By: ltamasi

fbshipit-source-id: 7637a761f718f323ef0e7da959462e8fb06e7a2b
2020-03-31 16:11:06 -07:00
Yanqin Jin
18cf0de640 Use flush time for the props.creation_time for FIFO compaction (#6612)
Summary:
For FIFO compaction, we use flush time instead of oldest key time as the
creation time. This is to prevent FIFO compaction dropping files whose oldest
key time is older than TTL but which has newer keys than TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612

Test Plan: make check

Reviewed By: siying

Differential Revision: D20748217

Pulled By: riversand963

fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663
2020-03-30 18:59:17 -07:00
Zhichao Cao
eaf95c7d1a Update release version to 6.9.0 (#6610)
Summary:
Update release version to 6.9.0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6610

Test Plan: no code change

Reviewed By: riversand963

Differential Revision: D20741094

Pulled By: zhichao-cao

fbshipit-source-id: 80a9e9ea8d164b6923112352d36fcbc1be85c034
2020-03-30 16:31:02 -07:00
Zhichao Cao
e8d332d97e Use FileChecksumGenFactory for SST file checksum (#6600)
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
2020-03-29 15:58:46 -07:00
Cheng Chang
ee50b8d499 Be able to decrease background thread's CPU priority when creating database backup (#6602)
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.

This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602

Test Plan: make check

Reviewed By: siying, zhichao-cao

Differential Revision: D20683216

Pulled By: cheng-chang

fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
2020-03-28 19:07:25 -07:00
Zhichao Cao
4246888101 Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 16:04:43 -07:00
Levi Tamasi
6301dbe7a7 Use function objects as deleters in the block cache (#6545)
Summary:
As the first step of reintroducing eviction statistics for the block
cache, the patch switches from using simple function pointers as deleters
to function objects implementing an interface. This will enable using
deleters that have state, like a smart pointer to the statistics object
that is to be updated when an entry is removed from the cache. For now,
the patch adds a deleter template class `SimpleDeleter`, which simply
casts the `value` pointer to its original type and calls `delete` or
`delete[]` on it as appropriate. Note: to prevent object lifecycle
issues, deleters must outlive the cache entries referring to them;
`SimpleDeleter` ensures this by using the ("leaky") Meyers singleton
pattern.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545

Test Plan: `make asan_check`

Reviewed By: siying

Differential Revision: D20475823

Pulled By: ltamasi

fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a
2020-03-26 16:19:58 -07:00
Peter Dillinger
93b80ca7ba Update default BBTO::format_version from 2 to 4 (#6582)
Summary:
Version 4 has been around long enough, for compatibility and
extensive validation, that it should be default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6582

Test Plan:
CI (w.r.t. changing the default; format_version=4 is well
tested and massively in production at Facebook)

Reviewed By: siying

Differential Revision: D20625233

Pulled By: pdillinger

fbshipit-source-id: 2f83ed874cffa4a39bc7a66cdf3833b978fbb948
2020-03-24 21:22:21 -07:00
Huisheng Liu
a6ce5c823b multiget support for timestamps (#6483)
Summary:
Add timestamp support for MultiGet().
timestamp from readoptions is honored, and timestamps can be returned along with values.

MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 17bef7d3a):
  multireadrandom :     104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found)
This PR:
  multireadrandom :     104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found)

.\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483

Reviewed By: anand1976

Differential Revision: D20498373

Pulled By: riversand963

fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627
2020-03-24 11:24:09 -07:00
anand76
a9d168cfd7 Simplify migration to FileSystem API (#6552)
Summary:
The current Env/FileSystem API separation has a couple of issues -
1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation.
2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes.

This PR solves the above issues and simplifies the migration in the following ways -
1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```.
1a. - This also makes it more robust by ensuring that even if RocksDB
  has some stray calls to Env APIs rather than FileSystem, they will go
  through the same object and thus there is no risk of getting out of
  sync.
2. Provide a ```NewCompositeEnv()``` API that can be used to construct a
PosixEnv with a custom FileSystem implementation. This eliminates an
indirection to call Env APIs, and relieves the FileSystem developer of
the burden of having to implement wrappers for the Env APIs.
3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and
```NewLogger()```

Tests:
1. New unit tests
2. make check and make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552

Reviewed By: riversand963

Differential Revision: D20592038

Pulled By: anand1976

fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7
2020-03-23 21:54:21 -07:00
Yanqin Jin
fb09ef05dc Attempt to recover from db with missing table files (#6334)
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.

Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334

Reviewed By: anand1976

Differential Revision: D19778960

Pulled By: riversand963

fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
2020-03-20 19:30:48 -07:00
Cheng Chang
2d9efc9ab2 Cache result of GetLogicalBufferSize in Linux (#6457)
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.

This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.

Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457

Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.

`make check`

Differential Revision: D20131243

Pulled By: cheng-chang

fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
2020-03-11 18:40:05 -07:00
Chao Zhao
4028eba67b Optional sequence number exporting during checkpoint creation (#5528)
Summary:
Add sequence_number_ptr to the checkpoint interface to expose the sequence number during taking the checkpoint. The number will be consistent with the seq # in rocksdb log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5528

Test Plan: make check -j64

Reviewed By: Winger1994

Differential Revision: D16080209

fbshipit-source-id: 6dc3c7680287ee97d673c5e61f89aae1f43e33df
2020-03-10 13:40:18 -07:00
Yanqin Jin
fd1da22111 Support options.max_open_files != -1 with FIFO compaction (#6503)
Summary:
Allow user to specify options.max_open_files != -1 with FIFO compaction.
If max_open_files != -1, not all table files are kept open.
In the past, FIFO style compaction requires all table files to be open in order
to read file creation time from table properties. Later, we added file creation
time to MANIFEST, making it possible to read file creation time without opening
file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503

Test Plan: make check

Differential Revision: D20353758

Pulled By: riversand963

fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296
2020-03-09 18:45:06 -07:00
Yanqin Jin
d93812c9ae Iterator with timestamp (#6255)
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.

Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```

Test plan (dev server)
```
$make check
```

Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255

Differential Revision: D19438227

Pulled By: riversand963

fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
2020-03-06 16:24:27 -08:00
Otto Kekäläinen
f6c2777d95 Fix spelling: commited -> committed (#6481)
Summary:
In most places in the code the variable names are spelled correctly as
COMMITTED but in a couple places not. This fixes them and ensures the
variable is always called COMMITTED everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481

Differential Revision: D20306776

Pulled By: pdillinger

fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022
2020-03-06 12:45:20 -08:00
Zhichao Cao
e62fe50634 Introduce FaultInjectionTestFS to test fault File system instead of Env (#6414)
Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.

A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414

Test Plan: pass make asan_check, pass error_handler_fs_test.

Differential Revision: D20252421

Pulled By: zhichao-cao

fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
2020-03-04 12:35:05 -08:00
Huisheng Liu
904a60ff63 return timestamp from get (#6409)
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.

ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
    base line (commit 72ee067b9):
        101.712 micros/op 314602 ops/sec;   36.0 MB/s (5658999 of 5658999 found)
    This PR:
        100.288 micros/op 319071 ops/sec;   36.5 MB/s (5674999 of 5674999 found)

./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409

Differential Revision: D20200086

Pulled By: riversand963

fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
2020-03-02 16:01:00 -08:00
Cheng Chang
b47a714051 Update release version to 6.8 (#6450)
Summary:
Update release version to 6.8
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6450

Test Plan: no code change

Differential Revision: D20071889

Pulled By: cheng-chang

fbshipit-source-id: 91450aae09b201926469ff32f59ed436366f3b74
2020-02-24 11:44:47 -08:00