Commit Graph

1387 Commits

Author SHA1 Message Date
mrambacher
d057e8326d Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481)
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
 - Added Options/Configurable/Object Registration for TTL and Cassandra variants
 - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char.  Made the delimiter into a configurable option
 - Added tests for new functionality

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481

Reviewed By: zhichao-cao

Differential Revision: D30136050

Pulled By: mrambacher

fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
2021-08-06 08:27:25 -07:00
anand76
c268859aaa Remove corruption error injection in FaultInjectionTestFS (#8616)
Summary:
```FaultInjectionTestFS``` injects various types of read errors in ```FileSystem``` APIs. One type of error is corruption errors, where data is intentionally corrupted or truncated. There is corresponding validation in db_stress to verify that an injected error results in a user visible Get/MultiGet error. However, for corruption errors, its hard to know when a corruption is supposed to be detected by the user request, due to prefetching and, in case of direct IO, padding. This results in false positives. So remove that functionality.

Block checksum validation for Get/MultiGet is confined to ```BlockFetcher```, so we don't lose a lot by disabling this since its a small surface area to test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8616

Reviewed By: zhichao-cao

Differential Revision: D30074422

Pulled By: anand1976

fbshipit-source-id: 6a61fac18f95514c15364b75013799ddf83294df
2021-08-04 15:48:54 -07:00
Merlin Mao
4811115b3e Revert checkpoint fix (#8607)
Summary:
PR https://github.com/facebook/rocksdb/pull/8572 looses custom types in the options file. Need more API changes to fix this issue. Revert this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8607

Reviewed By: ajkr

Differential Revision: D30058289

Pulled By: autopear

fbshipit-source-id: 78f5a154c0bf193e8441bae4a36fa79b95277fd4
2021-08-02 18:29:35 -07:00
Mikhail Golubev
8f52972cf9 Allow to use a string as a delimiter in StringAppendOperator (#8536)
Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536

Reviewed By: mrambacher

Differential Revision: D29962120

Pulled By: zhichao-cao

fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f
2021-08-02 16:50:41 -07:00
mrambacher
ab7f7c9e49 Allow WAL dir to change with db dir (#8582)
Summary:
Prior to this change, the "wal_dir"  DBOption would always be set (defaults to dbname) when the DBOptions were sanitized.  Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.

After this change, the "wal_dir" option is only set under specific circumstances.  Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname.  Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).

Tests were added to the core and ldb to test that a database could be created and renamed without issue.  Additional tests for various permutations of wal_dir were also added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582

Reviewed By: pdillinger, autopear

Differential Revision: D29881122

Pulled By: mrambacher

fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
2021-07-30 12:16:44 -07:00
Yanqin Jin
066b51126d Several simple local code clean-ups (#8565)
Summary:
This PR tries to remove some unnecessary checks as well as unreachable code blocks to
improve readability. An obvious non-public API method naming typo is also corrected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565

Test Plan: make check

Reviewed By: lth

Differential Revision: D29963984

Pulled By: riversand963

fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a
2021-07-30 12:07:49 -07:00
mrambacher
3aee4fbd41 Make EventListener into a Customizable Class (#8473)
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473

Reviewed By: zhichao-cao

Differential Revision: D29901488

Pulled By: mrambacher

fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
2021-07-27 07:47:02 -07:00
Merlin Mao
55f7ded80d Checkpoint dir options fix (#8572)
Summary:
Originally the 2 options `db_log_dir` and `wal_dir` will be reused in a snapshot db since the options files are just copied. By default, if `wal_dir` was not set when a db was created, it is set to the db's dir. Therefore, the snapshot db will use the same WAL dir. If both the original db and the snapshot db write to or delete from the WAL dir, one may modify or delete files which belong to the other. The same applies to `db_log_dir` as well, but as info log files are not copied or linked, it is simpler for this option.

2 arguments are added to `Checkpoint::CreateCheckpoint()`, allowing to override these 2 options.

`wal_dir`:  If the function argument `wal_dir` is empty, or set to the original db location, or the checkpoint location, the snapshot's `wal_dir` option will be updated to the checkpoint location. Otherwise, the absolute path specified in the argument will be used. During checkpointing, live WAL files will be copied or linked the new location, instead of the current WAL dir specified in the original db.

`db_log_dir`: Same as `wal_dir`, but no files will be copied or linked.

A new unit test was added: `CheckpointTest.CheckpointWithOptionsDirsTest`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8572

Test Plan:
New unit test
```
checkpoint_test --gtest_filter="CheckpointTest.CheckpointWithOptionsDirsTest"
```

Output
```
Note: Google Test filter = CheckpointTest.CheckpointWithOptionsDirsTest
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CheckpointTest
[ RUN      ] CheckpointTest.CheckpointWithOptionsDirsTest
[       OK ] CheckpointTest.CheckpointWithOptionsDirsTest (11712 ms)
[----------] 1 test from CheckpointTest (11712 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11713 ms total)
[  PASSED  ] 1 test.
```
This test will fail without this patch. Just modify the code to remove the 2 arguments introduced in this patch in `CreateCheckpoint()`.

Reviewed By: zhichao-cao

Differential Revision: D29832761

Pulled By: autopear

fbshipit-source-id: e6a639b4d674380df82998c0839e79cab695fe29
2021-07-23 11:13:01 -07:00
Drewryz
3b27725245 Fix a minor issue with initializing the test path (#8555)
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555

Reviewed By: ajkr

Differential Revision: D29758399

Pulled By: jay-zhuang

fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
2021-07-23 08:38:45 -07:00
Jay Zhuang
c4a503f3df Fix an race condition during multiple DB opening (#8574)
Summary:
ObjectLibrary is shared between multiple DB instances, the
Register() could have race condition.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8574

Test Plan: pass the failed test

Reviewed By: ajkr

Differential Revision: D29855096

Pulled By: jay-zhuang

fbshipit-source-id: 541eed0bd495d2c963d858d81e7eabf1ba16153c
2021-07-22 13:43:06 -07:00
Zhichao Cao
87e82a41a9 Fix incorrect Status::NoSpace() status check (#8504)
Summary:
If we want to check whether a Status s is NoSpace() or not, we should check the subcode instread of using s==Status::NoSpace(). Fix some of the incorrect check in the ErrorHandler.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8504

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D29601764

Pulled By: zhichao-cao

fbshipit-source-id: cdab56a827891c23746bba9cbb53f169fe35f086
2021-07-20 18:09:51 -07:00
sdong
39a07c9651 DB Stress Reopen write failure to skip WAL (#8548)
Summary:
When DB Stress enables write failure in reopen, WAL files are also created with a wrapper writalbe file which buffers write until fsync. However, crash test currently expects all writes to WAL is persistent. This is at odd with the unsynced bytes dropped. To work it around temporarily, we disable WAL write failure for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8548

Test Plan: Run db_stress. Manual printf to make sure only WAL files are skipped.

Reviewed By: jay-zhuang

Differential Revision: D29745095

fbshipit-source-id: 1879dd2c01abad7879ca243ee94570ec47c347f3
2021-07-16 16:09:33 -07:00
Peter Dillinger
df5dc73bec Don't hold DB mutex for block cache entry stat scans (#8538)
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.

This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)

This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538

Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.

Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.

Before (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     433.219 micros/op 2308 ops/sec;    0.1 MB/s ( transactions:78999 aborts:0)
    rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     448.802 micros/op 2228 ops/sec;    0.1 MB/s ( transactions:75999 aborts:0)
    rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323

Notice the 5s P100 write time.

After (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     303.645 micros/op 3293 ops/sec;    0.1 MB/s ( transactions:98999 aborts:0)
    rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     310.383 micros/op 3221 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918

P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     311.365 micros/op 3211 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     308.395 micros/op 3242 ops/sec;    0.1 MB/s ( transactions:97999 aborts:0)
    rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832

No substantial difference.

Reviewed By: siying

Differential Revision: D29738847

Pulled By: pdillinger

fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 14:13:08 -07:00
longlijian
4e4ec16957 Replace the namespace "rocksdb" to "ROCKSDB_NAMESPACE" (#8531)
Summary:
For more detail can reference the https://github.com/facebook/rocksdb/issues/6433
(https://github.com/facebook/rocksdb/pull/6433)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8531

Reviewed By: siying

Differential Revision: D29717057

Pulled By: ajkr

fbshipit-source-id: 3ccad9501e5612590e54a7cf8c447118f323c7f4
2021-07-15 17:23:39 -07:00
Myth
bbdc4f2e9a Fix a minor issue in checkpoint test case (#8483)
Summary:
A very simple change :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8483

Reviewed By: ajkr

Differential Revision: D29558904

Pulled By: mrambacher

fbshipit-source-id: bbe68c20c861103726cb6231ca3fb8fbe1e5a546
2021-07-12 09:09:09 -07:00
sdong
b1a53db327 FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() to recover… (#8501)
Summary:
… small overwritten files.
If a file is overwritten with renamed and the parent path is not synced, FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() will delete the file. However, RocksDB relies on file renaming to be atomic no matter whether the parent directory is synced or not, and the current behavior breaks the assumption and caused some false positive: https://github.com/facebook/rocksdb/pull/8489

Since the atomic renaming is used in CURRENT files, to fix the problem, in FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync(), we recover the state of overwritten file if the file is small.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8501

Test Plan: Run stress test for a while and see it doesn't break.

Reviewed By: anand1976

Differential Revision: D29594384

fbshipit-source-id: 589b5c2f0a9d2aca53752d7bdb0231efa5b3ae92
2021-07-07 16:23:23 -07:00
Andrew Kryczka
ed8eb436db Move slow valgrind tests behind -DROCKSDB_FULL_VALGRIND_RUN (#8475)
Summary:
Various tests had disabled valgrind due to it slowing down and timing
out (as is the case right now) the CI runs. Where a test was disabled with no comment,
I assumed slowness was the cause. For these tests that were slow under
valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them
behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475

Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly

Reviewed By: jay-zhuang

Differential Revision: D29504843

Pulled By: ajkr

fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7
2021-07-07 11:14:05 -07:00
anand76
df4197ca6e Bypass buffer in TestFSWritableFile if direct IO is enabled (#8490)
Summary:
```TestFSWritableFile``` buffers data in ```Append``` in order to simulate unsynced data loss on crash. This is only required for buffered IO and should be disabled for direct IO. Otherwise, it causes crash tests to assert on the buffer address alignment - ```db_stress: env/io_posix.cc:1194: virtual rocksdb::IOStatus rocksdb::PosixWritableFile::Append(const rocksdb::Slice&, const rocksdb::IOOptions&, rocksdb::IODebugContext*): Assertion `IsSectorAligned(data.data(), GetRequiredBufferAlignment())' failed.```.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8490

Reviewed By: zhichao-cao

Differential Revision: D29565080

Pulled By: anand1976

fbshipit-source-id: 682831fd66ed3b9597caa74fc453e22dfaf9b973
2021-07-06 16:46:16 -07:00
sdong
f33611d5e9 Stress test to inject read failures in DB reopen (#8476)
Summary:
Inject read failures in DB reopen, just as what we do for metadata writes and writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476

Test Plan: Some manual tests and make sure failures are triggered.

Reviewed By: anand1976

Differential Revision: D29507283

fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c
2021-07-06 11:05:27 -07:00
sdong
ba224b75c7 Stress Test to inject write failures in reopen (#8474)
Summary:
Previously Stress can inject metadata write failures when reopening a DB. We extend it to file append too, in the same way.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8474

Test Plan: manually run crash test with various setting and make sure the failures are triggered as expected.

Reviewed By: zhichao-cao

Differential Revision: D29503116

fbshipit-source-id: e73a446e80ccbd09301a579280e56ff949381fab
2021-06-30 16:46:41 -07:00
Zhichao Cao
a904c62d28 Using existing crc32c checksum in checksum handoff for Manifest and WAL (#8412)
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead.

In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412

Test Plan: make check, add new testing cases.

Reviewed By: anand1976

Differential Revision: D29151545

Pulled By: zhichao-cao

fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
2021-06-25 00:47:17 -07:00
Levi Tamasi
3ab0eae860 Deflake BlobDBTest.SnapshotAndGarbageCollection (#8444)
Summary:
This test case has been failing occasionally due to automatic
compactions kicking in, resulting in GC generating additional
blob files that the test did not expect. Disabling automatic
compactions to get rid of this flakiness.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8444

Test Plan: `gtest-parallel --repeat=1000 ./blob_db_test --gtest_filter="BlobDBTest.SnapshotAndGarbageCollection"`

Reviewed By: jay-zhuang

Differential Revision: D29316987

Pulled By: ltamasi

fbshipit-source-id: 9815d189ed7d63890622768675a01990e3680221
2021-06-22 17:34:03 -07:00
Jay Zhuang
f89423a57a Revert "Revert "Snapshot release triggered compaction without multiple tombstones (#8357)" (#8410)" (#8438)
Summary:
This reverts commit 25be1ed66a.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8438

Test Plan: Run the impacted mysql test 40 times

Reviewed By: ajkr

Differential Revision: D29286247

Pulled By: jay-zhuang

fbshipit-source-id: d3bd056971a19a8b012d5d0295fa045c012b3c04
2021-06-22 11:10:03 -07:00
聂佩轩
d53f7ff69a Add DeteleRange support for DBWithTTL (#8384)
Summary:
This commit is for enabling `DBWithTTL` to use `DeteleRange` which it cannot before.
As (int32_t)Timestamp is suffixed to values in `DBWithTTL`, there is no reason that it
cannot use the common used api. I added `DeleteRangeCF` in `DBWithTTLImpl::Write`
so that we can use `DeteleRange` normally. When we run code like
`dbWithTtl->DeleteRange(start, end)`, it executes`WriteBatchInternal::DeleteRange`
internally. Intended to fix https://github.com/facebook/rocksdb/issues/7218

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8384

Test Plan: added corresponded testing logic to existing unit test

Reviewed By: jay-zhuang

Differential Revision: D29176734

fbshipit-source-id: 6874ed979fc08e1d138149d03653e43a75f0e0e6
2021-06-17 16:00:50 -07:00
Andrew Kryczka
25be1ed66a Revert "Snapshot release triggered compaction without multiple tombstones (#8357)" (#8410)
Summary:
This reverts commit 9167ece586.

It was found to reliably trip a compaction picking conflict assertion in a MyRocks unit test. We don't understand why yet so reverting in the meantime.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8410

Test Plan: `make check -j48`

Reviewed By: jay-zhuang

Differential Revision: D29150300

Pulled By: ajkr

fbshipit-source-id: 2de8664f355d6da015e84e5fec2e3f90f49741c8
2021-06-15 18:15:15 -07:00
Zhichao Cao
f44e69c64a Use DbSessionId as cache key prefix when secondary cache is enabled (#8360)
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360

Test Plan: add the test to lru_cache_test

Reviewed By: pdillinger

Differential Revision: D29006215

Pulled By: zhichao-cao

fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
2021-06-10 11:02:43 -07:00
Andrew Kryczka
9167ece586 Snapshot release triggered compaction without multiple tombstones (#8357)
Summary:
This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase.

This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop.

The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are:

- files could have `largest_seqno > 0` forever making it impossible to tell they cannot contain any covering keys
- it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use case actually doesn't use Delete but instead calls Put with empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357

Test Plan: - make check

Reviewed By: jay-zhuang

Differential Revision: D28855340

Pulled By: ajkr

fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798
2021-06-04 00:21:40 -07:00
Peter Dillinger
311a544c2a Use deleters to label cache entries and collect stats (#8297)
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.

The new information can be seen in info LOG, for example:

    Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
    Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)

And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):

    $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
    rocksdb.block-cache-entry-stats.bytes.data-block: 0
    rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.bytes.index-block: 178992
    rocksdb.block-cache-entry-stats.bytes.misc: 0
    rocksdb.block-cache-entry-stats.bytes.other-block: 0
    rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
    rocksdb.block-cache-entry-stats.capacity: 8388608
    rocksdb.block-cache-entry-stats.count.data-block: 0
    rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.count.index-block: 215
    rocksdb.block-cache-entry-stats.count.misc: 1
    rocksdb.block-cache-entry-stats.count.other-block: 0
    rocksdb.block-cache-entry-stats.count.write-buffer: 0
    rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
    rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
    rocksdb.block-cache-entry-stats.percent.misc: 0.000000
    rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
    rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
    rocksdb.block-cache-entry-stats.secs_since_last_collection: 0

Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.

This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.

An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".

Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.

Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.

This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297

Test Plan: manual testing with db_bench, and a couple of basic unit tests

Reviewed By: ltamasi

Differential Revision: D28488721

Pulled By: pdillinger

fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 16:51:13 -07:00
mrambacher
6b0a22a4b0 Fix MultiGet with PinnableSlices and Merge for WBWI (#8299)
Summary:
The MultiGetFromBatchAndDB would fail if the PinnableSlice value being returned was pinned.  This could happen if the value was retrieved from the DB (not memtable) or potentially if the values were reused (and a previous iteration returned a slice that was pinned).

This change resets the pinnable value to clear it prior to attempting to use it, thereby eliminating the problem with the value already being pinned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8299

Reviewed By: jay-zhuang

Differential Revision: D28455426

Pulled By: mrambacher

fbshipit-source-id: a34d7d983ec9b6bb4c8a2b4892f72858d43e6972
2021-05-18 14:35:47 -07:00
anand76
feb06e83b2 Initial support for secondary cache in LRUCache (#8271)
Summary:
Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and evicted from memory, its inserted into the secondary cache. On a LRU cache miss and successful lookup in the secondary cache, the item is promoted to the LRU cache. Only support synchronous lookup currently. The secondary cache would be used to implement a persistent (flash cache) or compressed cache.

Tests:
Results from cache_bench and db_bench don't show any regression due to these changes.

cache_bench results before and after this change -
Command
```./cache_bench -ops_per_thread=10000000 -threads=1```
Before
```Complete in 40.688 s; QPS = 245774```
```Complete in 40.486 s; QPS = 246996```
```Complete in 42.019 s; QPS = 237989```
After
```Complete in 40.672 s; QPS = 245869```
```Complete in 44.622 s; QPS = 224107```
```Complete in 42.445 s; QPS = 235599```

db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
Commands
```./db_bench  --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```

```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
Before
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      80.702 micros/op 198104 ops/sec;   54.4 MB/s (3708999 of 3708999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      87.124 micros/op 183625 ops/sec;   50.4 MB/s (3439999 of 3439999 found)
```
After
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      77.653 micros/op 206025 ops/sec;   56.6 MB/s (3866999 of 3866999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      84.962 micros/op 188299 ops/sec;   51.7 MB/s (3535999 of 3535999 found)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271

Reviewed By: zhichao-cao

Differential Revision: D28357511

Pulled By: anand1976

fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
2021-05-13 22:58:40 -07:00
Andrew Kryczka
d76c46e6a0 Deflake TransactionStressTest.ExpiredTransactionDataRace1 (#8258)
Summary:
We saw the `Commit()` fail with "Operation expired" so apparently the
expiration time is too short. Increased the magnitude of the times in
this test to make flakiness less likely.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8258

Reviewed By: jay-zhuang

Differential Revision: D28177033

Pulled By: ajkr

fbshipit-source-id: 0357acee6cc14c104b6ccd39231a683a606ab130
2021-05-12 15:49:05 -07:00
Peter Dillinger
78a309bf86 New Cache API for gathering statistics (#8225)
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:

* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)

I have enhanced cache_bench to validate this approach:

* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.

To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225

Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.

The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.

Baseline typical output:

```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662

Operation latency (ns):
Count: 80000000 Average: 11223.9494  StdDev: 29.61
Min: 0  Median: 7759.3973  Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[       0,       1 ]       68   0.000%   0.000%
(    2900,    4400 ]       89   0.000%   0.000%
(    4400,    6600 ] 33630240  42.038%  42.038% ########
(    6600,    9900 ] 18129842  22.662%  64.700% #####
(    9900,   14000 ]  7877533   9.847%  74.547% ##
(   14000,   22000 ] 15193238  18.992%  93.539% ####
(   22000,   33000 ]  3037061   3.796%  97.335% #
(   33000,   50000 ]  1626316   2.033%  99.368%
(   50000,   75000 ]   421532   0.527%  99.895%
(   75000,  110000 ]    56910   0.071%  99.966%
(  110000,  170000 ]    16134   0.020%  99.986%
(  170000,  250000 ]     5166   0.006%  99.993%
(  250000,  380000 ]     3017   0.004%  99.996%
(  380000,  570000 ]     1337   0.002%  99.998%
(  570000,  860000 ]      805   0.001%  99.999%
(  860000, 1200000 ]      319   0.000% 100.000%
( 1200000, 1900000 ]      231   0.000% 100.000%
( 1900000, 2900000 ]      100   0.000% 100.000%
( 2900000, 4300000 ]       39   0.000% 100.000%
( 4300000, 6500000 ]       16   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
```

New, gather_stats=false. Median thread ops/sec of 5 runs:

```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458

Operation latency (ns):
Count: 80000000 Average: 11298.1027  StdDev: 42.18
Min: 0  Median: 7722.0822  Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[       0,       1 ]      109   0.000%   0.000%
(    2900,    4400 ]      793   0.001%   0.001%
(    4400,    6600 ] 34054563  42.568%  42.569% #########
(    6600,    9900 ] 17482646  21.853%  64.423% ####
(    9900,   14000 ]  7908180   9.885%  74.308% ##
(   14000,   22000 ] 15032072  18.790%  93.098% ####
(   22000,   33000 ]  3237834   4.047%  97.145% #
(   33000,   50000 ]  1736882   2.171%  99.316%
(   50000,   75000 ]   446851   0.559%  99.875%
(   75000,  110000 ]    68251   0.085%  99.960%
(  110000,  170000 ]    18592   0.023%  99.983%
(  170000,  250000 ]     7200   0.009%  99.992%
(  250000,  380000 ]     3334   0.004%  99.997%
(  380000,  570000 ]     1393   0.002%  99.998%
(  570000,  860000 ]      700   0.001%  99.999%
(  860000, 1200000 ]      293   0.000% 100.000%
( 1200000, 1900000 ]      196   0.000% 100.000%
( 1900000, 2900000 ]       69   0.000% 100.000%
( 2900000, 4300000 ]       32   0.000% 100.000%
( 4300000, 6500000 ]       10   0.000% 100.000%
```

New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:

```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551

Operation latency (ns):
Count: 80000000 Average: 11311.2629  StdDev: 45.28
Min: 0  Median: 7686.5458  Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[       0,       1 ]       71   0.000%   0.000%
(    2900,    4400 ]      291   0.000%   0.000%
(    4400,    6600 ] 34492060  43.115%  43.116% #########
(    6600,    9900 ] 16727328  20.909%  64.025% ####
(    9900,   14000 ]  7845828   9.807%  73.832% ##
(   14000,   22000 ] 15510654  19.388%  93.220% ####
(   22000,   33000 ]  3216533   4.021%  97.241% #
(   33000,   50000 ]  1680859   2.101%  99.342%
(   50000,   75000 ]   439059   0.549%  99.891%
(   75000,  110000 ]    60540   0.076%  99.967%
(  110000,  170000 ]    14649   0.018%  99.985%
(  170000,  250000 ]     5242   0.007%  99.991%
(  250000,  380000 ]     3260   0.004%  99.995%
(  380000,  570000 ]     1599   0.002%  99.997%
(  570000,  860000 ]     1043   0.001%  99.999%
(  860000, 1200000 ]      471   0.001%  99.999%
( 1200000, 1900000 ]      275   0.000% 100.000%
( 1900000, 2900000 ]      143   0.000% 100.000%
( 2900000, 4300000 ]       60   0.000% 100.000%
( 4300000, 6500000 ]       27   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
( 9800000, 14000000 ]        1   0.000% 100.000%

Gather stats latency (us):
Count: 46 Average: 980387.5870  StdDev: 60911.18
Min: 879155  Median: 1033777.7778  Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
(  860000, 1200000 ]       45  97.826%  97.826% ####################
( 1200000, 1900000 ]        1   2.174% 100.000%

Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```

Reviewed By: mrambacher

Differential Revision: D28295742

Pulled By: pdillinger

fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
2021-05-11 16:17:10 -07:00
mrambacher
9f2d255aed Add ObjectRegistry to ConfigOptions (#8166)
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used

The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166

Reviewed By: zhichao-cao

Differential Revision: D27657952

Pulled By: mrambacher

fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
2021-05-11 06:47:22 -07:00
mrambacher
ff463742b5 Add Merge Operator support to WriteBatchWithIndex (#8135)
Summary:
The WBWI has two differing modes of operation dependent on the value
of the constructor parameter `overwrite_key`.
Currently, regardless of the parameter, neither mode performs as
expected when using Merge. This PR remedies this by correctly invoking
the appropriate Merge Operator before returning results from the WBWI.

Examples of issues that exist which are solved by this PR:

## Example 1 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.

## Example 2 with o`verwrite_key=true`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 3 with `overwrite_key=false`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 4 with `overwrite_key=true`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v1')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 5 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 6 with `overwrite_key=true`
Currently, from an empty database, `('k1' -> 'v1')`, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135

Reviewed By: pdillinger

Differential Revision: D27657938

Pulled By: mrambacher

fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
2021-05-10 12:50:25 -07:00
Peter Dillinger
c26b75baa5 Deprecate obsolete "backupable db" from public APIs (#8274)
Summary:
An early design of BackupEngine used stackable DB, so I guess a
DB had to opt-in to being backupable. Unfortunately the naming of that
obsolete design still infects our public API and implementation.

This change fixes the public API, with a deprecated
backward-compatibility header. `BackupableDBOptions` is renamed to
`BackupEngineOptions` (copy-replace in the public header) and
backup_engine.h replaces backupable_db.h (present for backward
compatibility). The only other change in backupable_db.h ->
backup_engine.h is cleaning up headers.

Later changes will fix the internal implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274

Test Plan:
The internal implementation of BackupEngine uses the name
BackupEngineOptions, while the unit tests use the old name
BackupableDBOptions. This gives me confidence that both still work.

Reviewed By: mrambacher

Differential Revision: D28259471

Pulled By: pdillinger

fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613
2021-05-07 13:53:15 -07:00
mrambacher
8948dc8524 Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262)
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions.  This change cleans that up by introducing an ImmutableOptions struct.  Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).

Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR.  All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.

Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262

Reviewed By: pdillinger

Differential Revision: D28226540

Pulled By: mrambacher

fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
2021-05-05 14:00:17 -07:00
Andrew Kryczka
0f42e50fec Fix GetLiveFiles() returning OPTIONS-000000 (#8268)
Summary:
See release note in HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268

Test Plan: unit test repro

Reviewed By: siying

Differential Revision: D28227901

Pulled By: ajkr

fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339
2021-05-05 12:54:46 -07:00
Peter Dillinger
85becd94c1 Refactor: use TableBuilderOptions to reduce parameter lists (#8240)
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.

Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.

Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).

Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240

Test Plan: ASAN make check

Reviewed By: mrambacher

Differential Revision: D28075891

Pulled By: pdillinger

fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
2021-04-29 07:00:50 -07:00
sdong
cde69a7cfd db_stress to add --open_metadata_write_fault_one_in (#8235)
Summary:
DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place.
If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully.
This option is enabled in crash test in half of the time.
Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8235

Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by https://github.com/facebook/rocksdb/pull/8192 and a similar one that fails when fsyncing parent directory.

Reviewed By: anand1976

Differential Revision: D28010944

fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
2021-04-28 10:58:05 -07:00
Adam Retter
2760c2aef8 WBWI Internal Move implementation from .h into .cpp (#8229)
Summary:
Moves some of the structural refactoring from https://github.com/facebook/rocksdb/pull/8135 into this PR.
This just cleans up the code by moving implementation out of the .h file and into the .cc file.

Should be considered for merge before both https://github.com/facebook/rocksdb/pull/7214 and https://github.com/facebook/rocksdb/pull/8135

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8229

Reviewed By: jay-zhuang

Differential Revision: D27999669

Pulled By: mrambacher

fbshipit-source-id: 6eccecbf1f11bb9f5a173e86d1e7bc448bc96071
2021-04-26 09:48:22 -07:00
mrambacher
01e460d538 Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176)
Summary:
This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code.  With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively.

readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).

There are still fields in the ImmutableCFOptions that are not CF options but DB options.  Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions).  But that will be part of a future PR to minimize changes and disruptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176

Reviewed By: pdillinger

Differential Revision: D27954339

Pulled By: mrambacher

fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
2021-04-22 20:43:54 -07:00
Andrew Gallagher
2e5de5a2c3 Cleanup include (#8208)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8208

Make include of "file_system.h" use the same include path as everywhere
else.

Reviewed By: riversand963, akankshamahajan15

Differential Revision: D27881606

fbshipit-source-id: fc1e076229fde21041a813c655ce017b5070c8b3
2021-04-20 14:57:27 -07:00
Yanqin Jin
a376c22066 Handle rename() failure in non-local FS (#8192)
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.

This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.

As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
  MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
  code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
  POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
  new MANIFEST.) Therefore, we keep the new MANIFEST.
    - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
    - If process reopens the db immediately after the failure, then the CURRENT file can point
      to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
      succeed and ignore the other.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D27804648

Pulled By: riversand963

fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
2021-04-19 18:11:13 -07:00
Levi Tamasi
0c6e4674a6 Fix a data race related to DB properties (#8206)
Summary:
Historically, the DB properties `rocksdb.cur-size-active-mem-table`,
`rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables` called
the method `MemTable::ApproximateMemoryUsage` for mutable memtables,
which is not safe without synchronization. This resulted in data races with
memtable inserts. The patch changes the code handling these properties
to use `MemTable::ApproximateMemoryUsageFast` instead, which returns a
cached value backed by an atomic variable. Two test cases had to be updated
for this change. `MemoryTest.MemTableAndTableReadersTotal` was fixed by
increasing the value size used so each value ends up in its own memtable,
which was the original intention (note: the test has been broken in the sense
that the test code didn't consider that memtable sizes below 64 KB get
increased to 64 KB by `SanitizeOptions`, and has been passing only by
accident). `DBTest.MemoryUsageWithMaxWriteBufferSizeToMaintain` relies on
completely up-to-date values and thus was changed to use `ApproximateMemoryUsage`
directly instead of going through the DB properties. Note: this should be safe in this case
since there's only a single thread involved.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8206

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D27866811

Pulled By: ltamasi

fbshipit-source-id: 7bd754d0565e0a65f1f7f0e78ffc093beef79394
2021-04-19 16:38:02 -07:00
Akanksha Mahajan
c377c2ba15 Fix flaky test BackupableDBTest.FileSizeForIncremental (#8197)
Summary:
Test was flaky because for kUseDbSessionId naming, blob files use
naming scheme kLegacyCrc32cAndFileSize. So expected number of files
because of collision can vary. So disabling blobdb for this test case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8197

Reviewed By: pdillinger

Differential Revision: D27836997

Pulled By: akankshamahajan15

fbshipit-source-id: 5eb21a5f4acae3d6b730a9e1b207264fbc18cb80
2021-04-18 16:18:35 -07:00
Justin Chapman
d89483098f Assert unlimited max_open_files for FIFO compaction. (#8172)
Summary:
Resolves https://github.com/facebook/rocksdb/issues/8014

- Add an assertion on `DB::Open` to ensure `db_options.max_open_files` is unlimited if FIFO Compaction is being used.
- This is to align with what the docs mention and to prevent premature data deletion.
- Update tests to work with this assertion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8172

Test Plan:
```bash
$ make check -j$(nproc)

Generated TARGETS Summary:
- 6 libs
- 0 binarys
- 180 tests
```

Reviewed By: ajkr

Differential Revision: D27768792

Pulled By: thejchap

fbshipit-source-id: cf6350535e3a3577fec72bcba75b3c094dc7a6f3
2021-04-14 12:05:47 -07:00
Peter Dillinger
bb75092574 Misc Backup API enhancements (#8170)
Summary:
* CreateNewBackup(WithMetadata) returning the BackupID of new backup
through optional new output param. This is especially useful with the
new mutithreading support, so that you can transactionally determine the
ID of a backup you create.
* GetBackupInfo / GetLatestBackupInfo for individual backups, so that
you don't have to comb through a vector of backups if you don't want to.

Updated HISTORY.md (including re: BlobDB support as new feature)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8170

Test Plan:
Added test logic to existing tests, to minimize increase in
cost of running tests

Reviewed By: zhichao-cao

Differential Revision: D27680410

Pulled By: pdillinger

fbshipit-source-id: 1fc45b73d81aae293ccd4a43d9583d7fd915d3eb
2021-04-12 11:00:47 -07:00
Akanksha Mahajan
d52b520d51 Integrated BlobDB for backup/restore support (#8129)
Summary:
Add support for blob files for backup/restore like table files.
    Since DB session ID is currently not supported for blob files (there is no place to store it in
    the header), so for blob files uses the
    kLegacyCrc32cAndFileSize naming scheme even if
    share_files_with_checksum_naming is set to kUseDbSessionId.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8129

Test Plan: Add new test units

Reviewed By: ltamasi

Differential Revision: D27408510

Pulled By: akankshamahajan15

fbshipit-source-id: b27434d189a639ef3e6ad165c61a143a2daaf06e
2021-04-07 13:38:54 -07:00
Peter Dillinger
a4e82a3cca Fix read-only DB writing to filesystem with write_dbid_to_manifest (#8164)
Summary:
Fixing another crash test failure in the case of
write_dbid_to_manifest=true and reading a backup as read-only DB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8164

Test Plan:
enhanced unit test for backup as read-only DB, ran
blackbox_crash_test more with elevated backup_one_in

Reviewed By: zhichao-cao

Differential Revision: D27622237

Pulled By: pdillinger

fbshipit-source-id: 680d0f99ddb465a601737f2e3f2c80efd47384fb
2021-04-07 10:26:47 -07:00
Peter Dillinger
35af0433cf Fix crash test with backup as read-only DB (#8161)
Summary:
Forgot to re-test crash test after adding read-only filesystem
enforcement to https://github.com/facebook/rocksdb/issues/8142. The problem is ReadOnlyFileSystem would reject
CreateDirIfMissing whenever DBOptions::create_if_missing=true. The fix
that is better for users is to allow CreateDirIfMissing in
ReadOnlyFileSystem if the directory exists, so that they don't cause a
failure on using create_if_missing with opening backups as read-only
DBs. Added this option test to the unit test (in addition to being in the
crash test).

Also fixed a couple of lints.

And some better messaging from 'make format' so that when you run it
with uncommitted changes, it's clear that it's only checking the
uncommitted changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8161

Test Plan: local blackbox_crash_test with amplified backup_one_in

Reviewed By: ajkr

Differential Revision: D27614409

Pulled By: pdillinger

fbshipit-source-id: 63ccb626c7e34c200d61c6bca2a8f60da9015179
2021-04-06 23:31:51 -07:00
Peter Dillinger
879357fdb0 Make backups openable as read-only DBs (#8142)
Summary:
A current limitation of backups is that you don't know the
exact database state of when the backup was taken. With this new
feature, you can at least inspect the backup's DB state without
restoring it by opening it as a read-only DB.

Rather than add something like OpenAsReadOnlyDB to the BackupEngine API,
which would inhibit opening stackable DB implementations read-only
(if/when their APIs support it), we instead provide a DB name and Env
that can be used to open as a read-only DB.

Possible follow-up work:

* Add a version of GetBackupInfo for a single backup.
* Let CreateNewBackup return the BackupID of the newly-created backup.

Implementation details:

Refactored ChrootFileSystem to split off new base class RemapFileSystem,
which allows more general remapping of files. We use this base class to
implement BackupEngineImpl::RemapSharedFileSystem.

To minimize API impact, I decided to just add these fields `name_for_open`
and `env_for_open` to those set by GetBackupInfo when
include_file_details=true. Creating the RemapSharedFileSystem adds a bit
to the memory consumption, perhaps unnecessarily in some cases, but this
has been mitigated by (a) only initialize the RemapSharedFileSystem
lazily when GetBackupInfo with include_file_details=true is called, and
(b) using the existing `shared_ptr<FileInfo>` objects to hold most of the
mapping data.

To enhance API safety, RemapSharedFileSystem is wrapped by new
ReadOnlyFileSystem which rejects any attempts to write. This uncovered a
couple of places in which DB::OpenForReadOnly would write to the
filesystem, so I fixed these. Added a release note because this affects
logging.

Additional minor refactoring in backupable_db.cc to support the new
functionality.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142

Test Plan:
new test (run with ASAN and UBSAN), added to stress test and
ran it for a while with amplified backup_one_in

Reviewed By: ajkr

Differential Revision: D27535408

Pulled By: pdillinger

fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148
2021-04-06 14:37:53 -07:00
Peter Dillinger
96205baa63 Likely fix flaky TableFileCorruptedBeforeBackup (#8151)
Summary:
Before corrupting a file in the DB and expecting corruption to
be detected, open DB read-only to ensure file is not made obsolete by
compaction. Also, to avoid obsolete files not yet deleted, only select
live files to corrupt.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8151

Test Plan: watch CI

Reviewed By: akankshamahajan15

Differential Revision: D27568849

Pulled By: pdillinger

fbshipit-source-id: 39a69a2eafde0482b20a197949d24abe21952f27
2021-04-05 11:40:31 -07:00
Peter Dillinger
ec11c23caa Add thread safety to BackupEngine, explain more (#8115)
Summary:
BackupEngine previously had unclear but strict concurrency
requirements that the API user must follow for safe use. Now we make
that clear, by separating operations into "Read," "Append," and "Write"
operations, and specifying which combinations are safe across threads on
the same BackupEngine object (previously none; now all, using a
read-write lock), and which are safe across different BackupEngine
instances open on the same backup_dir.

The changes to backupable_db.h should be backward compatible. It is
mostly about eliminating copies of what should be the same function and
(unsurprisingly) useful documentation comments were often placed on
only one of the two copies. With the re-organization, we are also
grouping different categories of operations. In the future we might add
BackupEngineReadAppendOnly, but that didn't seem necessary.

To mark API Read operations 'const', I had to mark some implementation
functions 'const' and some fields mutable.

Functional changes:
* Added RWMutex locking around public API functions to implement thread
safety on a single object. To avoid future bugs, this is another
internal class layered on top (removing many "override" in
BackupEngineImpl). It would be possible to allow more concurrency
between operations, rather than mutual exclusion, but IMHO not worth the
work.
* Fixed a race between Open() (Initialize()) and CreateNewBackup() for
different objects on the same backup_dir, where Initialize() could
delete the temporary meta file created during CreateNewBackup().
(This was found by the new test.)

Also cleaned up a couple of "status checked" TODOs, and improved a
checksum mismatch error message to include involved files.

Potential follow-up work:
* CreateNewBackup has an API wart because it doesn't tell you the
BackupID it just created, which makes it of limited use in a multithreaded
setting.
* We could also consider a Refresh() function to catch up to
changes made from another BackupEngine object to the same dir.
* Use a lock file to prevent multiple writer BackupEngines, but this
won't work on remote filesystems not supporting lock files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8115

Test Plan:
new mini-stress test in backup unit tests, run with gcc,
clang, ASC, TSAN, and UBSAN, 100 iterations each.

Reviewed By: ajkr

Differential Revision: D27347589

Pulled By: pdillinger

fbshipit-source-id: 28d82ed2ac672e44085a739ddb19d297dad14b15
2021-03-29 22:41:51 -07:00
Yanqin Jin
d6052d381e Remove duplicate code (#8079)
Summary:
The implementation of TransactionDB::WrapDB() and
TransactionDB::WrapStackableDB() are almost identical, except for the
type of the first argument `db`. This PR adds a new template function in
anonymous namespace, and calls it in the above two functions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8079

Test Plan: make check

Reviewed By: lth

Differential Revision: D27184575

Pulled By: riversand963

fbshipit-source-id: f2855a6db3a7e897d0d611f7050ca4b696c56a7a
2021-03-22 12:29:21 -07:00
Peter Dillinger
3bfd3ed2f3 Begin forward compatibility for new backup meta schema (#8069)
Summary:
This does not add any new public APIs or published
functionality, but adds the ability to read and use (and in tests,
write) backups with a new meta file schema, based on the old schema
but not forward-compatible (before this change). The new schema enables
some capabilities not in the old:

* Explicit versioning, so that users get clean error messages the next
time we want to break forward compatibility.
* Ignoring unrecognized fields (with warning), so that new non-critical
features can be added without breaking forward compatibility.
* Rejecting future "non-ignorable" fields, so that new features critical
to some use-cases could potentially be added outside of linear schema
versions, with broken forward compatibility.
* Fields at the end of the meta file, such as for checksum of the meta
file's contents (up to that point)
* New optional 'size' field for each file, which is checked when present
* Optionally omitting 'crc32' field, so that we aren't required to have
a crc32c checksum for files to take a backup. (E.g. to support backup
via hard links and to better support file custom checksums.)

Because we do not have a JSON parser and to share code, the new schema
is simply derived from the old schema.

BackupEngine code is updated to allow missing checksums in some places,
and to make that easier, `has_checksum` and `verify_checksum_after_work`
are eliminated. Empty `checksum_hex` indicates checksum is unknown. I'm
not too afraid of regressing on data integrity, because
(a) we have pretty good test coverage of corruption detection in backups, and
(b) we are increasingly relying on the DB itself for data integrity rather than
it being an exclusive feature of backups.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8069

Test Plan:
new unit tests, added to crash test (some local run with
boosted backup probability)

Reviewed By: ajkr

Differential Revision: D27139824

Pulled By: pdillinger

fbshipit-source-id: 9e0e4decfb42bb84783d64d2d246456d97e8e8c5
2021-03-19 20:15:40 -07:00
Zhichao Cao
dd0447ae2c Add new Append API with DataVerificationInfo to Env WritableFile (#8071)
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071

Test Plan: make check, added new unit test.

Reviewed By: anand1976

Differential Revision: D27177043

Pulled By: zhichao-cao

fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
2021-03-19 11:44:13 -07:00
Yanqin Jin
7ee41a5d25 Fix a test failure when built with ASSERT_STATUS_CHECKED=1 (#8075)
Summary:
As title.
Test plan
ASSERT_STATUS_CHECKED=1 make -j20 backupable_db_test error_handler_fs_test
./backupable_db_test
./error_handler_fs_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8075

Reviewed By: zhichao-cao

Differential Revision: D27173832

Pulled By: riversand963

fbshipit-source-id: 37dac50f7c89127804ff2572abddd4174642de30
2021-03-18 21:52:48 -07:00
mrambacher
1a343bc393 Make ChRootEnv, EncryptedEnv, and TimedEnv into FileSystems (#7968)
Summary:
These classes were wraps of Env that provided only extensions to the FileSystem functionality.  Changed the classes to be FileSystems and the wraps to be of the CompositeEnvWrapper.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7968

Reviewed By: anand1976

Differential Revision: D26900253

Pulled By: mrambacher

fbshipit-source-id: 94001d8024a3c54a1c11adadca2bac66c3af2a77
2021-03-15 19:50:11 -07:00
mrambacher
3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
Peter Dillinger
589ea6bec2 Add BackupEngine API for backup file details (#8042)
Summary:
This API can be used for things like determining how much space
can be freed up by deleting a particular backup, etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042

Test Plan:
validation of the API added to many existing backup unit
tests

Reviewed By: mrambacher

Differential Revision: D26936577

Pulled By: pdillinger

fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a
2021-03-12 11:03:54 -08:00
Peter Dillinger
4b18c46d10 Refactor: add LineFileReader and Status::MustCheck (#8026)
Summary:
Removed confusing, awkward, and undocumented internal API
ReadOneLine and replaced with very simple LineFileReader.

In refactoring backupable_db.cc, this has the side benefit of
removing the arbitrary cap on the size of backup metadata files.

Also added Status::MustCheck to make it easy to mark a Status as
"must check." Using this, I can ensure that after
LineFileReader::ReadLine returns false the caller checks GetStatus().

Also removed some excessive conditional compilation in status.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8026

Test Plan: added unit test, and running tests with ASSERT_STATUS_CHECKED

Reviewed By: mrambacher

Differential Revision: D26831687

Pulled By: pdillinger

fbshipit-source-id: ef749c265a7a26bb13cd44f6f0f97db2955f6f0f
2021-03-09 20:12:38 -08:00
Peter Dillinger
847ca9f964 Make default share_files_with_checksum=true (#8020)
Summary:
New comment for share_files_with_checksum:
// Only used if share_table_files is set to true. Setting to false is
// DEPRECATED and potentially dangerous because in that case BackupEngine
// can lose data if backing up databases with distinct or divergent
// history, for example if restoring from a backup other than the latest,
// writing to the DB, and creating another backup. Setting to true (default)
// prevents these issues by ensuring that different table files (SSTs) with
// the same number are treated as distinct. See
// share_files_with_checksum_naming and ShareFilesNaming.

I have also removed interim option kFlagMatchInterimNaming, which is no
longer needed and was never needed for correct+compatible operation
(just performance).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020

Test Plan:
tests updated. Backward+forward compatibility verified with
SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup
options, and I manually verified shared_checksum in
/tmp/rocksdb_format_compatible_peterd/bak/current/ after run.

Reviewed By: ajkr

Differential Revision: D26786331

Pulled By: pdillinger

fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130
2021-03-09 16:27:13 -08:00
Akanksha Mahajan
f19612970d Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003)
Summary:
The checkpointing logic supports passing file level checksums
to the copy_file_cb callback function which is used by the backup code
for detecting corruption during file copies.
However, this is currently implemented only for table files.

This PR extends the checksum retrieval to blob files as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003

Test Plan: Add new test units

Reviewed By: ltamasi

Differential Revision: D26680701

Pulled By: akankshamahajan15

fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5
2021-03-01 20:07:07 -08:00
Yanqin Jin
cef4a6c49f Compaction filter support for (new) BlobDB (#7974)
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.

The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D26509280

Pulled By: riversand963

fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
2021-02-25 16:32:35 -08:00
Sergei Petrunia
c9878baa87 Fix an assertion failure in range locking, locktree code. (#7938)
Summary:
Fix this scenario:
trx1> acquire shared lock on $key
trx2> acquire shared lock on the same $key
trx1> attempt to acquire a unique lock on $key.

Lock acquisition will fail, and deadlock detection will start.
It will call iterate_and_get_overlapping_row_locks() which will
produce a list with two locks (shared locks by trx1 and trx2).

However the code in lock_request::build_wait_graph() was not prepared
to find the lock by the same transaction in the list of conflicting
locks. Fix it to ignore it.

(One may suggest to fix iterate_and_get_overlapping_row_locks() to not
include locks by trx1. This is not a good idea, because that function
is also used to report all locks currently held)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7938

Reviewed By: zhichao-cao

Differential Revision: D26529374

Pulled By: ajkr

fbshipit-source-id: d89cbed008db1a97a8f2351b9bfb75310750d16a
2021-02-18 18:15:19 -08:00
Jay Zhuang
59ba104e4a Fix txn MultiGet() return un-committed data with snapshot (#7963)
Summary:
TransactionDB uses read callback to filter out un-committed data before
a snapshot. But `MultiGet()` API doesn't use that at all, which causes
returning unwanted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963

Test Plan: Added unittest to reproduce

Reviewed By: anand1976

Differential Revision: D26455851

Pulled By: jay-zhuang

fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
2021-02-18 08:49:00 -08:00
Levi Tamasi
dab4fe5bcd Add checkpoint support to BlobDB (#7959)
Summary:
The patch adds checkpoint support to BlobDB. Blob files are hard linked or
copied, depending on whether the checkpoint directory is on the same filesystem
or not, similarly to table files.

TODO: Add support for blob files to `ExportColumnFamily` and to the checksum
verification logic used by backup/restore.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7959

Test Plan: Ran `make check` and the crash test for a while.

Reviewed By: riversand963

Differential Revision: D26434768

Pulled By: ltamasi

fbshipit-source-id: 994be55a8dc08133028250760fca440d2c7c4dc5
2021-02-17 12:42:36 -08:00
Zhichao Cao
d1c510baec Handoff checksum Implementation (#7523)
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523

Test Plan: add new unit test, pass make check/ make asan_check

Reviewed By: pdillinger

Differential Revision: D24313271

Pulled By: zhichao-cao

fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
2021-02-10 22:20:32 -08:00
Levi Tamasi
974458891c Revert "Turn on memtable bloom filter by default. (#6584)" (#7939)
Summary:
This reverts commit ee79a28963.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939

Reviewed By: siying

Differential Revision: D26298564

Pulled By: ltamasi

fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd
2021-02-06 22:34:30 -08:00
Andrew Kryczka
8d2bbdd04f Allow range deletions in *TransactionDB only when safe (#7929)
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.

Fixes https://github.com/facebook/rocksdb/issues/7913.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929

Test Plan: unit tests covering the cases it's permitted/rejected

Reviewed By: ltamasi

Differential Revision: D26240254

Pulled By: ajkr

fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
2021-02-05 15:57:26 -08:00
sdong
ee79a28963 Turn on memtable bloom filter by default. (#6584)
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584

Test Plan: Run all existing tests.

Reviewed By: pdillinger

Differential Revision: D20626739

fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
2021-02-05 12:59:46 -08:00
mrambacher
4a09d632c4 Remove Legacy and Custom FileWrapper classes from header files (#7851)
Summary:
Removed the uses of the Legacy FileWrapper classes from the source code.  The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem.

Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use the the CustomEnv class.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851

Reviewed By: anand1976

Differential Revision: D26114816

Pulled By: mrambacher

fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150
2021-01-28 22:10:32 -08:00
mrambacher
12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Levi Tamasi
431e8afba7 Do not explicitly flush blob files when using the integrated BlobDB (#7892)
Summary:
In the original stacked BlobDB implementation, which writes blobs to blob files
immediately and treats blob files as logs, it makes sense to flush the file after
writing each blob to protect against process crashes; however, in the integrated
implementation, which builds blob files in the background jobs, this unnecessarily
reduces performance. This patch fixes this by simply adding a `do_flush` flag to
`BlobLogWriter`, which is set to `true` by the stacked implementation and to `false`
by the new code. Note: the change itself is trivial but the tests needed some work;
since in the new implementation, blobs are now buffered, adding a blob to
`BlobFileBuilder` is no longer guaranteed to result in an actual I/O. Therefore, we can
no longer rely on `FaultInjectionTestEnv` when testing failure cases; instead, we
manipulate the return values of I/O methods directly using `SyncPoint`s.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7892

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D26022814

Pulled By: ltamasi

fbshipit-source-id: b3dce419f312137fa70d84cdd9b908fd5d60d8cd
2021-01-25 13:32:33 -08:00
Adam Retter
d5f5d6579a Fix compilation against musl lib C (#7875)
Summary:
See https://github.com/percona/PerconaFT/pull/450

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7875

Reviewed By: ajkr

Differential Revision: D25938020

Pulled By: jay-zhuang

fbshipit-source-id: 9014dbc7b23bf92c5e63bfbdda4565bb0d2f2b58
2021-01-21 08:39:42 -08:00
Tomas Kolda
d76a8eeef7 Fixing Windows build using CMake (#7854)
Summary:
Builds were not producing Windows binaries properly in 6.15 branch:

```
00:00:46.413 Tests run: 11, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.183 sec <<< FAILURE! - in org.rocksdb.EventListenerTest
00:00:46.414 testAllCallbacksInvocation(org.rocksdb.EventListenerTest)  Time elapsed: 0.012 sec  <<< ERROR!
00:00:46.414 java.lang.UnsatisfiedLinkError: org.rocksdb.test.TestableEventListener.invokeAllCallbacks(J)V
00:00:46.414 	at org.rocksdb.test.TestableEventListener.invokeAllCallbacks(Native Method)
00:00:46.414 	at org.rocksdb.test.TestableEventListener.invokeAllCallbacks(TestableEventListener.java:19)
00:00:46.414 	at org.rocksdb.EventListenerTest.testAllCallbacksInvocation(EventListenerTest.java:436)
```

```
00:00:41.497        "D:\j\workspace\RocksDB_Build_Windows\build\java\rocksdbjni_headers.vcxproj" (default target) (3) ->
00:00:41.497        (CustomBuild target) ->
00:00:41.497          CUSTOMBUILD : error : Could not find class file for 'org.rocksdb.TestableEventListener'. [D:\j\workspace\RocksDB_Build_Windows\build\java\rocksdbjni_headers.vcxproj]
```

Also failed on Linux as library was not initialized yet:

```
00:01:25.103 Running org.rocksdb.NativeComparatorWrapperTest
00:01:25.133 Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.006 sec <<< FAILURE! - in org.rocksdb.NativeComparatorWrapperTest
00:01:25.133 rountrip(org.rocksdb.NativeComparatorWrapperTest)  Time elapsed: 0.002 sec  <<< ERROR!
00:01:25.133 java.lang.UnsatisfiedLinkError: org.rocksdb.NativeComparatorWrapperTest$NativeStringComparatorWrapper.newStringComparator()J
00:01:25.133 	at org.rocksdb.NativeComparatorWrapperTest$NativeStringComparatorWrapper.newStringComparator(Native Method)
00:01:25.133 	at org.rocksdb.NativeComparatorWrapperTest$NativeStringComparatorWrapper.initializeNative(NativeComparatorWrapperTest.java:87)
00:01:25.133 	at org.rocksdb.RocksCallbackObject.<init>(RocksCallbackObject.java:28)
00:01:25.133 	at org.rocksdb.AbstractComparator.<init>(AbstractComparator.java:20)
00:01:25.133 	at org.rocksdb.NativeComparatorWrapper.<init>(NativeComparatorWrapper.java:16)
00:01:25.133 	at org.rocksdb.NativeComparatorWrapperTest$NativeStringComparatorWrapper.<init>(NativeComparatorWrapperTest.java:82)
00:01:25.133 	at org.rocksdb.NativeComparatorWrapperTest.rountrip(NativeComparatorWrapperTest.java:30)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7854

Reviewed By: jay-zhuang

Differential Revision: D25873378

Pulled By: ajkr

fbshipit-source-id: 88afb08bfd30edff31f17da063e636df0769cbfe
2021-01-15 17:53:16 -08:00
Jay Zhuang
a3066ee75c Fix checkpoint_test hang (#7849)
Summary:
`CheckpointTest.CurrentFileModifiedWhileCheckpointing` could hang
because now create checkpoint triggers flush twice. The test should wait
both flush done.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7849

Test Plan: `gtest-parallel ./checkpoint_test --gtest_filter=CheckpointTest.CurrentFileModifiedWhileCheckpointing -r 100`

Reviewed By: ajkr

Differential Revision: D25860713

Pulled By: jay-zhuang

fbshipit-source-id: e1c2f23037dedc33e205519f4289a25e77816b41
2021-01-09 13:26:10 -08:00
Adam Retter
4926b33742 Improvements to Env::GetChildren (#7819)
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html

There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.

Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819

Reviewed By: ajkr

Differential Revision: D25837394

Pulled By: jay-zhuang

fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
2021-01-09 09:44:34 -08:00
Cheng Chang
b2e30bdb67 Get manifest size again after getting min_log_num during checkpoint (#7836)
Summary:
Currently, manifest size is determined before getting min_log_num.

But between getting manifest size and getting min_log_num, concurrently, a flush might succeed, which will write new records to manifest to make some WALs become outdated, then min_log_num will be correspondingly increased, but the new records in manifest will not be copied into the checkpoint because the manifest's size is determined before them, then the newly outdated WALs will still exist in the checkpoint's manifest, but they are not linked/copied to the checkpoint because their log number is < min_log_num, so a corruption of missing WAL will be reported when restoring from the checkpoint.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7836

Test Plan: make crash_test

Reviewed By: ajkr

Differential Revision: D25788204

Pulled By: cheng-chang

fbshipit-source-id: a4e5acf30f08270b3c0a95304ff559a9e655252f
2021-01-07 23:02:55 -08:00
mrambacher
cc2a180d00 Add more tests to the ASC pass list (#7834)
Summary:
Fixed the following  to now pass ASC checks:
* `ttl_test`
* `blob_db_test`
* `backupable_db_test`,
* `delete_scheduler_test`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7834

Reviewed By: jay-zhuang

Differential Revision: D25795398

Pulled By: ajkr

fbshipit-source-id: a10037817deda4fc7cbb353a2e00b62ed89b6476
2021-01-07 15:22:53 -08:00
DreaMer963
8f7b6c8339 fix typo (#7832)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7832

Reviewed By: jay-zhuang

Differential Revision: D25785459

Pulled By: zhichao-cao

fbshipit-source-id: 78658dcb5a5f24141395046f74d7d57f11ad0868
2021-01-06 19:28:38 -08:00
mrambacher
e628f59e87 Create a CustomEnv class; Add WinFileSystem; Make LegacyFileSystemWrapper private (#7703)
Summary:
This PR does the following:
-> Creates a WinFileSystem class.  This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class.  A CustomEnv is an Env that takes a FileSystem as constructor argument.  I believe there will only ever be two implementations of this class (PosixEnv and WinEnv).  There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.

With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv).  These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem.  These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv).  These classes take in an Env and a FileSystem.  The core env functions are re-directed to the wrapped env.  The file system calls are redirected to the input file system
- Legacy Wrapped Env classes.  These classes take in an Env input (but no FileSystem).  The core env functions are re-directed to the wrapped env.  A "Legacy File System" is created using this env and the file system calls directed to the env itself.

With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created.  Any other use of the PosixEnv is via another wrapped env.  This cleans up some of the issues with the env construction and destruction.

Additionally, there were places in the code that required had an Env when they required a FileSystem.  Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem().  These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703

Reviewed By: zhichao-cao

Differential Revision: D25762190

Pulled By: anand1976

fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
2021-01-06 10:49:32 -08:00
mrambacher
c1a65a4de4 Make StringEnv, StringSink, StringSource use FS classes (#7786)
Summary:
Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones.  The StringSink and StringSource classes were changed to be based on the corresponding FS file classes.

Part of a cleanup to use the newer interfaces.  This change also eliminates some of the casts/wrappers to LegacyFile classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786

Reviewed By: jay-zhuang

Differential Revision: D25761460

Pulled By: anand1976

fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f
2021-01-04 16:01:01 -08:00
mrambacher
81367a4616 Eliminate the creation of ImmutableDBOptions in WBWI::GetFromBatch (#6851)
Summary:
1. Made `WriteBatchWithIndexInternal` into a class that stores the `DB*` or `DBOptions*`.

2. Changed the `GetFromBatch` method to be non-static and use an instance of the class.  Added `MergeKey` methods to perform the merge itself and return any status.

This change unifies the multiple calls to the `MergeHelper` under a single wrapped API.

Closes https://github.com/facebook/rocksdb/issues/6683

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6851

Reviewed By: ajkr

Differential Revision: D21706574

Pulled By: pdillinger

fbshipit-source-id: 6860bd64d62669aaa591846e914eed3b674e68b1
2021-01-04 09:05:46 -08:00
cheng-chang
736c6dc59f Disable BasicLockEscalation if cannot determine whether TSAN is enabled (#7814)
Summary:
BasicLockEscalation will cause false-positive warnings under TSAN (this is a known issue in TSAN, see details in https://gist.github.com/spetrunia/77274cf2d5848e0a7e090d622695ed4e), skip this test if TSAN is enabled, or if we are not sure whether TSAN is enabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7814

Test Plan: watch the tsan contrun test to pass.

Reviewed By: zhichao-cao

Differential Revision: D25708094

Pulled By: cheng-chang

fbshipit-source-id: 4fc813ff373301d033d086154cc7bb60a5e95889
2020-12-27 16:18:00 -08:00
mrambacher
55e99688cc No elide constructors (#7798)
Summary:
Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds.  This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it.  In this case,  without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798

Reviewed By: jay-zhuang

Differential Revision: D25680451

Pulled By: pdillinger

fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917
2020-12-23 16:55:53 -08:00
cheng-chang
bdb7e544bd Skip WALs according to MinLogNumberToKeep when creating checkpoint (#7789)
Summary:
In a stress test failure, we observe that a WAL is skipped when creating checkpoint, although its log number >= MinLogNumberToKeep(). This might happen in the following case:

1. when creating the checkpoint, there are 2 column families: CF0 and CF1, and there are 2 WALs: 1, 2;
2. CF0's log number is 1, CF0's active memtable is empty, CF1's log number is 2, CF1's active memtable is not empty, WAL 2 is not empty, the sequence number points to WAL 2;
2. the checkpoint process flushes CF0, since CF0' active memtable is empty, there is no need to SwitchMemtable, thus no new WAL will be created, so CF0's log number is now 2, concurrently, some data is written to CF0 and WAL 2;
3. the checkpoint process flushes CF1, WAL 3 is created and CF1's log number is now 3, CF0's log number is still 2 because CF0 is not empty and WAL 2 contains its unflushed data concurrently written in step 2;
4.  the checkpoint process determines that WAL 1 and 2 are no longer needed according to [live_wal_files[i]->StartSequence() >= *sequence_number](https://github.com/facebook/rocksdb/blob/master/utilities/checkpoint/checkpoint_impl.cc#L388), so it skips linking them to the checkpoint directory;
5. but according to `MinLogNumberToKeep()`, WAL 2 still needs to be kept because CF0's log number is 2.

If the checkpoint is reopened in read-only mode, and only read from the snapshot with the initial sequence number, then there will be no data loss or data inconsistency.

But if the checkpoint is reopened and read from the most recent sequence number, suppose in step 3, there are also data concurrently written to CF1 and WAL 3, then the most recent sequence number refers to the latest entry in WAL 3, so the data written in step 2 should also be visible, but since WAL 2 is discarded, those data are lost.

When tracking WAL in MANIFEST is enabled, when reopening the checkpoint, since WAL 2 is still tracked in MANIFEST as alive, but it's missing from the checkpoint directory, a corruption will be reported.

This PR makes the checkpoint process to only skip a WAL if its log number < `MinLogNumberToKeep`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7789

Test Plan: watch existing tests to pass.

Reviewed By: ajkr

Differential Revision: D25662346

Pulled By: cheng-chang

fbshipit-source-id: 136471095baa01886cf44809455cf855f24857a0
2020-12-23 11:33:26 -08:00
mrambacher
02418194d7 Add more tests for assert status checked (#7524)
Summary:
Added 10 more tests that pass the ASSERT_STATUS_CHECKED test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7524

Reviewed By: akankshamahajan15

Differential Revision: D24323093

Pulled By: ajkr

fbshipit-source-id: 28d4106d0ca1740c3b896c755edf82d504b74801
2020-12-22 23:45:58 -08:00
Sergei Petrunia
daab7603f6 Range Locking: Implementation of range locking (#7506)
Summary:
Range Locking - an implementation based on the locktree library

- Add a RangeTreeLockManager and RangeTreeLockTracker which implement
  range locking using the locktree library.
- Point locks are handled as locks on single-point ranges.
- Add a unit test: range_locking_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7506

Reviewed By: akankshamahajan15

Differential Revision: D25320703

Pulled By: cheng-chang

fbshipit-source-id: f86347384b42ba2b0257d67eca0f45f806b69da7
2020-12-22 19:12:36 -08:00
Adam Retter
81592d9ffa Add more tests to ASSERT_STATUS_CHECKED (4) (#7718)
Summary:
Fourth batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_range_del_test
* db_write_test
* random_access_file_reader_test
* merge_test
* external_sst_file_test
* write_buffer_manager_test
* stringappend_test
* deletefile_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7718

Reviewed By: pdillinger

Differential Revision: D25671608

fbshipit-source-id: 687a794e98a9e0cd5428ead9898ef05ced987c31
2020-12-22 15:09:39 -08:00
Sergei Petrunia
1022090981 Apply the changes from: PS-5501 : Re-license PerconaFT 'locktree' to Apache V2 (#7801)
Summary:
commit d5178f513c0b4144a5ac9358ec0f6a3b54a28e76
Author: George O. Lorch III <george.lorch@percona.com>
Date:   Tue Mar 19 12:18:40 2019 -0700

    PS-5501 : Re-license PerconaFT 'locktree' to Apache V2

    - Fixed some incomplete relicensed files from previous round.

    - Added missing license text to some.

    - Relicensed more files to Apache V2 that locktree depends on.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7801

Reviewed By: jay-zhuang

Differential Revision: D25682430

Pulled By: cheng-chang

fbshipit-source-id: deb8a0de3e76f3638672997bfbd300e2fffbe5f5
2020-12-22 14:47:41 -08:00
Peter Dillinger
4d897e51df Migrate away from Travis+Linux+amd64 (#7791)
Summary:
This disables Linux/amd64 builds in Travis for PRs, and adds a
gcc-10+c++20 build in CircleCI, which should fill out sufficient coverage
vs. what we had in Travis

Fixed a use of std::is_pod, which is deprecated in c++20

Fixed ++ on a volatile in db_repl_stress.cc, with bigger refactoring.
Although ++ on this volatile was probably ok with one thread writer and
one thread reader, the code was still overly complex. There was a
deadcode check for error
`if (replThread.no_read < dataPump.no_records)` which can be proven
never to happen based on the structure of the code. It infinite loops
instead for the case intended to be checked. I just simplified the code
for what should be the same checking power.

Also most configurations seem to be using make parallelism = 2 * vcores,
so fixing / using that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7791

Test Plan:
CI
and `while ./db_repl_stress; do echo again; done` for a while

Reviewed By: siying

Differential Revision: D25669834

Pulled By: pdillinger

fbshipit-source-id: b2c688053d0b1d52c989903449d3cd27a04130d6
2020-12-22 00:20:57 -08:00
Zhichao Cao
04b3524ad0 Inject the random write error to stress test (#7653)
Summary:
Inject the random write error to stress test, it requires set reopen=0 and disable_wal=true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7653

Test Plan: pass db_stress and python3 db_crashtest.py blackbox

Reviewed By: ajkr

Differential Revision: D25354132

Pulled By: zhichao-cao

fbshipit-source-id: 44721104eecb416e27f65f854912c40e301dd669
2020-12-17 11:52:28 -08:00
Akanksha Mahajan
99f5a800c3 Fix clang_analyze error (#7777)
Summary:
Fix clang_analyze error

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7777

Test Plan:
USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j64
analyze

Reviewed By: jay-zhuang

Differential Revision: D25601675

Pulled By: akankshamahajan15

fbshipit-source-id: 30f58cf4d575a2d546c455fb43e856455eb72a07
2020-12-16 21:34:41 -08:00
Adam Retter
8ff6557e7f Add further tests to ASSERT_STATUS_CHECKED (2) (#7698)
Summary:
Second batch of adding more tests to ASSERT_STATUS_CHECKED.

* external_sst_file_basic_test
* checkpoint_test
* db_wal_test
* db_block_cache_test
* db_logical_block_size_cache_test
* db_blob_index_test
* optimistic_transaction_test
* transaction_test
* point_lock_manager_test
* write_prepared_transaction_test
* write_unprepared_transaction_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7698

Reviewed By: cheng-chang

Differential Revision: D25441664

Pulled By: pdillinger

fbshipit-source-id: 9e78867f32321db5d4833e95eb96c5734526ef00
2020-12-09 21:21:16 -08:00
Manuel Ung
71239908cf Invalidate iterator on transaction clear (#7733)
Summary:
Some clients do not close their iterators until after the transaction finishes. To handle this case, we will invalidate any iterators on transaction clear.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7733

Reviewed By: cheng-chang

Differential Revision: D25261158

Pulled By: lth

fbshipit-source-id: b91320f00c54cbe0e6882b794b34f3bb5640dbc0
2020-12-09 19:13:22 -08:00
Sergei Petrunia
98236fb10e LockTree library, originally from PerconaFT (#7753)
Summary:
To be used for implementing Range Locking.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7753

Reviewed By: zhichao-cao

Differential Revision: D25378980

Pulled By: cheng-chang

fbshipit-source-id: 801a9c5cd92a84654ca2586b73e8f69001e89320
2020-12-09 12:10:57 -08:00
Adam Retter
7b2216c906 Add further tests to ASSERT_STATUS_CHECKED (1) (#7679)
Summary:
First batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_iterator_test
* db_memtable_test
* db_merge_operator_test
* db_merge_operand_test
* write_batch_test
* write_batch_with_index_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7679

Reviewed By: ajkr

Differential Revision: D25399270

Pulled By: pdillinger

fbshipit-source-id: 3017d0a686aec5cd2d743fc2acbbf75df239f3ba
2020-12-08 15:55:04 -08:00
Sergei Petrunia
d8bd9fc7b3 Range Locking: Allow different LockManagers, add Range Lock definitions (#7443)
Summary:
This PR has two commits:
1.  Modify the code to allow different Lock Managers (of any kind) to be used.  It is implied that a LockManager uses its own custom LockTracker.
2.  Add definitions for Range Locking (class Endpoint and GetRangeLock() function.

cheng-chang, is this what you've had in mind (should the PR have both item 1 and item 2?)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7443

Reviewed By: zhichao-cao

Differential Revision: D24123172

Pulled By: cheng-chang

fbshipit-source-id: c6548ad6d4cc3c25f68d13b29147bc6fdf357185
2020-12-07 20:18:07 -08:00
Akanksha Mahajan
20c7d7c58a Handling misuse of snprintf return value (#7686)
Summary:
Handle misuse of snprintf return value to avoid Out of bound
read/write.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7686

Test Plan: make check -j64

Reviewed By: riversand963

Differential Revision: D25030831

Pulled By: akankshamahajan15

fbshipit-source-id: 1a1d181c067c78b94d720323ae00b79566b57cfa
2020-12-07 13:43:55 -08:00
Levi Tamasi
61932cdf1d Add blob support to DBIter (#7731)
Summary:
The patch adds iterator support to the integrated BlobDB implementation.
Whenever a blob reference is encountered during iteration, the corresponding
blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
(formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
BlobDB implementation, which has its own blob file handling/blob retrieval logic.)

In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
message `"BlobDB does not support merge operator."` when encountering a
blob reference while performing a merge (instead of potentially returning a
message that implies the database should be opened using the stacked BlobDB's
`Open`.)

TODO: We can implement support for lazily retrieving the blob value (or in other
words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
supports lazy values).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25256293

Pulled By: ltamasi

fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
2020-12-04 21:29:38 -08:00
Andrew Kryczka
1c5f13f2a5 Fail early when merge_operator not configured (#7667)
Summary:
An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667

Reviewed By: riversand963

Differential Revision: D24933360

Pulled By: ajkr

fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f
2020-11-16 20:39:01 -08:00
Cheng Chang
5e794b0841 Fix a recovery corner case (#7621)
Summary:
Consider the following sequence of events:

1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.

It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.

After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621

Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.

Reviewed By: riversand963

Differential Revision: D24668144

Pulled By: cheng-chang

fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
2020-11-07 22:23:27 -08:00
Cheng Chang
da42eceabc Skip fsync in txn tests (#7641)
Summary:
The tests often times out in internal infra, skipping fsync should reduce test time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7641

Test Plan: watch existing tests to pass

Reviewed By: anand1976

Differential Revision: D24765098

Pulled By: cheng-chang

fbshipit-source-id: c62bf8110361aee901918d632cf4772435d05e8d
2020-11-06 14:25:14 -08:00
mrambacher
30beecef8c Return NotFound from TableFactory configuration errors during options loading (#7615)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7615

Reviewed By: riversand963

Differential Revision: D24637054

Pulled By: ajkr

fbshipit-source-id: 7da20d44289eaa2387af4edf8c3c48057425cc1c
2020-10-29 18:44:24 -07:00
mrambacher
7eb2824e3f Revert LoadLatestOptions handling of ignore_unknown_options if versions differ (#7612)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7612

Reviewed By: zhichao-cao

Differential Revision: D24627054

Pulled By: riversand963

fbshipit-source-id: 451b4da742e3e84c7442bc7cc4959d39089b89d0
2020-10-29 13:46:26 -07:00
Yanqin Jin
394210f280 Remove unused includes (#7604)
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D24579392

Pulled By: riversand963

fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
2020-10-28 23:22:27 -07:00
Ramkumar Vadivelu
9a690a74e1 In ParseInternalKey(), include corrupt key info in Status (#7515)
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497

When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`

Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.

Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test

Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515

Reviewed By: ajkr

Differential Revision: D24240264

Pulled By: ramvadiv

fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
2020-10-28 10:12:58 -07:00
mrambacher
f35f7f2704 Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:

1.  The MockFileSystem class was split out from the MockEnv.  This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one).  The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.

2.  Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set.  To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).

3.  Updated the test framework to have a ROCKSDB_GTEST_SKIP macro.  This can be used to flag tests that are skipped.  Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).

I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV,  both, and neither under both MacOS and RedHat.  A few tests were disabled/skipped for the MEM/ENCRYPTED cases.  The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem.  (I will also push a change to disable those tests soon).  There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.

Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.

Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale.  I do not know how to do that, so if someone could write that job, it would be appreciated :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566

Reviewed By: zhichao-cao

Differential Revision: D24408980

Pulled By: jay-zhuang

fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 10:33:09 -07:00
Zhichao Cao
d8ec0a760a Make FileType Public and Replace kLogFile with kWalFile (#7580)
Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580

Test Plan: make check

Reviewed By: riversand963

Differential Revision: D24485420

Pulled By: zhichao-cao

fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
2020-10-22 17:06:20 -07:00
Cheng Chang
5227b315ec Fix unchecked statuses for transaction_test (#7572)
Summary:
When `ASSERT_STATUS_CHECKED` is enabled, `transaction_test` does not pass without this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7572

Test Plan: `ASSERT_STATUS_CHECKED=1 make   -j32 transaction_test && ./transaction_test`

Reviewed By: zhichao-cao

Differential Revision: D24404319

Pulled By: cheng-chang

fbshipit-source-id: 13689035995366ab06d8eada3ea404e45fef8bc5
2020-10-21 14:03:59 -07:00
Cheng Chang
73dbe10bbf Fix write_batch_test when ASSERT_STATUS_CHECKED=1 (#7575)
Summary:
Without this PR, `ASSERT_STATUS_CHECKED=1 make   -j32 write_batch_test && ./write_batch_test` fails.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7575

Test Plan: ASSERT_STATUS_CHECKED=1 make   -j32 write_batch_test && ./write_batch_test

Reviewed By: zhichao-cao

Differential Revision: D24411442

Pulled By: cheng-chang

fbshipit-source-id: f67dc43c44d6afcc6d7e5ff15c6ae9bbf4dfc943
2020-10-20 13:18:41 -07:00
mrambacher
1eda625eab Revert Statuses returned from pre-Configurable options functions (#7563)
Summary:
Further refinement of the earlier PR.  Now the Status is NotFound with a subcode of PathNotFound. Also the existing functions for options parsing/loading are reverted to return InvalidArgument no matter in which way the user-provided arguments are deemed invalid.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7563

Reviewed By: zhichao-cao

Differential Revision: D24422491

Pulled By: ajkr

fbshipit-source-id: ba6b237cd0584d3f925c5ba0d349aeb8c250af67
2020-10-20 11:53:28 -07:00
Cheng Chang
0ea7db768e Abstract out LockManager interface (#7532)
Summary:
In order to be able to introduce more locking protocols, we need to abstract out the locking subsystem in TransactionDB into a set of interfaces.

PR https://github.com/facebook/rocksdb/pull/7013 introduces interface `LockTracker`. This PR is a follow up to take the first step to abstract out a `LockManager` interface.

Further modifications to the interface may be needed when introducing the first implementation of range lock. But the idea here is to put the range lock implementation based on range tree under the `utilities/transactions/lock/range/range_tree`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7532

Test Plan: point_lock_manager_test

Reviewed By: ajkr

Differential Revision: D24238731

Pulled By: cheng-chang

fbshipit-source-id: 2a9458cd8b3fb008d9529dbc4d3b28c24631f463
2020-10-19 10:14:42 -07:00
Levi Tamasi
e8cb32ed67 Introduce BlobFileCache and add support for blob files to Get() (#7540)
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.

TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24260219

Pulled By: ltamasi

fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
2020-10-15 13:04:47 -07:00
mrambacher
a8c89cc969 Test for LoadLatestOptions (#7554)
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist.  Added tests for the LoadOptions related methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554

Reviewed By: akankshamahajan15

Differential Revision: D24298985

Pulled By: zhichao-cao

fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
2020-10-14 22:28:55 -07:00
Akanksha Mahajan
24498ab1ec Add few unit test cases in ASSERT_STATUS_CHECKED (#7500)
Summary:
Add status enforcement for following tests:
  1. import_column_family_test
  2. memory_test
  3. table_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7500

Reviewed By: zhichao-cao

Differential Revision: D24095887

Pulled By: akankshamahajan15

fbshipit-source-id: db8e1ec595852df143fad78a0c07bfdd27dc3c84
2020-10-08 11:22:44 -07:00
Levi Tamasi
1f84611e5d Clean up BlobLogReader and rename it to BlobLogSequentialReader (#7517)
Summary:
The patch does some cleanup in and around the legacy `BlobLogReader` class:
* It renames the class to `BlobLogSequentialReader` to emphasize that it is for
sequentially iterating through blobs in a blob file, as opposed to doing random
point reads using `BlobIndex`es (which is `BlobFileReader`'s jurisdiction).
* It removes some dead code from the old BlobDB implementation that references
`BlobLogReader` (namely the method `BlobFile::OpenRandomAccessReader`).
* It cleans up some `#include`s and forward declarations.
* It fixes some incorrect/outdated comments related to the reader class.
* It adds a few assertions to the `Read` methods of the class.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7517

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24172611

Pulled By: ltamasi

fbshipit-source-id: 43e2ae1eba5c3dd30c1070cb00f217edc45bd64f
2020-10-07 17:48:16 -07:00
Zhichao Cao
b7062f0b2c Status check enforcement for error_handler_fs_test (#7342)
Summary:
Added status check enforcement for error_test_fs_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7342

Test Plan: ASSERT_STATUS_CHECKED=1 make -j48 error_test_fs_test

Reviewed By: akankshamahajan15

Differential Revision: D23972231

Pulled By: zhichao-cao

fbshipit-source-id: fa41bfe440012e0c55f2c9507c1d0104e5e93f84
2020-10-02 16:41:13 -07:00
Ramkumar Vadivelu
e04a50923d Change ParseInternalKey() to return Status instead of bool (#7457)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7430

Change ParseInternalKey() to return Status instead of bool.

db_bench (seekrandom) based before/after results with value size of 100 bytes and 16 bytes can be found at (tests ran on an udb server):
https://www.dropbox.com/s/47bwamdy5ozngph/PIK_ret_Status_results.xlsx?dl=0

![db_bench_results](https://user-images.githubusercontent.com/62277872/94642825-2a21a800-029a-11eb-88f2-124136c83fd3.png)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7457

Reviewed By: ajkr

Differential Revision: D24002433

Pulled By: ramvadiv

fbshipit-source-id: ac253ecf577a29044c47c3fe254a01e71404c44c
2020-09-30 19:16:47 -07:00
Jay Zhuang
e127fe18c3 Fix TSAN failure for backupable_db_test (#7478)
Summary:
It's a transient failure, but can be reproduce with running the test 100
times:
https://app.circleci.com/pipelines/github/facebook/rocksdb/3760/workflows/de909685-f22b-45ba-a8f3-6ebb78a54e96/jobs/37039

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7478

Test Plan: re-run the test 100 times

Reviewed By: ajkr

Differential Revision: D24035758

Pulled By: jay-zhuang

fbshipit-source-id: 6b31983d5c3f7faa8d5481306098513485d0d69d
2020-09-30 17:22:56 -07:00
sdong
d08a9005b7 Make db_basic_test pass assert status checked (#7452)
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452

Test Plan: See CI tests pass.

Reviewed By: zhichao-cao

Differential Revision: D23979764

fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
2020-09-29 09:49:04 -07:00
Levi Tamasi
5e221a98b5 Support injecting read errors for RandomAccessFile when using FaultInjectionTestEnv (#7447)
Summary:
The patch adds support for injecting errors when reading from `RandomAccessFile`
using `FaultInjectionTestEnv`. (This functionality was curiously missing
w/r/t `RandomAccessFile`, even though it was implemented for `RandomRWFile`.)
The patch also fixes up a test case in `blob_db_test` which uses `FaultInjectionTestEnv`
but has so far relied on reads from `RandomAccessFile`s succeeding even after
deactivating the filesystem.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7447

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D23971740

Pulled By: ltamasi

fbshipit-source-id: 8492736cb64b1ee138c658822535f3ff4fe560c6
2020-09-28 17:32:06 -07:00
Peter Dillinger
08552b19d3 Genericize and clean up FastRange (#7436)
Summary:
A generic algorithm in progress depends on a templatized
version of fastrange, so this change generalizes it and renames
it to fit our style guidelines, FastRange32, FastRange64, and now
FastRangeGeneric.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436

Test Plan: added a few more test cases

Reviewed By: jay-zhuang

Differential Revision: D23958153

Pulled By: pdillinger

fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8
2020-09-28 11:35:00 -07:00
Zhichao Cao
0ce9b3a22d Add AppendWithVerify and PositionedAppendWithVerify to Env and FileSystem (#7419)
Summary:
Add new AppendWithVerify and PositionedAppendWithVerify APIs to Env and FileSystem to bring the data verification information (data checksum information) from upper layer (e.g., WritableFileWriter) to the storage layer. This PR only include the API definition, no functional codes are added to unblock other developers which depend on these APIs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7419

Test Plan: make -j32

Reviewed By: pdillinger

Differential Revision: D23883196

Pulled By: zhichao-cao

fbshipit-source-id: 94676c26bc56144cc32e3661f84f21eccd790411
2020-09-23 19:02:26 -07:00
Peter Dillinger
9d8eb77c4d Less I/O for incremental backups, slightly better corruption detection (#7413)
Summary:
Two relatively simple functional changes to incremental backup
behavior, integrated with a minor refactoring to reduce code redundancy and
improve error/log message. There are nuances to the impact of these changes,
but I believe they are fundamentally good and generally safe. Those functional
changes:

* Incremental backups no longer read DB table files that are already saved to a
shared part of the backup directory, unless `share_files_with_checksum` is used
with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file
checksums are needed to determine file naming.
  * Justification: incremental backups should not need to read the whole DB,
especially without rate limiting. (Although other BackupEngine reads are not
rate limited either, other non-trivial reads are generally limited by a
corresponding write, as in copying files.) Also, the fact that this is not
already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110.

* When considering whether a table file is already backed up in a shared part
of backup directory, BackupEngine would already query the sizes of source (DB)
and pre-existing destination (backup) files. BackupEngine now uses these file
sizes to detect corruption, as at least one of (a) old backup, (b) backup in
progress, or (c) current DB is corrupt if there's a size mismatch.
  * Justification: a random related fix that also helps to cover a small hole
in corruption checking uncovered by the other functional change:
  * For `share_table_files` without "checksum" (not recommended), the other
change regresses in detecting fundamentally unsafe use of this option
combination: when you might generate different versions of same SST file
number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this
regression is greatly mitigated by the new file size checking. Nevertheless,
almost no reason to use `share_files_with_checksum=false` should remain, and
comments are updated appropriately.

Also, this change renames internal function `CalculateChecksum` to
`ReadFileAndComputeChecksum` to make the performance impact of this function
clear in code reviews.

It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it
cannot be true for a DB with unique file names (like DBImpl). Nevertheless,
I've tried to keep its functionality intact when `true` to minimize risk for
now, despite having no unit tests for which it is true.

Select impact details (much more in unit tests): For
`share_files_with_checksum`, I am confident there is no regression (vs.
pre-6.12) in detecting DB or backup corruption at backup creation time, mostly
because the old design did not leverage this extra checksum computation for
detecting inconsistencies at backup creation time. (With computed checksums in
names, a recently corrupted file just looked like a different file vs. what was
already backed up.)

Even in the hypothetical case of DB session id collision (~100 bits entropy
collision), file size in name and/or our file size check add an extra layer of
protection against false success in creating an accurate new backup. (Unit test
included.)

`DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking
are still able to catch corruptions that `CreateNewBackup` does not. Note that
when custom file checksum support is added to BackupEngine, that will
essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`.
We could add options for `CreateNewBackup` to cover some of what would be
caught by `VerifyBackup` with checksum checking.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413

Test Plan:
Two new unit tests included, both of which fail without these
changes. Although we don't test the I/O improvement directly, we test it
indirectly in DB corruption detection power that was inadvertently unlocked
with new backup file naming PLUS computing current content checksums (now
removed). (I don't think that case of DB corruption detection justifies reading
the whole DB on incremental backup.)

Reviewed By: zhichao-cao

Differential Revision: D23818480

Pulled By: pdillinger

fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac
2020-09-21 16:19:24 -07:00
Peter Dillinger
b475a83f9d Postponing custom checksum support in BackupEngine (#7411)
Summary:
This change reverts BackupEngine to 6.12 state to accommodate a
higher-priority fix that does not easily merge with this custom checksum
support. We intend to reinstate this support soon, by merging a revert
of this change.

For backupable_db_test, I've removed the tests depending on this
feature.

I've also removed relevant HISTORY.md entry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411

Test Plan: unit tests

Reviewed By: ajkr

Differential Revision: D23793835

Pulled By: pdillinger

fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052
2020-09-18 15:27:03 -07:00
Peter Dillinger
93719fc953 Restore file size in backup table file names (and other cleanup) (#7400)
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.

This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.

We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).

We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.

Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`

Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.

Actual file name patterns and priorities, as regexes:

    kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
      [0-9]+_[0-9]+_[0-9]+[.]sst
    kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
      [0-9]+_[0-9a-fA-F-]+[.]sst
    kUseDbSessionId AND NOT kFlagIncludeFileSize ->
      [0-9]+_s[0-9A-Z]{20}[.]sst
    kUseDbSessionId AND kFlagIncludeFileSize (default) ->
      [0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst

We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.

This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400

Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.

Reviewed By: ajkr

Differential Revision: D23759587

Pulled By: pdillinger

fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
2020-09-17 10:24:22 -07:00
mrambacher
a08d6f18f0 Add more tests to ASSERT_STATUS_CHECKED (#7367)
Summary:
db_options_test
options_file_test
auto_roll_logger_test
options_util_test
persistent_cache_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7367

Reviewed By: jay-zhuang

Differential Revision: D23712520

Pulled By: zhichao-cao

fbshipit-source-id: 99b331e357f5d6a6aabee89d1bd933002cbb3908
2020-09-16 15:48:07 -07:00
mrambacher
7d472accdc Bring the Configurable options together (#5753)
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753

Reviewed By: ajkr

Differential Revision: D23385030

Pulled By: zhichao-cao

fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
2020-09-14 17:01:01 -07:00
Levi Tamasi
5ce246c716 Expose the start of the expiration range for TTL blob files through LiveFileMetaData (#7365)
Summary:
The patch adds support for exposing the start of the expiration range
for TTL blob files through the `GetLiveFilesMetaData` API. This can be
used for monitoring purposes, i.e. to make sure TTL blob files are
deleted in a timely manner. The patch also fixes a couple of uninitialized
variable issues.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7365

Test Plan: `make check`

Reviewed By: pdillinger

Differential Revision: D23605465

Pulled By: ltamasi

fbshipit-source-id: 97a9612bf5f4b058423debdd3f28f576bb23a70f
2020-09-10 11:33:33 -07:00
Peter Dillinger
4e258d3e63 Fix backup/restore in stress/crash test (#7357)
Summary:
(1) Skip check on specific key if restoring an old backup
(small minority of cases) because it can fail in those cases. (2) Remove
an old assertion about number of column families and number of keys
passed in, which is broken by atomic flush (cf_consistency) test. Like
other code (for better or worse) assume a single key and iterate over
column families. (3) Apply mock_direct_io to NewSequentialFile so that
db_stress backup works on /dev/shm.

Also add more context to output in case of backup/restore db_stress
failure.

Also a minor fix to BackupEngine to report first failure status in
creating new backup, and drop another clue about the potential
source of a "Backup failed" status.

Reverts "Disable backup/restore stress test (https://github.com/facebook/rocksdb/issues/7350)"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7357

Test Plan:
Using backup_one_in=10000,
"USE_CLANG=1 make crash_test_with_atomic_flush" for 30+ minutes
"USE_CLANG=1 make blackbox_crash_test" for 30+ minutes
And with use_direct_reads with TEST_TMPDIR=/dev/shm/rocksdb

Reviewed By: riversand963

Differential Revision: D23567244

Pulled By: pdillinger

fbshipit-source-id: e77171c2e8394d173917e36898c02dead1c40b77
2020-09-08 10:50:19 -07:00
Jay Zhuang
55bf42a80c Recompress blobs during GC if compression changed (#7331)
Summary:
Recompress blobs if compression type is changed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7331

Test Plan: `make check`

Reviewed By: ltamasi

Differential Revision: D23437102

Pulled By: jay-zhuang

fbshipit-source-id: bb699ebdad137721d422e42e331d4de8a82a7c5f
2020-09-01 18:03:50 -07:00
Levi Tamasi
5043960623 Add a blob file builder class that can be used in background jobs (#7306)
Summary:
The patch adds a class called `BlobFileBuilder` that can be used to build
and cut blob files in background jobs (flushes/compactions). The class
enforces a value size threshold (`min_blob_size`; smaller blobs will be inlined
in the LSM tree itself), and supports specifying a blob file size limit (`blob_file_size`),
as well as compression (`blob_compression_type`) and checksums for blob files.
It also keeps track of the generated blob files and their associated `BlobFileAddition`
metadata, which can be applied as part of the background job's `VersionEdit`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7306

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23298817

Pulled By: ltamasi

fbshipit-source-id: 38f35d81dab1ba81f15236240612ec173d7f21b5
2020-08-27 11:55:54 -07:00
Peter Dillinger
9aad24da55 Real fix for race in backup custom checksum checking (#7309)
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.

This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.

Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309

Test Plan: updated unit test, before and after

Reviewed By: ajkr

Differential Revision: D23311994

Pulled By: pdillinger

fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
2020-08-26 10:39:20 -07:00
mrambacher
b7e1c5213f Add some simulator cache and block tracer tests to ASSERT_STATUS_CHECKED (#7305)
Summary:
More tests now pass.  When in doubt, I added a TODO comment to check what should happen with an ignored error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7305

Reviewed By: akankshamahajan15

Differential Revision: D23301262

Pulled By: ajkr

fbshipit-source-id: 5f120edc7393560aefc0633250277bbc7e8de9e6
2020-08-24 16:43:31 -07:00
rockeet
e653af7164 DBWithTTL::Open() param ttls: vector<int32_t> to const vector<int32_t>& (#7196)
Summary:
fix DBWithTTL::Open() param ttls: vector<int32_t> to const vector<int32_t>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7196

Reviewed By: akankshamahajan15

Differential Revision: D23277772

Pulled By: ajkr

fbshipit-source-id: bf69834b5c2062c7e166dab21fbfd40416c7872d
2020-08-24 16:24:16 -07:00
sdong
5aacef9712 Disable fsync in SeqAdvanceConcurrentTest (#7302)
Summary:
SeqAdvanceConcurrentTest sometimes runs too long on some platforms. Disable fsync to speed it up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7302

Test Plan: Run the tests and watch CI.

Reviewed By: ajkr

Differential Revision: D23298192

fbshipit-source-id: 2185eed4e0958c3de5e8a3f94ceed5be5945ed37
2020-08-24 13:22:06 -07:00
Peter Dillinger
a1b5484811 Work around a backup bug with DB custom checksums (#7294)
Summary:
On a read-write DB configured with
DBOptions::file_checksum_gen_factory, BackupEngine::CreateNewBackup can
fail intermittently, with non-OK status. This is due to a race between
GetLiveFiles and GetLiveFilesChecksumInfo in creating backups.

For patching 6.12 release (as this commit is intended for, except this is a
forward-merged version), we can simply treat files for which we falsely failed
to get checksum info as legacy files lacking checksum info.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7294

Test Plan: unit test reproducer included

Reviewed By: ajkr

Differential Revision: D23253489

Pulled By: pdillinger

fbshipit-source-id: 9e4945dad120b776ad3e753be10b962f61f28e14
2020-08-21 08:16:04 -07:00
mrambacher
e9befdebbf Add EnvTestWithParam::OptionsTest to the ASSERT_STATUS_CHECKED passes (#7283)
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests.  The DB functionality required for this test now passes the check.

When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283

Reviewed By: akankshamahajan15

Differential Revision: D23251497

Pulled By: ajkr

fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
2020-08-20 19:18:35 -07:00
Peter Dillinger
7d0ecab570 Fix some flaky tests in BackupableDBTest with intentional flushing (#7273)
Summary:
Some tests like BackupableDBTest.FileCollision and
ShareTableFilesWithChecksumsNewNaming are intermittently failing,
probably due to unpredictable flushing with FillDB. This change
should fix the failures seen and help to prevent similar flakiness in
future tests in the file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7273

Test Plan: make check, and with valgrind

Reviewed By: siying

Differential Revision: D23176947

Pulled By: pdillinger

fbshipit-source-id: 654b73a64db475f2b9b065ed53a889a8b9083c59
2020-08-17 22:07:17 -07:00
sdong
b194c21bba Whole DBTest to skip fsync (#7274)
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.

This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic
infrastructure mis-merge.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274

Test Plan: Run all existing files.

Reviewed By: pdillinger

Differential Revision: D23177444

fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33
2020-08-17 18:42:25 -07:00
Zitan Chen
500eeb6fd3 Re-enable param tests for backup engine (#7260)
Summary:
The param tests did not take any effect previously. This PR re-enables it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7260

Test Plan: Some manual tests and `./backupable_db_test`.

Reviewed By: siying

Differential Revision: D23140902

Pulled By: pdillinger

fbshipit-source-id: cd62b11b926affed25127d9074fa97a1c7f748c4
2020-08-17 13:59:21 -07:00
Zitan Chen
15245e9018 Fix flaky BackupableDBTest.CustomChecksumTransition (#7254)
Summary:
The flaky test in the title is caused by two problems. First, there is a bug in the BackupEngine that results in skipping computing the default crc32 checksum when `share_table_files` is enabled and the table is already backed up. Second, when `RestoreDBFromBackup` fails and the backup was being restored to the DB directory, it is likely that `RestoreDBFromBackup` has cleaned up the DB directory before it fails, and therefore, files in old backups may collide with files to be backed up if `share_files_with_checksum` is not enabled.

New tests that cover the above problems are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7254

Test Plan: `./backupable_db_test`

Reviewed By: pdillinger

Differential Revision: D23118715

Pulled By: gg814

fbshipit-source-id: 7be8de912808944be59e93d602c7431a54c079eb
2020-08-14 13:34:15 -07:00
Zitan Chen
b578ca2e4d BackupEngine supports custom file checksums (#7085)
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.

Tests are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22390756

Pulled By: gg814

fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
2020-08-12 13:31:09 -07:00
Peter Dillinger
6ac1d25fd0 Fix+clean up handling of mock sleeps (#7101)
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.

This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.

More specifics:

Removed some unused test classes, and updated comments on the general
problem.

Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.

Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env

Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc

stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)

Intended follow-up:

Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)

With that change, we can simplify/consolidate the scattered work-arounds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101

Test Plan: make check on Linux and MacOS

Reviewed By: zhichao-cao

Differential Revision: D23032815

Pulled By: pdillinger

fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
2020-08-11 12:41:30 -07:00
Cheng Chang
71c7e4935e Replace tracked_keys with a new LockTracker interface in TransactionDB (#7013)
Summary:
We're going to support more locking protocols such as range lock in transaction.

However, in current design, `TransactionBase` has a member `tracked_keys` which assumes that point lock (lock a single key) is used, and is used in snapshot checking (isolation protocol). When using range lock, we may use read committed instead of snapshot checking as the isolation protocol.

The most significant usage scenarios of `tracked_keys` are:
1. pessimistic transaction uses it to track the locked keys, and unlock these keys when commit or rollback.
2. optimistic transaction does not lock keys upfront, it only tracks the lock intentions in tracked_keys, and do write conflict checking when commit.
3. each `SavePoint` tracks the keys that are locked since the `SavePoint`, `RollbackToSavePoint` or `PopSavePoint` relies on both the tracked keys in `SavePoint`s and `tracked_keys`.

Based on these scenarios, if we can abstract out a `LockTracker` interface to hold a set of tracked locks (can be keys or key ranges), and have methods that can be composed together to implement the scenarios, then `tracked_keys` can be an internal data structure of one implementation of `LockTracker`. See `utilities/transactions/lock/lock_tracker.h` for the detailed interface design, and `utilities/transactions/lock/point_lock_tracker.cc` for the implementation.

In the future, a `RangeLockTracker` can be implemented to track range locks without affecting other components.

After this PR, a clean interface for lock manager should be possible, and then ideally, we can have pluggable locking protocols.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7013

Test Plan: Run `transaction_test` and `optimistic_transaction_test`.

Reviewed By: ajkr

Differential Revision: D22163706

Pulled By: cheng-chang

fbshipit-source-id: f2860577b5334e31dd2994f5bc6d7c40d502b1b4
2020-08-06 12:38:00 -07:00
Levi Tamasi
124fbd96d8 Remove assertion from FaultInjectionTestFS::NewDirectory (#7220)
Summary:
FaultInjectionTestFS::NewDirectory currently asserts that the directory
creation on the target filesystem succeeds. This is actually not
guaranteed since there might be a legitimate I/O error when creating the
directory. The patch removes this assertion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7220

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D22957990

Pulled By: ltamasi

fbshipit-source-id: b2e221320d8ce7235cb4897ef5936072412a25b6
2020-08-05 16:25:14 -07:00
Andrew Kryczka
a4a4a2dabd dedup ReadOptions in iterator hierarchy (#7210)
Summary:
Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
and every `LevelIterator`. This redundancy consumes extra memory,
resulting in the `Arena` making more allocations, and iteration
observing worse cache performance.

This PR migrates callers of `NewInternalIterator()` and
`MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
outlive the returned iterator. When the iterator's lifetime will be managed by the
user, this lifetime guarantee is achieved by storing the `ReadOptions`
value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
`MakeInputIterator()` can hold a reference-to-const `ReadOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210

Test Plan:
- `make check` under ASAN and valgrind
- benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.

Reviewed By: anand1976

Differential Revision: D22861323

Pulled By: ajkr

fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
2020-08-03 15:23:04 -07:00
mrambacher
d9d190742c Make env*_test work with ASSERT_STATUS_CHECKED (#7176)
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.

One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176

Reviewed By: cheng-chang

Differential Revision: D22799278

Pulled By: ajkr

fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
2020-07-28 22:59:48 -07:00