Commit Graph

3924 Commits

Author SHA1 Message Date
anand76
d70011bccc Check KeyContext status in MultiGet (#6387)
Summary:
Currently, any IO errors and checksum mismatches while reading data
blocks, are being ignored by the batched MultiGet. Its only looking at
the GetContext state. Fix that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387

Test Plan: Add unit tests

Differential Revision: D19799819

Pulled By: anand1976

fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969
2020-02-07 16:48:16 -08:00
Robert Yang
4e457278fa db/write_thread.cc: Initialize state (#6275)
Summary:
Fixed an error when compiled with -Og:
db/write_thread.cc:183:14: error: 'state' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Signed-off-by: Robert Yang <liezhi.yang@windriver.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6275

Differential Revision: D19381755

fbshipit-source-id: a90bf3cd4a7248d9d71219e918fc6253deb97e3c
2020-02-07 15:20:38 -08:00
Levi Tamasi
752c87af78 Clean up VersionEdit a bit (#6383)
Summary:
This is a bunch of small improvements to `VersionEdit`. Namely, the patch

* Makes the names and order of variables, methods, and code chunks related
  to the various information elements more consistent, and adds missing
  getters for the sake of completeness.
* Initializes previously uninitialized stack variables.
* Marks all getters const to improve const correctness.
* Adds in-class initializers and removes the default ctor that would
  create an object with uninitialized built-in fields and call `Clear`
  afterwards.
* Adds a new type alias for new files and changes the existing `typedef`
  for deleted files into a type alias as well.
* Makes the helper method `DecodeNewFile4From` private.
* Switches from long-winded iterator syntax to range based loops in a
  couple of places.
* Fixes a couple of assignments where an integer 0 was assigned to
  boolean members.
* Fixes a getter which used to return a `const std::string` instead of
the intended `const std::string&`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6383

Test Plan: make check

Differential Revision: D19780537

Pulled By: ltamasi

fbshipit-source-id: b0b4f09fee0ec0e7c7b7a6d76bfe5346e91824d0
2020-02-07 13:27:06 -08:00
Cheng Chang
5f478b9f75 Remove outdated comment (#6379)
Summary:
Since the logic for handling IDENTITY file is now inside `NewDB`, the comment above `NewDB` is no longer relevant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6379

Test Plan: not needed

Differential Revision: D19795440

Pulled By: cheng-chang

fbshipit-source-id: 0b1cca87ac6d92474701c46aa4c8d4d708bfa19b
2020-02-07 13:18:43 -08:00
Cheng Chang
0a74e1b958 Add status checks during DB::Open (#6380)
Summary:
Several statuses were not checked during DB::Open.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6380

Test Plan: make check

Differential Revision: D19780237

Pulled By: cheng-chang

fbshipit-source-id: c8d189d20344bd1607890dd1449345bda2ef96b9
2020-02-07 12:32:09 -08:00
Yanqin Jin
f361cedf06 Atomic flush rollback once on failure (#6385)
Summary:
Before this fix, atomic flush codepath may hit an assertion failure on a specific failure case.
If all flush jobs within an atomic flush succeed (they do not write to MANIFEST), but batch writing version edits to MANIFEST fails, then `cfd->imm()->RollbackMemTableFlush()` will be called twice, and the second invocation hits assertion failure `assert(m->flush_in_progress_)` since the first invocation resets the variable `flush_in_progress_` to false already.

Test plan (dev server):
```
./db_flush_test --gtest_filter=DBAtomicFlushTest/DBAtomicFlushTest.RollbackAfterFailToInstallResults
make check
```
Both must succeed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6385

Differential Revision: D19782943

Pulled By: riversand963

fbshipit-source-id: 84e1592625e729d1b70fdc8479959387a74cb121
2020-02-07 10:52:10 -08:00
Cheng Chang
f5f79f01a2 Be able to read compatible leveldb sst files (#6370)
Summary:
In `DBSSTTest.SSTsWithLdbSuffixHandling`, some sst files are renamed to ldb files, the original intention of the test is to test that the ldb files can be loaded along with the sst files.

The original test checks this by `ASSERT_NE("NOT_FOUND", Get(Key(k)))`, but the problem is `Get(Key(k))` returns IO error due to path not found instead of NOT_FOUND, so the success of ASSERT_NE does not mean the key can be retrieved.

This PR updates the test to make sure Get(Key(k)) returns the original value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6370

Test Plan: make db_sst_test && ./db_sst_test

Differential Revision: D19726278

Pulled By: cheng-chang

fbshipit-source-id: 993127f56457b315e669af4eeb92d6f956b7a4b7
2020-02-06 10:15:44 -08:00
Mike Kolupaev
1ed7d9b1b5 Avoid lots of calls to Env::GetFileSize() in SstFileManagerImpl when opening DB (#6363)
Summary:
Before this PR it calls GetFileSize() once for each sst file in the DB. This can take a long time if there are be tens of thousands of sst files (e.g. in thousands of column families), and even longer if Env is talking to some remote service rather than local filesystem. This PR makes DB::Open() use sst file sizes that are already known from manifest (typically almost all files in the DB) and only call GetFileSize() for non-sst or obsolete files. Note that GetFileSize() is also called and checked against manifest in CheckConsistency(), so the calls in SstFileManagerImpl were completely redundant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6363

Test Plan: deployed to a test cluster, looked at a dump of Env calls (from a custom instrumented Env) - no more thousands of GetFileSize()s.

Differential Revision: D19702509

Pulled By: al13n321

fbshipit-source-id: 99f8110620cb2e9d0c092dfcdbb11f3af4ff8b73
2020-02-04 13:41:53 -08:00
sdong
69c8614815 Avoid to get manifest file size when recovering from it. (#6369)
Summary:
Right now RocksDB gets manifest file size before recovering from it. The information is available in LogReader. Use it instead to prevent one file system call.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6369

Test Plan: Run all existing tests

Differential Revision: D19714872

fbshipit-source-id: 0144be324d403c99e3da875ea2feccc8f64e883d
2020-02-04 11:39:23 -08:00
Mike Kolupaev
637e64b9ac Add an option to prevent DB::Open() from querying sizes of all sst files (#6353)
Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.

If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.

We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353

Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.

Differential Revision: D19656425

Pulled By: al13n321

fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
2020-02-04 01:27:26 -08:00
anand76
7330ec0ff1 Fix a test failure in error_handler_test (#6367)
Summary:
Fix an intermittent failure in
DBErrorHandlingTest.CompactionManifestWriteError due to a race between
background error recovery and the main test thread calling
TEST_WaitForCompact().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6367

Test Plan: Run the test using gtest_parallel

Differential Revision: D19713802

Pulled By: anand1976

fbshipit-source-id: 29e35dc26e0984fe8334c083e059f4fa1f335d68
2020-02-03 18:16:52 -08:00
sdong
f195d8d523 Use ReadFileToString() to get content from IDENTITY file (#6365)
Summary:
Right now when reading IDENTITY file, we use a very similar logic as ReadFileToString() while it does an extra file size check, which may be expensive in some file systems. There is no reason to duplicate the logic. Use ReadFileToString() instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6365

Test Plan: RUn all existing tests.

Differential Revision: D19709399

fbshipit-source-id: 3bac31f3b2471f98a0d2694278b41e9cd34040fe
2020-02-03 17:40:49 -08:00
sdong
36c504be17 Avoid create directory for every column families (#6358)
Summary:
A relatively recent regression causes for every CF, create and open directory is called for the DB directory, unless CF has a private directory. This doesn't scale well with large number of column families.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6358

Test Plan: Run all existing tests and see it pass. strace with db_bench --num_column_families and observe it doesn't open directory for number of column families.

Differential Revision: D19675141

fbshipit-source-id: da01d9216f1dae3f03d4064fbd88ce71245bd9be
2020-02-03 14:13:39 -08:00
Huisheng Liu
eb4d6af5ae Error handler test fix (#6266)
Summary:
MultiDBCompactionError fails when it verifies the number of files on level 0 and level 1 without waiting for compaction to finish.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6266

Differential Revision: D19701639

Pulled By: riversand963

fbshipit-source-id: e96d511bcde705075f073e0b550cebcd2ecfccdc
2020-02-03 13:32:53 -08:00
sdong
800d24ddc5 Fix DBTest2.ChangePrefixExtractor LITE build (#6356)
Summary:
DBTest2.ChangePrefixExtractor fails in LITE build because LITE build doesn't support adaptive build. Fix it by removing the stats check but only check correctness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6356

Test Plan: Run the test with both of LITE and non-LITE build.

Differential Revision: D19669537

fbshipit-source-id: 6d7dd6c8a79f18e80ca1636864b9c71922030d8e
2020-01-31 15:44:14 -08:00
sdong
ec496347bc Add a unit test for prefix extractor changes (#6323)
Summary:
Add a unit test for prefix extractor change, including a check that fails due to a bug.
Also comment out the partitioned filter case which will fail the test too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6323

Test Plan: Run the test and it passes (and fails if the SeekForPrev() part is uncommented)

Differential Revision: D19509744

fbshipit-source-id: 678202ca97b5503e9de73b54b90de9e5ba822b72
2020-01-31 11:02:03 -08:00
Maysam Yabandeh
3316d29221 Disable recycle_log_file_num when it is incompatible with recovery mode (#6351)
Summary:
Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351

Differential Revision: D19648931

Pulled By: maysamyabandeh

fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7
2020-01-31 07:28:30 -08:00
Yanqin Jin
f2fbc5d668 Shorten certain test names to avoid infra failure (#6352)
Summary:
Unit test names, together with other components,  are used to create log files
during some internal testing. Overly long names cause infra failure due to file
names being too long.

Look for internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6352

Differential Revision: D19649307

Pulled By: riversand963

fbshipit-source-id: 6f29de096e33c0eaa87d9c8702f810eda50059e7
2020-01-30 23:10:24 -08:00
anand76
fb05b5a652 Force a new manifest file if append to current one fails (#6331)
Summary:
Fix for issue https://github.com/facebook/rocksdb/issues/6316

When an append/sync of the manifest file fails due to an IO error such
as NoSpace, we don't always put the DB in read-only mode. This is true
for flush and compactions, as well as foreground operatons such as column family
add/drop, CompactFiles etc. Subsequent changes to the DB will be
recorded in the same manifest file, which would have a corrupted record
in the middle due to the previous failure. On next DB::Open(), it will
fail to process the full manifest and data will be lost.

To fix this, we reset VersionSet::descriptor_log_ on append/sync
failure, which will force a new manifest file to be written on the next
append.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6331

Test Plan: Add new unit tests in error_handler_test.cc

Differential Revision: D19632951

Pulled By: anand1976

fbshipit-source-id: 68d527cb6e59a94cbbbf9f5a17a7f464381d51e3
2020-01-30 10:56:29 -08:00
sdong
71874c5aaf Fix LITE build with DBTest2.AutoPrefixMode1 (#6346)
Summary:
DBTest2.AutoPrefixMode1 doesn't pass because auto prefix mode is not supported there.
Fix it by disabling the test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6346

Test Plan: Run DBTest2.AutoPrefixMode1 in lite mode

Differential Revision: D19627486

fbshipit-source-id: fbde75260aeecb7e6fc406e09c19a71a95aa5f08
2020-01-29 16:43:42 -08:00
sdong
02ac6c9a3c Fix db_bloom_filter_test clang LITE build (#6340)
Summary:
db_bloom_filter_test break with clang LITE build with following message:

db/db_bloom_filter_test.cc:23:29: error: unused variable 'kPlainTable' [-Werror,-Wunused-const-variable]
static constexpr PseudoMode kPlainTable = -1;
                            ^

Fix it by moving the declaration out of LITE build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6340

Test Plan:
USE_CLANG=1 LITE=1 make db_bloom_filter_test
and without LITE=1

Differential Revision: D19609834

fbshipit-source-id: 0e88f5c6759238a94f9880d84c785ac18e7cdd7e
2020-01-29 12:57:48 -08:00
Maysam Yabandeh
2f973ca96e Double Crash in kPointInTimeRecovery with TransactionDB (#6313)
Summary:
In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery.
Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313

Differential Revision: D19458291

Pulled By: maysamyabandeh

fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d
2020-01-29 11:40:55 -08:00
sdong
8f2bee6747 Add ReadOptions.auto_prefix_mode (#6314)
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314

Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.

Differential Revision: D19458717

fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
2020-01-28 14:44:05 -08:00
Sagar Vemuri
4f6c86226c Use the same oldest ancestor time in table properties and manifest
Summary:
./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions passed 96 / 100 times.
```
With the fix: all runs (tried 100, 1000, 10000) succeed.
```
$ TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=1000
[1000/1000] DBCompactionTest.LevelTtlCascadingCompactions (1895 ms)
```

Test Plan:
Build:
```
COMPILE_WITH_TSAN=1 make db_compaction_test -j100
```
Without the fix: a few runs out of 100 fail:
```
$ TEST_TMPDIR=/dev/shm KEEP_DB=1 ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=100
...
...
Note: Google Test filter = DBCompactionTest.LevelTtlCascadingCompactions
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN      ] DBCompactionTest.LevelTtlCascadingCompactions
db/db_compaction_test.cc:3687: Failure
Expected equality of these values:
  oldest_time
    Which is: 1580155869
  level_to_files[6][0].oldest_ancester_time
    Which is: 1580155870
DB is still at /dev/shm//db_compaction_test_6337001442947696266
[  FAILED  ] DBCompactionTest.LevelTtlCascadingCompactions (1432 ms)
[----------] 1 test from DBCompactionTest (1432 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1433 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DBCompactionTest.LevelTtlCascadingCompactions

 1 FAILED TEST
[80/100] DBCompactionTest.LevelTtlCascadingCompactions returned/aborted with exit code 1 (1489 ms)
[100/100] DBCompactionTest.LevelTtlCascadingCompactions (1522 ms)
FAILED TESTS (4/100):
    1419 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/90)
    1434 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/84)
    1457 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/82)
    1489 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/74)

Differential Revision: D19587040

Pulled By: sagar0

fbshipit-source-id: 11191ae9940837643bff47ebe18b299b4be3d950
2020-01-27 19:58:53 -08:00
Andrew Kryczka
5b33cfa1e3 fix WriteBufferManager flush log message (#6335)
Summary:
It chooses the oldest memtable, not the largest one. This is an
important difference for users whose CFs receive non-uniform write
rates.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6335

Differential Revision: D19588865

Pulled By: maysamyabandeh

fbshipit-source-id: 62ad4325b0182f5f27858584cd73fd5978fb2cec
2020-01-27 15:49:22 -08:00
sdong
f10f135938 Fix regression bug of hash index with iterator total order seek (#6328)
Summary:
https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash.
Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked

Also, a null pointer check is added so that a bug like this won't cause segfault in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328

Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass.

Differential Revision: D19586751

fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42
2020-01-27 15:44:54 -08:00
Levi Tamasi
f34782a67d Fix the "records dropped" statistics (#6325)
Summary:
The earlier code used two conflicting definitions for the number of
input records going into a compaction, one based on the
`rocksdb.num.entries` table property and one based on
`CompactionIterationStats`. The first one is correct and in line
with how output records are counted, while the second one incorrectly
ignores input records in various cases when the `CompactionIterator`
advances or reseeks the input iterator (this can happen, amongst other
cases, when dealing with `SingleDelete`s, regular `Delete`s, `Merge`s,
and compaction filters). This can result in the code undercounting the
input records and computing an incorrect value for "records dropped"
during the compaction. The patch fixes this by switching over to the
correct (table property based) input record count for "records dropped".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6325

Test Plan: Tested using `make check` and `db_bench`.

Differential Revision: D19525491

Pulled By: ltamasi

fbshipit-source-id: 4340b0b2f41546db8e356db70ca02199e48fa636
2020-01-23 15:27:22 -08:00
anand76
0672a6db64 Fix queue manipulation in WriteThread::BeginWriteStall() (#6322)
Summary:
When there is a write stall, the active write group leader calls ```BeginWriteStall()``` to walk the queue of writers and remove any with the ```no_slowdown``` option set. There was a bug in the code which updated the back pointer but not the forward pointer (```link_newer```), corrupting the list and causing some threads to wait forever. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6322

Test Plan: Add a unit test in db_write_test

Differential Revision: D19538313

Pulled By: anand1976

fbshipit-source-id: 6fbed819e594913f435886606f5d36f74f235c3a
2020-01-23 14:01:28 -08:00
matthewvon
e6e8b9e871 Correct pragma once problem with Bazel on Windows (#6321)
Summary:
This is a simple edit to have two #include file paths be consistent within range_del_aggregator.{h,cc} with everywhere else.

The impact of this inconsistency is that it actual breaks a Bazel based build on the Windows platform. The same pragma once failure occurs with both Windows Visual C++ 2019 and clang for Windows 9.0. Bazel's "sandboxing" of the builds causes both compilers to not properly recognize "rocksdb/types.h" and "include/rocksdb/types.h" to be the same file (also comparator.h). My guess is that the backslash versus forward slash mixing within path names is the underlying issue.

But, everything builds fine once the include paths in these two source files are consistent with the rest of the repository.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6321

Differential Revision: D19506585

Pulled By: ltamasi

fbshipit-source-id: 294c346607edc433ab99eaabc9c880ee7426817a
2020-01-21 16:12:43 -08:00
Levi Tamasi
d305f13e21 Make DBCompactionTest.SkipStatsUpdateTest more robust (#6306)
Summary:
Currently, this test case tries to infer whether
`VersionStorageInfo::UpdateAccumulatedStats` was called during open by
checking the number of files opened against an arbitrary threshold (10).
This makes the test brittle and results in sporadic failures. The patch
changes the test case to use sync points to directly test whether
`UpdateAccumulatedStats` was called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6306

Test Plan: `make check`

Differential Revision: D19439544

Pulled By: ltamasi

fbshipit-source-id: ceb7adf578222636a0f51740872d0278cd1a914f
2020-01-21 12:55:55 -08:00
chenyou-fdu
931876e86e Separate enable-WAL and disable-WAL writer to avoid unwanted data in log files (#6290)
Summary:
When we do concurrently writes, and different write operations will have WAL enable or disable.
But the data from write operation with WAL disabled will still be logged into log files, which will lead to extra disk write/sync since we do not want any guarantee for these part of data.

Detail can be found in https://github.com/facebook/rocksdb/issues/6280. This PR avoid mixing the two types in a write group. The advantage is simpler reasoning about the write group content
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6290

Differential Revision: D19448598

Pulled By: maysamyabandeh

fbshipit-source-id: 3d990a0f79a78ea1bfc90773f6ebafc1884c20de
2020-01-17 15:54:55 -08:00
Matt Bell
7e5b04d04f Expose atomic flush option in C API (#6307)
Summary:
This PR adds a `rocksdb_options_set_atomic_flush` function to the C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6307

Differential Revision: D19451313

Pulled By: ltamasi

fbshipit-source-id: 750495642ef55b1ea7e13477f85c38cd6574849c
2020-01-17 12:57:48 -08:00
sdong
d87cffaea4 Fix another bug caused by recent hash index fix (#6305)
Summary:
Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305

Test Plan: A problematic DB used to return NotFound incorrectly, and now able to return correct result. Will try to construct a unit test too.0

Differential Revision: D19438925

fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217
2020-01-17 01:41:04 -08:00
Levi Tamasi
73f65b457e Adjust thread pool sizes when setting max_background_jobs dynamically (#6300)
Summary:
https://github.com/facebook/rocksdb/pull/2205 introduced a new
configuration option called `max_background_jobs`, superseding the
earlier options `max_background_flushes` and
`max_background_compactions`. However, unlike
`max_background_compactions`, setting `max_background_jobs` dynamically
through the `SetDBOptions` interface does not adjust the size of the
thread pools (see https://github.com/facebook/rocksdb/issues/6298). The
patch fixes this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6300

Test Plan: Extended unit test.

Differential Revision: D19430899

Pulled By: ltamasi

fbshipit-source-id: 704006605b3c13c3d1b997ccc0831ee369721074
2020-01-16 14:35:10 -08:00
sdong
f8b5ef85ec Fix a bug caused by recent fix of Prefix Hash (#6302)
Summary:
Recent fix to Prefix Hash https://github.com/facebook/rocksdb/pull/6292 caused a bug that the newly created NotFound status in hash index is never reset. This causes reseek or implict reseek to return wrong results sometimes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302

Test Plan:
Add a unit test that would fail. Not fix.
crash test with hash test would fail in several seconds. With the fix, it will run about several minutes before failing with another failure.

Differential Revision: D19424572

fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566
2020-01-16 10:47:20 -08:00
sdong
d2b4d42d4b Fix kHashSearch bug with SeekForPrev (#6297)
Summary:
When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target.
Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297

Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix.

Differential Revision: D19404695

fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3
2020-01-15 14:28:39 -08:00
sdong
76c117b24b Fix LITE test build broken by recent commit (#6295)
Summary:
A recent commit adds a unit test that uses a function not available in LITE build. Fix it by avoiding the call
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6295

Test Plan: Run the test in LITE build and see it passes.

Differential Revision: D19395678

fbshipit-source-id: 37b42835bae02511630d80f7cafb1179401bc033
2020-01-14 13:17:04 -08:00
sdong
894c6d21af Bug when multiple files at one level contains the same smallest key (#6285)
Summary:
The fractional cascading index is not correctly generated when two files at the same level contains the same smallest or largest user key.
The result would be that it would hit an assertion in debug mode and lower level files might be skipped.
This might cause wrong results when the same user keys are of merge operands and Get() is called using the exact user key. In that case, the lower files would need to further checked.
The fix is to fix the fractional cascading index.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6285

Test Plan: Add a unit test which would cause the assertion which would be fixed.

Differential Revision: D19358426

fbshipit-source-id: 39b2b1558075fd95e99491d462a67f9f2298c48e
2020-01-13 16:27:42 -08:00
Qinfan Wu
6733be033e More const pointers in C API (#6283)
Summary:
This makes it easier to call the functions from Rust as otherwise they require mutable types.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6283

Differential Revision: D19349991

Pulled By: wqfish

fbshipit-source-id: e8da7a75efe8cd97757baef8ca844a054f2519b4
2020-01-10 19:27:09 -08:00
Sagar Vemuri
cfa585611d Consider all compaction input files to compute the oldest ancestor time (#6279)
Summary:
Look at all compaction input files to compute the oldest ancestor time.

In https://github.com/facebook/rocksdb/issues/5992 we changed how creation_time (aka oldest-ancestor-time) table property of compaction output files is computed from max(creation-time-of-all-compaction-inputs) to min(creation-time-of-all-inputs). This exposed a bug where, during compaction, the creation_time:s of only the L0 compaction inputs were being looked at, and all other input levels were being ignored. This PR fixes the issue.
Some TTL compactions when using Level-Style compactions might not have run due to this bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6279

Test Plan: Enhanced the unit tests to validate that the correct time is propagated to the compaction outputs.

Differential Revision: D19337812

Pulled By: sagar0

fbshipit-source-id: edf8a72f11e405e93032ff5f45590816debe0bb4
2020-01-10 19:02:42 -08:00
Maysam Yabandeh
eff5e076f5 unordered_write incompatible with max_successive_merges (#6284)
Summary:
unordered_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions.
The patch fixes that and also reverts the changes performed by https://github.com/facebook/rocksdb/pull/6254, in which max_successive_merges was mistakenly declared incompatible with unordered_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6284

Differential Revision: D19356115

Pulled By: maysamyabandeh

fbshipit-source-id: f06dadec777622bd75f267361c022735cf8cecb6
2020-01-10 16:53:19 -08:00
Yanqin Jin
6a9989381f Fix compilation under LITE (#6277)
Summary:
Fix compilation under LITE by putting `#ifndef ROCKSDB_LITE` around a code block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6277

Differential Revision: D19334157

Pulled By: riversand963

fbshipit-source-id: 947111ed68aa550f5ea424b216c1442a8af9e32b
2020-01-09 15:57:39 -08:00
Yanqin Jin
cfd9732f65 Remove inaccurate code comment (#6274)
Summary:
Remove a comment.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6274

Differential Revision: D19323151

Pulled By: riversand963

fbshipit-source-id: d0d804d6882edcd94e35544ef45578b32ff1caae
2020-01-08 17:51:42 -08:00
Huisheng Liu
e5b476f551 Update file indexer to take timestamp into consideration (#6205)
Summary:
Exclude timestamp in key comparison during boundary calculation to avoid key versions being excluded.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6205

Differential Revision: D19166765

Pulled By: riversand963

fbshipit-source-id: bbe08816fef8de349a83ebd59a595ad844021f24
2020-01-08 16:31:23 -08:00
Yanqin Jin
a8b1085ae2 Fix test in LITE mode (#6267)
Summary:
Currently, the recently-added test DBTest2.SwitchMemtableRaceWithNewManifest
fails in LITE mode since SetOptions() returns "Not supported". I do not want to
put `#ifndef ROCKSDB_LITE` because it reduces test coverage. Instead, just
trigger compaction on a different column family. The bg compaction thread
calling LogAndApply() may race with thread calling SwitchMemtable().

Test Plan (dev server):
make check
OPT=-DROCKSDB_LITE make check

or run DBTest2.SwitchMemtableRaceWithNewManifest 100 times.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6267

Differential Revision: D19301309

Pulled By: riversand963

fbshipit-source-id: 88cedcca2f985968ed3bb234d324ffa2aa04ca50
2020-01-07 13:47:03 -08:00
Yanqin Jin
bce5189f4d Fix error message (#6264)
Summary:
Fix an error message when CURRENT is not found.

Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6264

Differential Revision: D19300699

Pulled By: riversand963

fbshipit-source-id: 303fa206386a125960ecca1dbdeff07422690caf
2020-01-07 12:32:20 -08:00
Connor1996
3e26a94ba1 Add oldest snapshot sequence property (#6228)
Summary:
Add oldest snapshot sequence property, so we can use `db.GetProperty("rocksdb.oldest-snapshot-sequence")` to get the sequence number of the oldest snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6228

Differential Revision: D19264145

Pulled By: maysamyabandeh

fbshipit-source-id: 67fbe5304d89cbc475bd404e30d1299f7b11c010
2020-01-07 08:36:44 -08:00
Yanqin Jin
1aaa145877 Fix a data race for cfd->log_number_ (#6249)
Summary:
A thread calling LogAndApply may release db mutex when calling
WriteCurrentStateToManifest() which reads cfd->log_number_. Another thread can
call SwitchMemtable() and writes to cfd->log_number_.
Solution is to cache the cfd->log_number_ before releasing mutex in
LogAndApply.

Test Plan (on devserver):
```
$COMPILE_WITH_TSAN=1 make db_stress
$./db_stress --acquire_snapshot_one_in=10000 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=zstd --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_pipelined_write=0  --flush_one_in=1000000 --format_version=5 --get_live_files_and_wal_files_one_in=1000000 --index_block_restart_interval=5 --index_type=0 --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=0 --nooverwritepercent=1 --open_files=500000 --ops_per_thread=100000000 --partition_filters=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --set_options_one_in=10000 --snapshot_hold_ops=100000 --subcompactions=2 --sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35
```
Then repeat the following multiple times, e.g. 100 after compiling with tsan.
```
$./db_test2 --gtest_filter=DBTest2.SwitchMemtableRaceWithNewManifest
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6249

Differential Revision: D19235077

Pulled By: riversand963

fbshipit-source-id: 79467b52f48739ce7c27e440caa2447a40653173
2020-01-06 20:09:51 -08:00
Qinfan Wu
edaaa1fff2 Add range delete function to C-API (#6259)
Summary:
It seems that the C-API doesn't expose the range delete functionality at the moment, so add the API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6259

Differential Revision: D19290320

Pulled By: pdillinger

fbshipit-source-id: 3f403a4c3446d2042d55f1ece7cdc9c040f40c27
2020-01-06 10:46:21 -08:00
Maysam Yabandeh
28e5a9a9fb Increase max_log_size in FlushJob to 1024 bytes (#6258)
Summary:
When measure_io_stats_ is enabled, the volume of logging is beyond the default limit of 512 size. The patch allows the EventLoggerStream to change the limit, and also sets it to 1024 for FlushJob.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6258

Differential Revision: D19279269

Pulled By: maysamyabandeh

fbshipit-source-id: 3fb5d468dad488f289ac99d713378177eb7504d6
2020-01-06 10:16:52 -08:00
Maysam Yabandeh
48a678b7c9 Prevent an incompatible combination of options (#6254)
Summary:
allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254

Differential Revision: D19265819

Pulled By: maysamyabandeh

fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179
2020-01-02 16:15:06 -08:00
sdong
ef91894798 Fix potential overflow in CalculateSSTWriteHint() (#6212)
Summary:
level passed into ColumnFamilyData::CalculateSSTWriteHint() can be smaller than base_level in current version, which would cause overflow.
We see ubsan complains:

db/compaction/compaction_job.cc:1511:39: runtime error: load of value 4294967295, which is not a valid value for type 'Env::WriteLifeTimeHint'

and I hope this commit fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6212

Test Plan: Run existing tests and see them to pass.

Differential Revision: D19168442

fbshipit-source-id: bf8fd86f85478ecfa7556db46dc3242de8c83dc9
2019-12-18 17:04:15 -08:00
Jermy Li
f453bcb40d Add unit tests for concurrent CF iteration and drop (#6180)
Summary:
improve https://github.com/facebook/rocksdb/issues/6147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6180

Differential Revision: D19148936

fbshipit-source-id: f691c9879fd51d54e96c1a99670cf85ca4485a89
2019-12-18 11:54:35 -08:00
Mike Kolupaev
ce63eda6f0 Fix use-after-free and double-deleting files in BackgroundCallPurge() (#6193)
Summary:
The bad code was:

```
mutex.Lock(); // `mutex` protects `container`
for (auto& x : container) {
  mutex.Unlock();
  // do stuff to x
  mutex.Lock();
}
```

It's incorrect because both `x` and the iterator may become invalid if another thread modifies the container while this thread is not holding the mutex.

Broken by https://github.com/facebook/rocksdb/pull/5796 - it replaced a `while (!container.empty())` loop with a `for (auto x : container)`.

(RocksDB code does a lot of such unlocking+re-locking of mutexes, and this type of bugs comes up a lot :/ )
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6193

Test Plan: Ran some logdevice integration tests that were crashing without this fix.

Differential Revision: D19116874

Pulled By: al13n321

fbshipit-source-id: 9672bc4227c1b68f46f7436db2b96811adb8c703
2019-12-17 20:08:56 -08:00
Adam Retter
2d16709487 Small tidy and speed up of the travis build (#6181)
Summary:
Cuts about 30-60 seconds to from each Travis Linux build, and about 15 minutes from each macOS build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6181

Differential Revision: D19098357

Pulled By: pdillinger

fbshipit-source-id: 863dd1ab09076ad9b03c2b7914908359628315ae
2019-12-17 13:56:45 -08:00
解轶伦
39fcaf8246 delete superversions in BackgroundCallPurge (#6146)
Summary:
I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB).

Then I found "delete sv" in ~SuperVersion() takes the time.

The backtrace looks like this

DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()

I think it's better to delete in a background thread,  please review it。
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146

Differential Revision: D18972066

fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
2019-12-17 13:22:57 -08:00
Levi Tamasi
02aa22957a Set CompactionIterator::valid_ to false when PrepareBlobOutput indicates error
Summary:
With https://github.com/facebook/rocksdb/pull/6121, errors returned by `PrepareBlobValue`
result in `CompactionIterator::status_` being set to `Corruption` or `IOError`
as appropriate, however, `valid_` is not set to `false`. The error is eventually propagated in
`CompactionJob::ProcessKeyValueCompaction` but only after the main loop completes.
Setting `valid_` to `false` upon errors enables us to terminate the loop early and fail the
compaction sooner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6170

Test Plan:
Ran `make check` and used `db_bench` in BlobDB mode.

fbshipit-source-id: a2ca88a3ca71115e2605bd34a4c795d8a28bef27
2019-12-17 10:20:16 -08:00
Yanqin Jin
7678cf2df7 Use Env::LoadEnv to create custom Env objects (#6196)
Summary:
As title. Previous assumption was that the underlying lib can always return
a shared_ptr<Env>. This is too strong. Therefore, we use Env::LoadEnv to relax
it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6196

Test Plan: make check

Differential Revision: D19133199

Pulled By: riversand963

fbshipit-source-id: c83a0c02a42610d077054f2de1acfc45126b3a75
2019-12-16 20:03:14 -08:00
Zhichao Cao
cddd637997 Merge adjacent file block reads in RocksDB MultiGet() and Add uncompressed block to cache (#6089)
Summary:
In the current MultiGet, if the KV-pairs do not belong to the data blocks in the block cache, multiple blocks are read from a SST. It will trigger one block read for each block request and read them in parallel. In some cases, if some data blocks are adjacent in the SST, the reads for these blocks can be combined to a single large read, which can reduce the system calls and reduce the read latency if possible.

Considering to fill the block cache, if multiple data blocks are in the same memory buffer, we need to copy them to the heap separately. Therefore, only in the case that 1) data block compression is enabled, and 2) compressed block cache is null, we can do combined read. Otherwise, extra memory copy is needed, which may cause extra overhead. In the current case, data blocks will be uncompressed to a new memory space.

Also, in the case that 1) data block compression is enabled, and 2) compressed block cache is null, it is possible the data block is actually not compressed. In the current logic, these data blocks will not be added to the uncompressed_cache. So if memory buffer is shared and the data block is not compressed, the data block are copied to the head and fill the cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6089

Test Plan: Added test case to ParallelIO.MultiGet. Pass make asan_check

Differential Revision: D18734668

Pulled By: zhichao-cao

fbshipit-source-id: 67c5615ed373e51e42635fd74b36f8f3a66d5da4
2019-12-16 16:26:03 -08:00
Levi Tamasi
db7c687523 Fix a data race related to memtable trimming (#6187)
Summary:
https://github.com/facebook/rocksdb/pull/6177 introduced a data race
involving `MemTableList::InstallNewVersion` and `MemTableList::NumFlushed`.
The patch fixes this by caching whether the current version has any
memtable history (i.e. flushed memtables that are kept around for
transaction conflict checking) in an `std::atomic<bool>` member called
`current_has_history_`, similarly to how `current_memory_usage_excluding_last_`
is handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6187

Test Plan:
```
make clean
COMPILE_WITH_TSAN=1 make db_test -j24
./db_test
```

Differential Revision: D19084059

Pulled By: ltamasi

fbshipit-source-id: 327a5af9700fb7102baea2cc8903c085f69543b9
2019-12-16 13:16:31 -08:00
Levi Tamasi
bd8404feff Do not schedule memtable trimming if there is no history (#6177)
Summary:
We have observed an increase in CPU load caused by frequent calls to
`ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
when using `max_write_buffer_size_to_maintain` to limit the amount of
memtable history maintained for transaction conflict checking. Part of the issue
is that trimming can potentially be scheduled even if there is no memtable
history. The patch adds a check that fixes this.

See also https://github.com/facebook/rocksdb/pull/6169.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6177

Test Plan:
Compared `perf` output for

```
./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
```

before and after the change. There is a significant reduction for the call chain
`rocksdb::DBImpl::TrimMemtableHistory` -> `rocksdb::ColumnFamilyData::InstallSuperVersion` ->
`rocksdb::ThreadLocalPtr::StaticMeta::Scrape` even without https://github.com/facebook/rocksdb/pull/6169.

Differential Revision: D19057445

Pulled By: ltamasi

fbshipit-source-id: dff81882d7b280e17eda7d9b072a2d4882c50f79
2019-12-13 19:11:19 -08:00
anand76
afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00
Peter Dillinger
58d46d1915 Add useful idioms to Random API (OneInOpt, PercentTrue) (#6154)
Summary:
And clean up related code, especially in stress test.

(More clean up of db_stress_test_base.cc coming after this.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6154

Test Plan: make check, make blackbox_crash_test for a bit

Differential Revision: D18938180

Pulled By: pdillinger

fbshipit-source-id: 524d27621b8dbb25f6dff40f1081e7c00630357e
2019-12-13 14:30:14 -08:00
Levi Tamasi
6d54eb3dc2 Do not create/install new SuperVersion if nothing was deleted during memtable trim (#6169)
Summary:
We have observed an increase in CPU load caused by frequent calls to
`ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
when using `max_write_buffer_size_to_maintain` to limit the amount of
memtable history maintained for transaction conflict checking. As it turns out,
this is caused by the code creating and installing a new `SuperVersion` even if
no memtables were actually trimmed. The patch adds a check to avoid this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6169

Test Plan:
Compared `perf` output for

```
./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
```

before and after the change. With the fix, the call chain `rocksdb::DBImpl::TrimMemtableHistory` ->
`rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape`
no longer registers in the `perf` report.

Differential Revision: D19031509

Pulled By: ltamasi

fbshipit-source-id: 02686fce594e5b50eba0710e4b28a9b808c8aa20
2019-12-13 13:29:29 -08:00
Levi Tamasi
583c6953d8 Move out valid blobs from the oldest blob files during compaction (#6121)
Summary:
The patch adds logic that relocates live blobs from the oldest N non-TTL
blob files as they are encountered during compaction (assuming the BlobDB
configuration option `enable_garbage_collection` is `true`), where N is defined
as the number of immutable non-TTL blob files multiplied by the value of
a new BlobDB configuration option called `garbage_collection_cutoff`.
(The default value of this parameter is 0.25, that is, by default the valid blobs
residing in the oldest 25% of immutable non-TTL blob files are relocated.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6121

Test Plan: Added unit test and tested using the BlobDB mode of `db_bench`.

Differential Revision: D18785357

Pulled By: ltamasi

fbshipit-source-id: 8c21c512a18fba777ec28765c88682bb1a5e694e
2019-12-13 10:13:05 -08:00
Jermy Li
c2029f9716 Support concurrent CF iteration and drop (#6147)
Summary:
It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI.

This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.

fixed https://github.com/facebook/rocksdb/issues/5982
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147

Differential Revision: D18926378

fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
2019-12-12 19:04:48 -08:00
奏之章
c4ce8e637f Fix RangeDeletion bug (#6062)
Summary:
Read keys from a snapshot that a range deletion were added after the snapshot  was created and this range deletion was inside an immutable memtable, we will get wrong key set.
More detail rest in codes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6062

Differential Revision: D18966785

Pulled By: pdillinger

fbshipit-source-id: 38a60bb1e2d0a1dbfc8ec641617200b6a02b86c3
2019-12-12 15:18:02 -08:00
Connor
a844591201 wait pending memtable writes on file ingestion or compact range (#6113)
Summary:
**Summary:**
This PR fixes two unordered_write related issues:
- ingestion job may skip the necessary memtable flush https://github.com/facebook/rocksdb/issues/6026
- compact range may cause memtable is flushed before pending unordered write finished
    1. `CompactRange` triggers memtable flush but doesn't wait for pending-writes
    2.  there are some pending writes but memtable is already flushed
    3.  the memtable related WAL is removed( note that the pending-writes were recorded in that WAL).
    4.  pending-writes write to newer created memtable
    5. there is a restart
    6. missing the previous pending-writes because WAL is removed but they aren't included in SST.

**How to solve:**
- Wait pending memtable writes before ingestion job check memtable key range
- Wait pending memtable writes before flush memtable.
**Note that: `CompactRange` calls `RangesOverlapWithMemtables` too without waiting for pending waits, but I'm not sure whether it affects the correctness.**

**Test Plan:**
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6113

Differential Revision: D18895674

Pulled By: maysamyabandeh

fbshipit-source-id: da22b4476fc7e06c176020e7cc171eb78189ecaf
2019-12-12 14:08:02 -08:00
Levi Tamasi
e1dfe80fe0 Mark BlobIndex::DebugString const
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6157

Test Plan: make check

Differential Revision: D18944259

Pulled By: ltamasi

fbshipit-source-id: 7fb29447b52d801215bd6ab811e229a7fa2c763d
2019-12-11 17:19:43 -08:00
Peter Dillinger
d0ad3c59d8 Fix c_test:filter for various CACHE_LINE_SIZEs (#6153)
Summary:
This test was recently updated but failed to account for Bloom
schema variance by CACHE_LINE_SIZE. (Since CACHE_LINE_SIZE is not
defined in our C code, the test now simply allows a valid result for any
CACHE_LINE_SIZE, not just the current one.)

Unblock https://github.com/facebook/rocksdb/issues/5932
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6153

Test Plan:
ran unit test with builds TEST_CACHE_LINE_SIZE=128, =256, and
unset (64 on Intel)

Differential Revision: D18936015

Pulled By: pdillinger

fbshipit-source-id: e5e3852f95283d34d624632c1ae8d3adb2f2662c
2019-12-11 15:17:08 -08:00
奏之章
3717a88289 Fix UniversalCompaction trivial move bug (#6067)
Summary:
`curr.level` is `c->inputs_` index, not real level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6067

Differential Revision: D18935726

fbshipit-source-id: 4354e6e9cd900ca56c96e9d770f0ab6634e45daf
2019-12-11 11:27:53 -08:00
Yi Wu
05a86318a7 Remove unused low_pri_write_rate_limiter_ (#6068)
Summary:
`low_pri_write_rate_limiter_` is not being used. Removing. `WriteController` has an internal low_pri rate limiter which is the real rate limiter for low-pri writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6068

Test Plan: make

Differential Revision: D18664120

fbshipit-source-id: dfe3e4de033cf3522b67781b383aad7d0936034c
2019-12-11 10:28:33 -08:00
sdong
a68dff5c35 Apply formatter to some recent commits (#6138)
Summary:
Formatter somehow complains some recent lines changed. Apply them to make the formatter happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6138

Test Plan: See CI passes.

Differential Revision: D18895950

fbshipit-source-id: 7d1696cf3e3a682bc10a30cdca748a23c6565255
2019-12-09 15:49:49 -08:00
Peter Dillinger
e43d2c4424 Fix & test rocksdb_filterpolicy_create_bloom_full (#6132)
Summary:
Add overrides needed in FilterPolicy wrapper to fix
rocksdb_filterpolicy_create_bloom_full (see issue https://github.com/facebook/rocksdb/issues/6129). Re-enabled
assertion in BloomFilterPolicy::CreateFilter that was being violated.
Expanded c_test to identify Bloom filter implementations by FP counts.
(Without the fix, updated test will trigger assertion and fail otherwise
without the assertion.)

Fixes https://github.com/facebook/rocksdb/issues/6129
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6132

Test Plan: updated c_test, also run under valgrind.

Differential Revision: D18864911

Pulled By: pdillinger

fbshipit-source-id: 08e81d7b5368b08e501cd402ef5583f2650c19fa
2019-12-09 12:21:14 -08:00
Ziyue Yang
7e2f831924 Fix wrong ExtractUserKey usage in BlockBasedTableBuilder::EnterUnbuff… (#6100)
Summary:
BlockBasedTableBuilder uses ExtractUserKey in EnterUnbuffered. This would
cause index filter building error, since user-provided timestamp is supported
by ExtractUserKeyAndStripTimestamp, and it's used in Add. This commit changes
ExtractUserKey to ExtractUserKeyAndStripTimestamp.

A test case is also added by modifying DBBasicTestWithTimestampWithParam_
PutAndGet test in db_basic_test to cover ExtractUserKeyAndStripTimestamp usage
in both kBuffered and kUnbuffered state of BlockBasedTableBuilder.

Before the ExtractUserKeyAndStripTimstamp fix:

```
$ ./db_basic_test --gtest_filter="*PutAndGet*"
Note: Google Test filter = *PutAndGet*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam
[ RUN      ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
[  FAILED  ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false (1177 ms)
[ RUN      ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1
[       OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1056 ms)
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2233 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (2233 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false

 1 FAILED TEST
```

After the ExtractUserKeyAndStripTimstamp fix:

```
$ ./db_basic_test --gtest_filter="*PutAndGet*"
Note: Google Test filter = *PutAndGet*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam
[ RUN      ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0
[       OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0 (1417 ms)
[ RUN      ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1
[       OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1041 ms)
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2458 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (2458 ms total)
[  PASSED  ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6100

Differential Revision: D18769654

Pulled By: riversand963

fbshipit-source-id: 76c2cf2c9a5e0d85db95d98e812e6af0c2a15c6b
2019-12-09 10:57:02 -08:00
Peter Dillinger
3a6d9436e8 Use SpecialSkipListFactory in RecalculateScoreAfterPicking (#6125)
Summary:
Test DBTestUniversalCompaction.RecalculateScoreAfterPicking was
flaky on ARM, so it now uses SpecialSkipListFactory (like other tests)
for predictable memtable flushes.

Fixes https://github.com/facebook/rocksdb/issues/5736
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6125

Test Plan:
while ./db_universal_compaction_test; do :; done # for a
while on ARM and on Intel (both Linux)

Differential Revision: D18864821

Pulled By: pdillinger

fbshipit-source-id: 2f3ca0ea66ce420dcd6d41b0ec12377112a5a79f
2019-12-09 09:23:50 -08:00
sdong
7d79b32618 Break db_stress_tool.cc to a list of source files (#6134)
Summary:
db_stress_tool.cc now is a giant file. In order to main it easier to improve and maintain, break it down to multiple source files.
Most classes are turned into their own files. Separate .h and .cc files are created for gflag definiations. Another .h and .cc files are created for some common functions. Some test execution logic that is only loosely related to class StressTest is moved to db_stress_driver.h and db_stress_driver.cc. All the files are located under db_stress_tool/. The directory name is created as such because if we end it with either stress or test, .gitignore will ignore any file under it and makes it prone to issues in developements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6134

Test Plan: Build under GCC7 with and without LITE on using GNU Make. Build with GCC 4.8. Build with cmake with -DWITH_TOOL=1

Differential Revision: D18876064

fbshipit-source-id: b25d0a7451840f31ac0f5ebb0068785f783fdf7d
2019-12-08 23:51:01 -08:00
Yanqin Jin
fe1147db1c Let DBSecondary close files after catch up (#6114)
Summary:
After secondary instance replays the logs from primary, certain files become
obsolete. The secondary should find these files, evict their table readers from
table cache and close them. If this is not done, the secondary will hold on to
these files and prevent their space from being freed.

Test plan (devserver):
```
$./db_secondary_test --gtest_filter=DBSecondaryTest.SecondaryCloseFiles
$make check
$./db_stress -ops_per_thread=100000 -enable_secondary=true -threads=32 -secondary_catch_up_one_in=10000 -clear_column_family_one_in=1000 -reopen=100
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6114

Differential Revision: D18769998

Pulled By: riversand963

fbshipit-source-id: 5d1f151567247196164e1b79d8402fa2045b9120
2019-12-02 17:45:03 -08:00
David Palm
048472f620 Add missing DataBlock-releated functions to the C-API (#6101)
Summary:
Adds two missing functions to the C-API:

- `rocksdb_block_based_options_set_data_block_index_type`
- `rocksdb_block_based_options_set_data_block_hash_ratio`

This enables users in other languages to enjoy the new(-ish) feature.

The changes here are partially overlapping with [another PR](https://github.com/facebook/rocksdb/pull/5630) but are more focused on the DataBlock indexing options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6101

Differential Revision: D18765639

fbshipit-source-id: 4a8947e71b179f26fa1eb83c267dd47ee64ac3b3
2019-12-02 11:00:09 -08:00
Yanqin Jin
09fcf4fb6b Fix a potential bug scheduling unnecessary threads (#6104)
Summary:
RocksDB should decrement the counter `unscheduled_flushes_` as soon as the bg
thread is scheduled. Before this fix, the counter is decremented only when the
bg thread starts and picks an element from the flush queue. This may cause more
than necessary bg threads to be scheduled. Not a correctness issue, but may
affect flush thread count.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6104

Test Plan:
```
make check
```

Differential Revision: D18735584

Pulled By: riversand963

fbshipit-source-id: d36272d4a08a494aeeab6200a3cff7a3d1a2dc10
2019-11-27 14:48:49 -08:00
sdong
aa1857e2df Support options.max_open_files = -1 with periodic_compaction_seconds (#6090)
Summary:
options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090

Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.

Differential Revision: D18702268

fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb
2019-11-26 21:39:56 -08:00
Peter Dillinger
ca3b6c28c9 Expose and elaborate FilterBuildingContext (#6088)
Summary:
This change enables custom implementations of FilterPolicy to
wrap a variety of NewBloomFilterPolicy and select among them based on
contextual information such as table level and compaction style.

* Moves FilterBuildingContext to public API and elaborates it with more
useful data. (It would be nice to put more general options-like data,
but at the time this object is constructed, we are using internal APIs
ImmutableCFOptions and MutableCFOptions and don't have easy access to
ColumnFamilyOptions that I can tell.)

* Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to
GetBuilderWithContext, because it's now public.

* Plumbs through the table's "level_at_creation" for filter building
context.

* Simplified some tests by adding GetBuilder() to
MockBlockBasedTableTester.

* Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including
sample wrapper class LevelAndStyleCustomFilterPolicy.

* Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits
where it does not reset perf context.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088

Test Plan: make check, valgrind on db_bloom_filter_test

Differential Revision: D18697817

Pulled By: pdillinger

fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4
2019-11-26 18:24:10 -08:00
sdong
77eab5c85a Make default value of options.ttl to be 30 days when it is supported. (#6073)
Summary:
By default options.ttl is disabled. We believe a better default will be 30 days, which means deleted data the database will be removed from SST files slightly after 30 days, for most of the cases.

Make the default UINT64_MAX - 1 to indicate that it is not overridden by users.

Change periodic_compaction_seconds to be UINT64_MAX - 1 to UINT64_MAX  too to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073

Test Plan: Add unit tests for it.

Differential Revision: D18669626

fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8
2019-11-26 10:00:32 -08:00
Sagar Vemuri
669ea77d9f Support ttl in Universal Compaction (#6071)
Summary:
`options.ttl` is now supported in universal compaction, similar to how periodic compactions are implemented in PR https://github.com/facebook/rocksdb/issues/5970 .
Setting `options.ttl` will simply set `options.periodic_compaction_seconds` to execute the periodic compactions code path.
Discarded PR https://github.com/facebook/rocksdb/issues/4749 in lieu of this.

This is a short term work-around/hack of falling back to periodic compactions when ttl is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6071

Test Plan: Added a unit test.

Differential Revision: D18668336

Pulled By: sagar0

fbshipit-source-id: e75f5b81ba949f77ef9eff05e44bb1c757f58612
2019-11-22 22:13:35 -08:00
sdong
d8c28e692a Support options.ttl with options.max_open_files = -1 (#6060)
Summary:
Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead.

Note that, this change will break forward compatibility for release 5.1 and older.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060

Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0.

Differential Revision: D18631623

fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
2019-11-22 21:23:00 -08:00
Little-Wallace
e50b64bdba fix unstable unittest caused by #5958 (#6061)
Summary:
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>

This PR is to fix unstable unit test added by  (https://github.com/facebook/rocksdb/pull/5958).
I set SYNC_POINT in PickCompaction before. If IntraL0Compaction was trigger,  the compact job which compact sst to base level would start instantly. If the compaction thread run faster than unittest main thread, we may observe the number of files in L0 reduce.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6061

Differential Revision: D18642301

fbshipit-source-id: 3e4da2ee963532b6e142336951ea3f47d46df148
2019-11-21 15:24:01 -08:00
Yanqin Jin
0ce0edbe12 Fix a data race between GetColumnFamilyMetaData and MarkFilesBeingCompacted (#6056)
Summary:
Use db mutex to protect the execution of Version::GetColumnFamilyMetaData()
called in DBImpl::GetColumnFamilyMetaData().
Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted()
for access to FileMetaData::being_compacted.
Other than mutex, there are several more alternatives.

- Make FileMetaData::being_compacted an atomic variable. This will make
  FileMetaData non-copy-able.

- Separate being_compacted from FileMetaData. This requires re-organizing data
  structures that are already used in many places.

Test Plan (dev server):
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056

Differential Revision: D18620488

Pulled By: riversand963

fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b
2019-11-20 16:36:29 -08:00
sdong
27ec3b3466 Sanitize input in DB::MultiGet() API (#6054)
Summary:
The new DB::MultiGet() doesn't validate input for num_keys > 1 and GCC-9 complains about it. Fix it by directly return when num_keys == 0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6054

Test Plan: Build with GCC-9 and see it passes.

Differential Revision: D18608958

fbshipit-source-id: 1c279aff3c7fe6e9d5a6d085ed02550ecea4fdb2
2019-11-20 10:38:01 -08:00
Peter Dillinger
0306e01233 Fixes for g++ 4.9.2 compatibility (#6053)
Summary:
Taken from merryChris in https://github.com/facebook/rocksdb/issues/6043

Stackoverflow ref on {{}} vs. {}:
https://stackoverflow.com/questions/26947704/implicit-conversion-failure-from-initializer-list

Note to reader: .clear() does not empty out an ostringstream, but .str("")
suffices because we don't have to worry about clearing error flags.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6053

Test Plan: make check, manual run of filter_bench

Differential Revision: D18602259

Pulled By: pdillinger

fbshipit-source-id: f6190f83b8eab4e80e7c107348839edabe727841
2019-11-19 15:43:37 -08:00
Little-Wallace
ec3e3c3e02 Fix corruption with intra-L0 on ingested files (#5958)
Summary:
## Problem Description

Our process was abort when it call `CheckConsistency`. And the information in  `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ".  Here are the causes of the accident I investigated.

* RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested.
* When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number.
* `CheckConsistency`  determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal.
* If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst,  the `smallest_seqno` of this new file will be smaller than his `largest_seqno`.
    * If more than one ingested file was ingested before memtable schedule flush,  and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable.  So `CheckConsistency` will return a `Corruption`.
    * If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start,  but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2).  **But there was still some data in s1  written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1.** Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files:
    > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno  < s1.largest_seqno

So I skip pick sst file which was ingested in function `FindIntraL0Compaction `

## Reason

Here is my bug report: https://github.com/facebook/rocksdb/issues/5913

There are two situations that can cause the check to fail.

### First situation:
- First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest.

- If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst.

- Then some data was put into memtable, and memtable was flushed to L0. We called this sst B.
- RocksDB check consistency , and find the `smallest_seqno` of B is  less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested.

### Secondary situaion

- First we have flushed many sst in L0,  we call them [s1, s2, s3].

- There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1.  And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1.

- m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5.

- [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction.  We call it s6.
  - compacted 4@0 files to L0
- When s6 is added into manifest,  the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5.  But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5.
   - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958

Differential Revision: D18601316

fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
2019-11-19 15:09:11 -08:00
Levi Tamasi
019eb1f402 Disable blob iterator test with max_sequential_skip_in_iterations==0 in LITE mode (#6052)
Summary:
The SetOptions API used by the test is not supported in LITE mode,
so we should skip the new chunk in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6052

Test Plan: Ran the unit tests both in regular and LITE mode.

Differential Revision: D18601763

Pulled By: ltamasi

fbshipit-source-id: 883d6882771e0fb4aae72bb77ba4e63d9febec04
2019-11-19 15:02:41 -08:00
tabokie
20b48c6478 Fix blob context when db_iter uses seek (#6051)
Summary:
Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex.
Also patch existing test for the mentioned code path.
Signed-off-by: tabokie <xy.tao@outlook.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051

Differential Revision: D18596274

Pulled By: ltamasi

fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278
2019-11-19 11:39:02 -08:00
anand76
38cc611297 Fix test failure in LITE mode (#6050)
Summary:
GetSupportedCompressions() is not available in LITE build, so check and use Snappy compression in db_basic_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6050

Test Plan:
make LITE=1 check
make check

Differential Revision: D18588114

Pulled By: anand1976

fbshipit-source-id: a193de58c44f91bcc237107f25dbc1b9458eef3d
2019-11-19 10:13:24 -08:00
anand76
5b9233bfe8 Fix a test failure on systems that don't have Snappy compression libraries (#6038)
Summary:
The ParallelIO/DBBasicTestWithParallelIO.MultiGet/11 test fails if Snappy compression library is not installed, since RocksDB defaults to Snappy if none is specified. So dynamically determine the supported compression types and pick the first one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6038

Differential Revision: D18532370

Pulled By: anand1976

fbshipit-source-id: a0a735114d1f8892ea09f7c4af8688d7bcc5b075
2019-11-18 09:37:18 -08:00
Little-Wallace
f65ec09ef8 Fix IngestExternalFile's bug with two_write_queue (#5976)
Summary:
When two_write_queue enable, IngestExternalFile performs EnterUnbatched on both write queues. SwitchMemtable also EnterUnbatched on 2nd write queue when this option is enabled. When the call stack includes IngestExternalFile -> FlushMemTable -> SwitchMemtable, this results into a deadlock.
The implemented solution is to pass on the existing writes_stopped argument in FlushMemTable to skip EnterUnbatched in SwitchMemtable.
Fixes https://github.com/facebook/rocksdb/issues/5974
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5976

Differential Revision: D18535943

Pulled By: maysamyabandeh

fbshipit-source-id: a4f9d4964c10d4a7ca06b1e0102ca2ec395512bc
2019-11-15 14:00:37 -08:00
Peter Dillinger
00d58a370e Abandon use of folly::Optional (#6036)
Summary:
Had complications with LITE build and valgrind test.
Reverts/fixes small parts of PR https://github.com/facebook/rocksdb/issues/6007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6036

Test Plan:
make LITE=1 all check
and
ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 make -j24 db_bloom_filter_test && ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 ./db_bloom_filter_test

Differential Revision: D18512238

Pulled By: pdillinger

fbshipit-source-id: 37213cf0d309edf11c483fb4b2fb6c02c2cf2b28
2019-11-14 14:04:15 -08:00
Peter Dillinger
f059c7d9b9 New Bloom filter implementation for full and partitioned filters (#6007)
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.

Speed

The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.

Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):

$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
  Single filter net ns/op: 26.2825
  Random filter net ns/op: 150.459
    Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
  Single filter net ns/op: 63.2978
  Random filter net ns/op: 188.038
    Average FP rate %: 1.13823

Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.

The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.

Accuracy

The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.

Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)

Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120

Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.

Compatibility

Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007

Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).

Differential Revision: D18294749

Pulled By: pdillinger

fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
2019-11-13 16:44:01 -08:00
sdong
bb23bfe63c Fix a regression bug on total order seek with prefix enabled and range delete (#6028)
Summary:
Recent change https://github.com/facebook/rocksdb/pull/5861 mistakely use "prefix_extractor_ != nullptr" as the condition to determine whehter prefix bloom filter isused. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakely applied to total order seek, and introduces a bug in following corner case:
Because of RangeDelete(), a file's largest key is extended. Seek key falls into the range deleted file, so level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator to the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added.
Fix the bug by setting prefix_extractor to be null if total order seek is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028

Test Plan: Add a unit test which fails without the fix.

Differential Revision: D18479063

fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3
2019-11-13 10:11:34 -08:00
anand76
6c7b1a0cc7 Batched MultiGet API for multiple column families (#5816)
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.

As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816

Test Plan:
make check
make asan_check
asan_crash_test

Differential Revision: D18408676

Pulled By: anand1976

fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
2019-11-12 13:52:55 -08:00
anand76
03ce7fb292 Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014)
Summary:
The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014

Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after.

Differential Revision: D18412753

Pulled By: anand1976

fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72
2019-11-11 16:59:15 -08:00