Commit Graph

3921 Commits

Author SHA1 Message Date
Mike Kolupaev
ce63eda6f0 Fix use-after-free and double-deleting files in BackgroundCallPurge() (#6193)
Summary:
The bad code was:

```
mutex.Lock(); // `mutex` protects `container`
for (auto& x : container) {
  mutex.Unlock();
  // do stuff to x
  mutex.Lock();
}
```

It's incorrect because both `x` and the iterator may become invalid if another thread modifies the container while this thread is not holding the mutex.

Broken by https://github.com/facebook/rocksdb/pull/5796 - it replaced a `while (!container.empty())` loop with a `for (auto x : container)`.

(RocksDB code does a lot of such unlocking+re-locking of mutexes, and this type of bugs comes up a lot :/ )
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6193

Test Plan: Ran some logdevice integration tests that were crashing without this fix.

Differential Revision: D19116874

Pulled By: al13n321

fbshipit-source-id: 9672bc4227c1b68f46f7436db2b96811adb8c703
2019-12-17 20:08:56 -08:00
Adam Retter
2d16709487 Small tidy and speed up of the travis build (#6181)
Summary:
Cuts about 30-60 seconds to from each Travis Linux build, and about 15 minutes from each macOS build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6181

Differential Revision: D19098357

Pulled By: pdillinger

fbshipit-source-id: 863dd1ab09076ad9b03c2b7914908359628315ae
2019-12-17 13:56:45 -08:00
解轶伦
39fcaf8246 delete superversions in BackgroundCallPurge (#6146)
Summary:
I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB).

Then I found "delete sv" in ~SuperVersion() takes the time.

The backtrace looks like this

DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()

I think it's better to delete in a background thread,  please review it。
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146

Differential Revision: D18972066

fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
2019-12-17 13:22:57 -08:00
Levi Tamasi
02aa22957a Set CompactionIterator::valid_ to false when PrepareBlobOutput indicates error
Summary:
With https://github.com/facebook/rocksdb/pull/6121, errors returned by `PrepareBlobValue`
result in `CompactionIterator::status_` being set to `Corruption` or `IOError`
as appropriate, however, `valid_` is not set to `false`. The error is eventually propagated in
`CompactionJob::ProcessKeyValueCompaction` but only after the main loop completes.
Setting `valid_` to `false` upon errors enables us to terminate the loop early and fail the
compaction sooner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6170

Test Plan:
Ran `make check` and used `db_bench` in BlobDB mode.

fbshipit-source-id: a2ca88a3ca71115e2605bd34a4c795d8a28bef27
2019-12-17 10:20:16 -08:00
Yanqin Jin
7678cf2df7 Use Env::LoadEnv to create custom Env objects (#6196)
Summary:
As title. Previous assumption was that the underlying lib can always return
a shared_ptr<Env>. This is too strong. Therefore, we use Env::LoadEnv to relax
it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6196

Test Plan: make check

Differential Revision: D19133199

Pulled By: riversand963

fbshipit-source-id: c83a0c02a42610d077054f2de1acfc45126b3a75
2019-12-16 20:03:14 -08:00
Zhichao Cao
cddd637997 Merge adjacent file block reads in RocksDB MultiGet() and Add uncompressed block to cache (#6089)
Summary:
In the current MultiGet, if the KV-pairs do not belong to the data blocks in the block cache, multiple blocks are read from a SST. It will trigger one block read for each block request and read them in parallel. In some cases, if some data blocks are adjacent in the SST, the reads for these blocks can be combined to a single large read, which can reduce the system calls and reduce the read latency if possible.

Considering to fill the block cache, if multiple data blocks are in the same memory buffer, we need to copy them to the heap separately. Therefore, only in the case that 1) data block compression is enabled, and 2) compressed block cache is null, we can do combined read. Otherwise, extra memory copy is needed, which may cause extra overhead. In the current case, data blocks will be uncompressed to a new memory space.

Also, in the case that 1) data block compression is enabled, and 2) compressed block cache is null, it is possible the data block is actually not compressed. In the current logic, these data blocks will not be added to the uncompressed_cache. So if memory buffer is shared and the data block is not compressed, the data block are copied to the head and fill the cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6089

Test Plan: Added test case to ParallelIO.MultiGet. Pass make asan_check

Differential Revision: D18734668

Pulled By: zhichao-cao

fbshipit-source-id: 67c5615ed373e51e42635fd74b36f8f3a66d5da4
2019-12-16 16:26:03 -08:00
Levi Tamasi
db7c687523 Fix a data race related to memtable trimming (#6187)
Summary:
https://github.com/facebook/rocksdb/pull/6177 introduced a data race
involving `MemTableList::InstallNewVersion` and `MemTableList::NumFlushed`.
The patch fixes this by caching whether the current version has any
memtable history (i.e. flushed memtables that are kept around for
transaction conflict checking) in an `std::atomic<bool>` member called
`current_has_history_`, similarly to how `current_memory_usage_excluding_last_`
is handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6187

Test Plan:
```
make clean
COMPILE_WITH_TSAN=1 make db_test -j24
./db_test
```

Differential Revision: D19084059

Pulled By: ltamasi

fbshipit-source-id: 327a5af9700fb7102baea2cc8903c085f69543b9
2019-12-16 13:16:31 -08:00
Levi Tamasi
bd8404feff Do not schedule memtable trimming if there is no history (#6177)
Summary:
We have observed an increase in CPU load caused by frequent calls to
`ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
when using `max_write_buffer_size_to_maintain` to limit the amount of
memtable history maintained for transaction conflict checking. Part of the issue
is that trimming can potentially be scheduled even if there is no memtable
history. The patch adds a check that fixes this.

See also https://github.com/facebook/rocksdb/pull/6169.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6177

Test Plan:
Compared `perf` output for

```
./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
```

before and after the change. There is a significant reduction for the call chain
`rocksdb::DBImpl::TrimMemtableHistory` -> `rocksdb::ColumnFamilyData::InstallSuperVersion` ->
`rocksdb::ThreadLocalPtr::StaticMeta::Scrape` even without https://github.com/facebook/rocksdb/pull/6169.

Differential Revision: D19057445

Pulled By: ltamasi

fbshipit-source-id: dff81882d7b280e17eda7d9b072a2d4882c50f79
2019-12-13 19:11:19 -08:00
anand76
afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00
Peter Dillinger
58d46d1915 Add useful idioms to Random API (OneInOpt, PercentTrue) (#6154)
Summary:
And clean up related code, especially in stress test.

(More clean up of db_stress_test_base.cc coming after this.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6154

Test Plan: make check, make blackbox_crash_test for a bit

Differential Revision: D18938180

Pulled By: pdillinger

fbshipit-source-id: 524d27621b8dbb25f6dff40f1081e7c00630357e
2019-12-13 14:30:14 -08:00
Levi Tamasi
6d54eb3dc2 Do not create/install new SuperVersion if nothing was deleted during memtable trim (#6169)
Summary:
We have observed an increase in CPU load caused by frequent calls to
`ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
when using `max_write_buffer_size_to_maintain` to limit the amount of
memtable history maintained for transaction conflict checking. As it turns out,
this is caused by the code creating and installing a new `SuperVersion` even if
no memtables were actually trimmed. The patch adds a check to avoid this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6169

Test Plan:
Compared `perf` output for

```
./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
```

before and after the change. With the fix, the call chain `rocksdb::DBImpl::TrimMemtableHistory` ->
`rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape`
no longer registers in the `perf` report.

Differential Revision: D19031509

Pulled By: ltamasi

fbshipit-source-id: 02686fce594e5b50eba0710e4b28a9b808c8aa20
2019-12-13 13:29:29 -08:00
Levi Tamasi
583c6953d8 Move out valid blobs from the oldest blob files during compaction (#6121)
Summary:
The patch adds logic that relocates live blobs from the oldest N non-TTL
blob files as they are encountered during compaction (assuming the BlobDB
configuration option `enable_garbage_collection` is `true`), where N is defined
as the number of immutable non-TTL blob files multiplied by the value of
a new BlobDB configuration option called `garbage_collection_cutoff`.
(The default value of this parameter is 0.25, that is, by default the valid blobs
residing in the oldest 25% of immutable non-TTL blob files are relocated.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6121

Test Plan: Added unit test and tested using the BlobDB mode of `db_bench`.

Differential Revision: D18785357

Pulled By: ltamasi

fbshipit-source-id: 8c21c512a18fba777ec28765c88682bb1a5e694e
2019-12-13 10:13:05 -08:00
Jermy Li
c2029f9716 Support concurrent CF iteration and drop (#6147)
Summary:
It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI.

This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.

fixed https://github.com/facebook/rocksdb/issues/5982
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147

Differential Revision: D18926378

fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
2019-12-12 19:04:48 -08:00
奏之章
c4ce8e637f Fix RangeDeletion bug (#6062)
Summary:
Read keys from a snapshot that a range deletion were added after the snapshot  was created and this range deletion was inside an immutable memtable, we will get wrong key set.
More detail rest in codes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6062

Differential Revision: D18966785

Pulled By: pdillinger

fbshipit-source-id: 38a60bb1e2d0a1dbfc8ec641617200b6a02b86c3
2019-12-12 15:18:02 -08:00
Connor
a844591201 wait pending memtable writes on file ingestion or compact range (#6113)
Summary:
**Summary:**
This PR fixes two unordered_write related issues:
- ingestion job may skip the necessary memtable flush https://github.com/facebook/rocksdb/issues/6026
- compact range may cause memtable is flushed before pending unordered write finished
    1. `CompactRange` triggers memtable flush but doesn't wait for pending-writes
    2.  there are some pending writes but memtable is already flushed
    3.  the memtable related WAL is removed( note that the pending-writes were recorded in that WAL).
    4.  pending-writes write to newer created memtable
    5. there is a restart
    6. missing the previous pending-writes because WAL is removed but they aren't included in SST.

**How to solve:**
- Wait pending memtable writes before ingestion job check memtable key range
- Wait pending memtable writes before flush memtable.
**Note that: `CompactRange` calls `RangesOverlapWithMemtables` too without waiting for pending waits, but I'm not sure whether it affects the correctness.**

**Test Plan:**
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6113

Differential Revision: D18895674

Pulled By: maysamyabandeh

fbshipit-source-id: da22b4476fc7e06c176020e7cc171eb78189ecaf
2019-12-12 14:08:02 -08:00
Levi Tamasi
e1dfe80fe0 Mark BlobIndex::DebugString const
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6157

Test Plan: make check

Differential Revision: D18944259

Pulled By: ltamasi

fbshipit-source-id: 7fb29447b52d801215bd6ab811e229a7fa2c763d
2019-12-11 17:19:43 -08:00
Peter Dillinger
d0ad3c59d8 Fix c_test:filter for various CACHE_LINE_SIZEs (#6153)
Summary:
This test was recently updated but failed to account for Bloom
schema variance by CACHE_LINE_SIZE. (Since CACHE_LINE_SIZE is not
defined in our C code, the test now simply allows a valid result for any
CACHE_LINE_SIZE, not just the current one.)

Unblock https://github.com/facebook/rocksdb/issues/5932
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6153

Test Plan:
ran unit test with builds TEST_CACHE_LINE_SIZE=128, =256, and
unset (64 on Intel)

Differential Revision: D18936015

Pulled By: pdillinger

fbshipit-source-id: e5e3852f95283d34d624632c1ae8d3adb2f2662c
2019-12-11 15:17:08 -08:00
奏之章
3717a88289 Fix UniversalCompaction trivial move bug (#6067)
Summary:
`curr.level` is `c->inputs_` index, not real level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6067

Differential Revision: D18935726

fbshipit-source-id: 4354e6e9cd900ca56c96e9d770f0ab6634e45daf
2019-12-11 11:27:53 -08:00
Yi Wu
05a86318a7 Remove unused low_pri_write_rate_limiter_ (#6068)
Summary:
`low_pri_write_rate_limiter_` is not being used. Removing. `WriteController` has an internal low_pri rate limiter which is the real rate limiter for low-pri writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6068

Test Plan: make

Differential Revision: D18664120

fbshipit-source-id: dfe3e4de033cf3522b67781b383aad7d0936034c
2019-12-11 10:28:33 -08:00
sdong
a68dff5c35 Apply formatter to some recent commits (#6138)
Summary:
Formatter somehow complains some recent lines changed. Apply them to make the formatter happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6138

Test Plan: See CI passes.

Differential Revision: D18895950

fbshipit-source-id: 7d1696cf3e3a682bc10a30cdca748a23c6565255
2019-12-09 15:49:49 -08:00
Peter Dillinger
e43d2c4424 Fix & test rocksdb_filterpolicy_create_bloom_full (#6132)
Summary:
Add overrides needed in FilterPolicy wrapper to fix
rocksdb_filterpolicy_create_bloom_full (see issue https://github.com/facebook/rocksdb/issues/6129). Re-enabled
assertion in BloomFilterPolicy::CreateFilter that was being violated.
Expanded c_test to identify Bloom filter implementations by FP counts.
(Without the fix, updated test will trigger assertion and fail otherwise
without the assertion.)

Fixes https://github.com/facebook/rocksdb/issues/6129
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6132

Test Plan: updated c_test, also run under valgrind.

Differential Revision: D18864911

Pulled By: pdillinger

fbshipit-source-id: 08e81d7b5368b08e501cd402ef5583f2650c19fa
2019-12-09 12:21:14 -08:00
Ziyue Yang
7e2f831924 Fix wrong ExtractUserKey usage in BlockBasedTableBuilder::EnterUnbuff… (#6100)
Summary:
BlockBasedTableBuilder uses ExtractUserKey in EnterUnbuffered. This would
cause index filter building error, since user-provided timestamp is supported
by ExtractUserKeyAndStripTimestamp, and it's used in Add. This commit changes
ExtractUserKey to ExtractUserKeyAndStripTimestamp.

A test case is also added by modifying DBBasicTestWithTimestampWithParam_
PutAndGet test in db_basic_test to cover ExtractUserKeyAndStripTimestamp usage
in both kBuffered and kUnbuffered state of BlockBasedTableBuilder.

Before the ExtractUserKeyAndStripTimstamp fix:

```
$ ./db_basic_test --gtest_filter="*PutAndGet*"
Note: Google Test filter = *PutAndGet*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam
[ RUN      ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
[  FAILED  ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false (1177 ms)
[ RUN      ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1
[       OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1056 ms)
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2233 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (2233 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false

 1 FAILED TEST
```

After the ExtractUserKeyAndStripTimstamp fix:

```
$ ./db_basic_test --gtest_filter="*PutAndGet*"
Note: Google Test filter = *PutAndGet*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam
[ RUN      ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0
[       OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0 (1417 ms)
[ RUN      ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1
[       OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1041 ms)
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2458 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (2458 ms total)
[  PASSED  ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6100

Differential Revision: D18769654

Pulled By: riversand963

fbshipit-source-id: 76c2cf2c9a5e0d85db95d98e812e6af0c2a15c6b
2019-12-09 10:57:02 -08:00
Peter Dillinger
3a6d9436e8 Use SpecialSkipListFactory in RecalculateScoreAfterPicking (#6125)
Summary:
Test DBTestUniversalCompaction.RecalculateScoreAfterPicking was
flaky on ARM, so it now uses SpecialSkipListFactory (like other tests)
for predictable memtable flushes.

Fixes https://github.com/facebook/rocksdb/issues/5736
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6125

Test Plan:
while ./db_universal_compaction_test; do :; done # for a
while on ARM and on Intel (both Linux)

Differential Revision: D18864821

Pulled By: pdillinger

fbshipit-source-id: 2f3ca0ea66ce420dcd6d41b0ec12377112a5a79f
2019-12-09 09:23:50 -08:00
sdong
7d79b32618 Break db_stress_tool.cc to a list of source files (#6134)
Summary:
db_stress_tool.cc now is a giant file. In order to main it easier to improve and maintain, break it down to multiple source files.
Most classes are turned into their own files. Separate .h and .cc files are created for gflag definiations. Another .h and .cc files are created for some common functions. Some test execution logic that is only loosely related to class StressTest is moved to db_stress_driver.h and db_stress_driver.cc. All the files are located under db_stress_tool/. The directory name is created as such because if we end it with either stress or test, .gitignore will ignore any file under it and makes it prone to issues in developements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6134

Test Plan: Build under GCC7 with and without LITE on using GNU Make. Build with GCC 4.8. Build with cmake with -DWITH_TOOL=1

Differential Revision: D18876064

fbshipit-source-id: b25d0a7451840f31ac0f5ebb0068785f783fdf7d
2019-12-08 23:51:01 -08:00
Yanqin Jin
fe1147db1c Let DBSecondary close files after catch up (#6114)
Summary:
After secondary instance replays the logs from primary, certain files become
obsolete. The secondary should find these files, evict their table readers from
table cache and close them. If this is not done, the secondary will hold on to
these files and prevent their space from being freed.

Test plan (devserver):
```
$./db_secondary_test --gtest_filter=DBSecondaryTest.SecondaryCloseFiles
$make check
$./db_stress -ops_per_thread=100000 -enable_secondary=true -threads=32 -secondary_catch_up_one_in=10000 -clear_column_family_one_in=1000 -reopen=100
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6114

Differential Revision: D18769998

Pulled By: riversand963

fbshipit-source-id: 5d1f151567247196164e1b79d8402fa2045b9120
2019-12-02 17:45:03 -08:00
David Palm
048472f620 Add missing DataBlock-releated functions to the C-API (#6101)
Summary:
Adds two missing functions to the C-API:

- `rocksdb_block_based_options_set_data_block_index_type`
- `rocksdb_block_based_options_set_data_block_hash_ratio`

This enables users in other languages to enjoy the new(-ish) feature.

The changes here are partially overlapping with [another PR](https://github.com/facebook/rocksdb/pull/5630) but are more focused on the DataBlock indexing options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6101

Differential Revision: D18765639

fbshipit-source-id: 4a8947e71b179f26fa1eb83c267dd47ee64ac3b3
2019-12-02 11:00:09 -08:00
Yanqin Jin
09fcf4fb6b Fix a potential bug scheduling unnecessary threads (#6104)
Summary:
RocksDB should decrement the counter `unscheduled_flushes_` as soon as the bg
thread is scheduled. Before this fix, the counter is decremented only when the
bg thread starts and picks an element from the flush queue. This may cause more
than necessary bg threads to be scheduled. Not a correctness issue, but may
affect flush thread count.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6104

Test Plan:
```
make check
```

Differential Revision: D18735584

Pulled By: riversand963

fbshipit-source-id: d36272d4a08a494aeeab6200a3cff7a3d1a2dc10
2019-11-27 14:48:49 -08:00
sdong
aa1857e2df Support options.max_open_files = -1 with periodic_compaction_seconds (#6090)
Summary:
options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090

Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.

Differential Revision: D18702268

fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb
2019-11-26 21:39:56 -08:00
Peter Dillinger
ca3b6c28c9 Expose and elaborate FilterBuildingContext (#6088)
Summary:
This change enables custom implementations of FilterPolicy to
wrap a variety of NewBloomFilterPolicy and select among them based on
contextual information such as table level and compaction style.

* Moves FilterBuildingContext to public API and elaborates it with more
useful data. (It would be nice to put more general options-like data,
but at the time this object is constructed, we are using internal APIs
ImmutableCFOptions and MutableCFOptions and don't have easy access to
ColumnFamilyOptions that I can tell.)

* Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to
GetBuilderWithContext, because it's now public.

* Plumbs through the table's "level_at_creation" for filter building
context.

* Simplified some tests by adding GetBuilder() to
MockBlockBasedTableTester.

* Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including
sample wrapper class LevelAndStyleCustomFilterPolicy.

* Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits
where it does not reset perf context.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088

Test Plan: make check, valgrind on db_bloom_filter_test

Differential Revision: D18697817

Pulled By: pdillinger

fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4
2019-11-26 18:24:10 -08:00
sdong
77eab5c85a Make default value of options.ttl to be 30 days when it is supported. (#6073)
Summary:
By default options.ttl is disabled. We believe a better default will be 30 days, which means deleted data the database will be removed from SST files slightly after 30 days, for most of the cases.

Make the default UINT64_MAX - 1 to indicate that it is not overridden by users.

Change periodic_compaction_seconds to be UINT64_MAX - 1 to UINT64_MAX  too to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073

Test Plan: Add unit tests for it.

Differential Revision: D18669626

fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8
2019-11-26 10:00:32 -08:00
Sagar Vemuri
669ea77d9f Support ttl in Universal Compaction (#6071)
Summary:
`options.ttl` is now supported in universal compaction, similar to how periodic compactions are implemented in PR https://github.com/facebook/rocksdb/issues/5970 .
Setting `options.ttl` will simply set `options.periodic_compaction_seconds` to execute the periodic compactions code path.
Discarded PR https://github.com/facebook/rocksdb/issues/4749 in lieu of this.

This is a short term work-around/hack of falling back to periodic compactions when ttl is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6071

Test Plan: Added a unit test.

Differential Revision: D18668336

Pulled By: sagar0

fbshipit-source-id: e75f5b81ba949f77ef9eff05e44bb1c757f58612
2019-11-22 22:13:35 -08:00
sdong
d8c28e692a Support options.ttl with options.max_open_files = -1 (#6060)
Summary:
Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead.

Note that, this change will break forward compatibility for release 5.1 and older.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060

Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0.

Differential Revision: D18631623

fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
2019-11-22 21:23:00 -08:00
Little-Wallace
e50b64bdba fix unstable unittest caused by #5958 (#6061)
Summary:
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>

This PR is to fix unstable unit test added by  (https://github.com/facebook/rocksdb/pull/5958).
I set SYNC_POINT in PickCompaction before. If IntraL0Compaction was trigger,  the compact job which compact sst to base level would start instantly. If the compaction thread run faster than unittest main thread, we may observe the number of files in L0 reduce.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6061

Differential Revision: D18642301

fbshipit-source-id: 3e4da2ee963532b6e142336951ea3f47d46df148
2019-11-21 15:24:01 -08:00
Yanqin Jin
0ce0edbe12 Fix a data race between GetColumnFamilyMetaData and MarkFilesBeingCompacted (#6056)
Summary:
Use db mutex to protect the execution of Version::GetColumnFamilyMetaData()
called in DBImpl::GetColumnFamilyMetaData().
Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted()
for access to FileMetaData::being_compacted.
Other than mutex, there are several more alternatives.

- Make FileMetaData::being_compacted an atomic variable. This will make
  FileMetaData non-copy-able.

- Separate being_compacted from FileMetaData. This requires re-organizing data
  structures that are already used in many places.

Test Plan (dev server):
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056

Differential Revision: D18620488

Pulled By: riversand963

fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b
2019-11-20 16:36:29 -08:00
sdong
27ec3b3466 Sanitize input in DB::MultiGet() API (#6054)
Summary:
The new DB::MultiGet() doesn't validate input for num_keys > 1 and GCC-9 complains about it. Fix it by directly return when num_keys == 0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6054

Test Plan: Build with GCC-9 and see it passes.

Differential Revision: D18608958

fbshipit-source-id: 1c279aff3c7fe6e9d5a6d085ed02550ecea4fdb2
2019-11-20 10:38:01 -08:00
Peter Dillinger
0306e01233 Fixes for g++ 4.9.2 compatibility (#6053)
Summary:
Taken from merryChris in https://github.com/facebook/rocksdb/issues/6043

Stackoverflow ref on {{}} vs. {}:
https://stackoverflow.com/questions/26947704/implicit-conversion-failure-from-initializer-list

Note to reader: .clear() does not empty out an ostringstream, but .str("")
suffices because we don't have to worry about clearing error flags.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6053

Test Plan: make check, manual run of filter_bench

Differential Revision: D18602259

Pulled By: pdillinger

fbshipit-source-id: f6190f83b8eab4e80e7c107348839edabe727841
2019-11-19 15:43:37 -08:00
Little-Wallace
ec3e3c3e02 Fix corruption with intra-L0 on ingested files (#5958)
Summary:
## Problem Description

Our process was abort when it call `CheckConsistency`. And the information in  `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ".  Here are the causes of the accident I investigated.

* RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested.
* When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number.
* `CheckConsistency`  determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal.
* If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst,  the `smallest_seqno` of this new file will be smaller than his `largest_seqno`.
    * If more than one ingested file was ingested before memtable schedule flush,  and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable.  So `CheckConsistency` will return a `Corruption`.
    * If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start,  but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2).  **But there was still some data in s1  written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1.** Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files:
    > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno  < s1.largest_seqno

So I skip pick sst file which was ingested in function `FindIntraL0Compaction `

## Reason

Here is my bug report: https://github.com/facebook/rocksdb/issues/5913

There are two situations that can cause the check to fail.

### First situation:
- First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest.

- If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst.

- Then some data was put into memtable, and memtable was flushed to L0. We called this sst B.
- RocksDB check consistency , and find the `smallest_seqno` of B is  less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested.

### Secondary situaion

- First we have flushed many sst in L0,  we call them [s1, s2, s3].

- There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1.  And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1.

- m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5.

- [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction.  We call it s6.
  - compacted 4@0 files to L0
- When s6 is added into manifest,  the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5.  But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5.
   - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958

Differential Revision: D18601316

fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
2019-11-19 15:09:11 -08:00
Levi Tamasi
019eb1f402 Disable blob iterator test with max_sequential_skip_in_iterations==0 in LITE mode (#6052)
Summary:
The SetOptions API used by the test is not supported in LITE mode,
so we should skip the new chunk in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6052

Test Plan: Ran the unit tests both in regular and LITE mode.

Differential Revision: D18601763

Pulled By: ltamasi

fbshipit-source-id: 883d6882771e0fb4aae72bb77ba4e63d9febec04
2019-11-19 15:02:41 -08:00
tabokie
20b48c6478 Fix blob context when db_iter uses seek (#6051)
Summary:
Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex.
Also patch existing test for the mentioned code path.
Signed-off-by: tabokie <xy.tao@outlook.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051

Differential Revision: D18596274

Pulled By: ltamasi

fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278
2019-11-19 11:39:02 -08:00
anand76
38cc611297 Fix test failure in LITE mode (#6050)
Summary:
GetSupportedCompressions() is not available in LITE build, so check and use Snappy compression in db_basic_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6050

Test Plan:
make LITE=1 check
make check

Differential Revision: D18588114

Pulled By: anand1976

fbshipit-source-id: a193de58c44f91bcc237107f25dbc1b9458eef3d
2019-11-19 10:13:24 -08:00
anand76
5b9233bfe8 Fix a test failure on systems that don't have Snappy compression libraries (#6038)
Summary:
The ParallelIO/DBBasicTestWithParallelIO.MultiGet/11 test fails if Snappy compression library is not installed, since RocksDB defaults to Snappy if none is specified. So dynamically determine the supported compression types and pick the first one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6038

Differential Revision: D18532370

Pulled By: anand1976

fbshipit-source-id: a0a735114d1f8892ea09f7c4af8688d7bcc5b075
2019-11-18 09:37:18 -08:00
Little-Wallace
f65ec09ef8 Fix IngestExternalFile's bug with two_write_queue (#5976)
Summary:
When two_write_queue enable, IngestExternalFile performs EnterUnbatched on both write queues. SwitchMemtable also EnterUnbatched on 2nd write queue when this option is enabled. When the call stack includes IngestExternalFile -> FlushMemTable -> SwitchMemtable, this results into a deadlock.
The implemented solution is to pass on the existing writes_stopped argument in FlushMemTable to skip EnterUnbatched in SwitchMemtable.
Fixes https://github.com/facebook/rocksdb/issues/5974
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5976

Differential Revision: D18535943

Pulled By: maysamyabandeh

fbshipit-source-id: a4f9d4964c10d4a7ca06b1e0102ca2ec395512bc
2019-11-15 14:00:37 -08:00
Peter Dillinger
00d58a370e Abandon use of folly::Optional (#6036)
Summary:
Had complications with LITE build and valgrind test.
Reverts/fixes small parts of PR https://github.com/facebook/rocksdb/issues/6007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6036

Test Plan:
make LITE=1 all check
and
ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 make -j24 db_bloom_filter_test && ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 ./db_bloom_filter_test

Differential Revision: D18512238

Pulled By: pdillinger

fbshipit-source-id: 37213cf0d309edf11c483fb4b2fb6c02c2cf2b28
2019-11-14 14:04:15 -08:00
Peter Dillinger
f059c7d9b9 New Bloom filter implementation for full and partitioned filters (#6007)
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.

Speed

The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.

Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):

$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
  Single filter net ns/op: 26.2825
  Random filter net ns/op: 150.459
    Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
  Single filter net ns/op: 63.2978
  Random filter net ns/op: 188.038
    Average FP rate %: 1.13823

Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.

The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.

Accuracy

The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.

Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)

Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120

Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.

Compatibility

Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007

Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).

Differential Revision: D18294749

Pulled By: pdillinger

fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
2019-11-13 16:44:01 -08:00
sdong
bb23bfe63c Fix a regression bug on total order seek with prefix enabled and range delete (#6028)
Summary:
Recent change https://github.com/facebook/rocksdb/pull/5861 mistakely use "prefix_extractor_ != nullptr" as the condition to determine whehter prefix bloom filter isused. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakely applied to total order seek, and introduces a bug in following corner case:
Because of RangeDelete(), a file's largest key is extended. Seek key falls into the range deleted file, so level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator to the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added.
Fix the bug by setting prefix_extractor to be null if total order seek is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028

Test Plan: Add a unit test which fails without the fix.

Differential Revision: D18479063

fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3
2019-11-13 10:11:34 -08:00
anand76
6c7b1a0cc7 Batched MultiGet API for multiple column families (#5816)
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.

As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816

Test Plan:
make check
make asan_check
asan_crash_test

Differential Revision: D18408676

Pulled By: anand1976

fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
2019-11-12 13:52:55 -08:00
anand76
03ce7fb292 Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014)
Summary:
The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014

Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after.

Differential Revision: D18412753

Pulled By: anand1976

fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72
2019-11-11 16:59:15 -08:00
蔡渠棠
f29e6b3be2 bugfix: MemTableList::RemoveOldMemTables invalid iterator after remov… (#6013)
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/6012.

I found that it may be caused by the following codes in function _RemoveOldMemTables()_ in **db/memtable_list.cc**  :
```
  for (auto it = memlist.rbegin(); it != memlist.rend(); ++it) {
    MemTable* mem = *it;
    if (mem->GetNextLogNumber() > log_number) {
      break;
    }
    current_->Remove(mem, to_delete);
```

The iterator **it** turns invalid after `current_->Remove(mem, to_delete);`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6013

Test Plan:
```
make check
```

Differential Revision: D18401107

Pulled By: riversand963

fbshipit-source-id: bf0da3b868ed70f7aff24cf7b3e2049c0c5c7a4e
2019-11-11 15:57:38 -08:00
Sagar Vemuri
c17384fea4 Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992)
Summary:
When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, the ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for the newly created files during compaction. The creation time of new files was set to a max of all compaction-input-files-creation-times which essentially resulted in resetting the ttl as the key range moves across levels. This behavior is now fixed by changing the `creation_time` to be based on minimum of all compaction-input-files-creation-times; this will cause cascading compactions across levels for the ttl-expired data to move to the bottom level, resulting in getting rid of tombstones/deleted-data faster.

This will help start cascading compactions to move the expired key range to the bottom-most level faster.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992

Test Plan: `make check`

Differential Revision: D18257883

Pulled By: sagar0

fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc
2019-11-11 14:09:01 -08:00
Yi Wu
72de842ac9 Fix DBFlushTest::FireOnFlushCompletedAfterCommittedResult hang (#6018)
Summary:
The test would fire two flushes to let them run in parallel. Previously it wait for the first job to be scheduled before firing the second. It is possible the job is not started before the second job being scheduled, making the two job combine into one. Change to wait for the first job being started.

Fixes https://github.com/facebook/rocksdb/issues/6017
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6018

Test Plan:
```
while ./db_flush_test --gtest_filter=*FireOnFlushCompletedAfterCommittedResult*; do :; done
```
and let it run for a while.

Signed-off-by: Yi Wu <yiwu@pingcap.com>

Differential Revision: D18405576

Pulled By: riversand963

fbshipit-source-id: 6ebb6262e033d5dc2ef81cb3eb410b314f2de4c9
2019-11-08 13:47:29 -08:00
Levi Tamasi
f80050fa8f Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011)
Summary:
The patch exposes the file numbers of the SSTs as well as the oldest blob
files they contain a reference to through the GetColumnFamilyMetaData/
GetLiveFilesMetaData interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011

Test Plan:
Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest
wasn't really testing anything because the generated memtables were never
flushed, so the metadata structure was essentially empty.)

Differential Revision: D18361697

Pulled By: ltamasi

fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9
2019-11-07 14:04:16 -08:00
anand76
9836a1fa33 Fix MultiGet crash when no_block_cache is set (#5991)
Summary:
This PR fixes https://github.com/facebook/rocksdb/issues/5975. In ```BlockBasedTable::RetrieveMultipleBlocks()```, we were calling ```MaybeReadBlocksAndLoadToCache()```, which is a no-op if neither uncompressed nor compressed block cache are configured.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5991

Test Plan:
1. Add unit tests that fail with the old code and pass with the new
2. make check and asan_check

Cc spetrunia

Differential Revision: D18272744

Pulled By: anand1976

fbshipit-source-id: e62fa6090d1a6adf84fcd51dfd6859b03c6aebfe
2019-11-07 12:02:21 -08:00
sdong
982a7532a7 Add two test cases for single sorted universal periodic compaction (#6002)
Summary:
It's useful to add test coverage for universal compaction's periodic compaction. Add two tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6002

Test Plan: Run the two tests

Differential Revision: D18363544

fbshipit-source-id: bbd04b54057315f64f959709006412db1f76d170
2019-11-07 11:14:14 -08:00
sdong
f0b469e563 Turn on periodic compaction in universal by default if compaction filter is used. (#5994)
Summary:
Recently, periodic compaction got turned on by default for leveled compaction is compaction filter is used. Since periodic compaction is now supported in universal compaction too, we do the same default for universal now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5994

Test Plan: Add a new unit test.

Differential Revision: D18363744

fbshipit-source-id: 5093288ce990ee3cab0e44ffd92d8489fbcd6a48
2019-11-07 10:58:10 -08:00
Sergei Petrunia
230bcae7b6 Add a limited support for iteration bounds into BaseDeltaIterator (#5403)
Summary:
For MDEV-19670: MyRocks: key lookups into deleted data are very slow

BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_
walk above the iterate_upper_bound if base_iterator_ is not valid
anymore.

== Rationale ==
The most straightforward way would be to make the delta_iterator
(which is a rocksdb::WBWIIterator) to support iterator bounds. But
checking for bounds has an extra CPU overhead.

So we put the check into BaseDeltaIterator, and only make it when
base_iterator_ is not valid.

(note: We could take it even further, and move the check a few lines
down, and only check iterator bounds ourselves if base_iterator_ is
not valid AND delta_iterator_ hit a tombstone).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403

Differential Revision: D15863092

Pulled By: maysamyabandeh

fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b
2019-11-05 11:39:36 -08:00
sdong
0d91a981e9 Fix assertion in universal compaction periodic compaction (#6000)
Summary:
We recently added periodic compaction to universal compaction. An old assertion that we can't onlyl compact the last sorted run triggered. However, with periodic compaction, it is possible that we only compact the last sorted run, so the assertion now became stricter than needed. Relaxing this assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6000

Test Plan: This should be a low risk change. Will observe whether stress test will pass after it.

Differential Revision: D18285396

fbshipit-source-id: 9a6863debdf104c40a7f6c46ab62d84cdf5d8592
2019-11-01 18:33:12 -07:00
Levi Tamasi
a44670e71b Use aggregate initialization for FlushJobInfo/CompactionJobInfo (#5997)
Summary:
FlushJobInfo and CompactionJobInfo are aggregates; we should use the
aggregate initialization syntax to ensure members (specifically those of
built-in types) are value-initialized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5997

Test Plan: make check

Differential Revision: D18273398

Pulled By: ltamasi

fbshipit-source-id: 35b1a63ad9ca01605d288329858af72fffd7f392
2019-11-01 11:46:19 -07:00
Peter Dillinger
685e895652 Prepare filter tests for more implementations (#5967)
Summary:
This change sets up for alternate implementations underlying
BloomFilterPolicy:

* Refactor BloomFilterPolicy and expose in internal .h file so that it's easy to iterate over / select implementations for testing, regardless of what the best public interface will look like. Most notably updated db_bloom_filter_test to use this.
* Hide FullFilterBitsBuilder from unit tests (alternate derived classes planned); expose the part important for testing (CalculateSpace), as abstract class BuiltinFilterBitsBuilder. (Also cleaned up internally exposed interface to CalculateSpace.)
* Rename BloomTest -> BlockBasedBloomTest for clarity (despite ongoing confusion between block-based table and block-based filter)
* Assert that block-based filter construction interface is only used on BloomFilterPolicy appropriately constructed. (A couple of tests updated to add ", true".)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5967

Test Plan: make check

Differential Revision: D18138704

Pulled By: pdillinger

fbshipit-source-id: 55ef9273423b0696309e251f50b8c1b5e9ec7597
2019-10-31 14:12:33 -07:00
sdong
aa6f7d0995 Support periodic compaction in universal compaction (#5970)
Summary:
Previously, periodic compaction is not supported in universal compaction. Add the support using following approach: if any file is marked as qualified for periodid compaction, trigger a full compaction. If a full compaction is prevented by files being compacted, try to compact the higher levels than files currently being compacted. If in this way we can only compact the last sorted run and none of the file to be compacted qualifies for periodic compaction, skip the compact. This is to prevent the same single level compaction from being executed again and again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5970

Test Plan: Add several test cases.

Differential Revision: D18147097

fbshipit-source-id: 8ecc308154d9aca96fb192c51fbceba3947550c1
2019-10-31 11:31:37 -07:00
sdong
2a9e5caffe Make FIFO compaction take default 30 days TTL by default (#5987)
Summary:
Right now, by default FIFO compaction has no TTL. We believe that a default TTL of 30 days will be better. With this patch, the default will be changed to 30 days. Default of Options.periodic_compaction_seconds will mean the same as options.ttl. If Options.ttl and Options.periodic_compaction_seconds left default, a default 30 days TTL will be used. If both options are set, the stricter value of the two will be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5987

Test Plan: Add an option sanitize test to cover the case.

Differential Revision: D18237935

fbshipit-source-id: a6dcea1f36c3849e13c0a69e413d73ad8eab58c9
2019-10-31 11:13:12 -07:00
Maysam Yabandeh
dccaf9f03c Turn compaction asserts to runtime check (#5935)
Summary:
Compaction iterator has many assert statements that are active only during test runs. Some rare bugs would show up only at runtime could violate the assert condition but go unnoticed since assert statements are not compiled in release mode. Turning the assert statements to runtime check sone pors and cons:
Pros:
- A bug that would result into incorrect data would be detected early before the incorrect data is written to the disk.

Cons:
- Runtime overhead: which should be negligible since compaction cpu is the minority in the overall cpu usage
- The assert statements might already being violated at runtime, and turning them to runtime failure might result into reliability issues.

The patch takes a conservative step in this direction by logging the assert violations at runtime. If we see any violation reported in logs, we investigate. Otherwise, we can go ahead turning them to runtime error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5935

Differential Revision: D18229697

Pulled By: maysamyabandeh

fbshipit-source-id: f1890eca80ccd7cca29737f1825badb9aa8038a8
2019-10-30 13:48:38 -07:00
sdong
a3960fc875 Move pipeline write waiting logic into WaitForPendingWrites() (#5716)
Summary:
In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush().
This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716

Test Plan: Run all tests with ASAN and TSAN.

Differential Revision: D18217658

fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3
2019-10-29 18:16:36 -07:00
Sagar Vemuri
4c9aa30a62 Auto enable Periodic Compactions if a Compaction Filter is used (#5865)
Summary:
- Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
- The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions.

Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage.

`periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.

This is done only for Level Compaction style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865

Test Plan:
- Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
- `COMPILE_WITH_ASAN=1 make check`

Differential Revision: D17659180

Pulled By: sagar0

fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
2019-10-29 15:05:51 -07:00
Vijay Nadimpalli
1075c376ef Fix for lite build (#5971)
Summary:
Fix for lite build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5971

Test Plan: make J=1 -j64  LITE=1 all check

Differential Revision: D18148306

Pulled By: vjnadimpalli

fbshipit-source-id: 5b9a3edc3e73e054fee6b96e6f6e583cecc898f3
2019-10-25 18:22:24 -07:00
Vijay Nadimpalli
ec880436c1 API to get file_creation_time of the oldest file in the DB (#5948)
Summary:
Adding a new API to db.h that allows users to get file_creation_time of the oldest file in the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5948

Test Plan: Added unit test.

Differential Revision: D18056151

Pulled By: vjnadimpalli

fbshipit-source-id: 448ec9d34cb6772e1e5a62db399ace00dcbfbb5d
2019-10-25 11:53:57 -07:00
Yanqin Jin
2309fd63bf Update column families' log number altogether after flushing during recovery (#5856)
Summary:
A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it.
The bug can surface in the following way.
1. Database has multiple column families.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushing between two column families.

Then DB will fail to be opened again with error "SST file is ahead of WALs".
Solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing to all these version edits succeed does RocksDB (atomically) points the CURRENT file to the new MANIFEST.

Test plan (on devserver):
```
$make all && make check
```
Specifically
```
$make db_test2
$./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF
```
Also checked for compatibility as follows.
Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory.
Then checkout 5.4, build ldb, and dump the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856

Differential Revision: D17620818

Pulled By: riversand963

fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039
2019-10-24 18:29:30 -07:00
Peter Dillinger
ca7ccbe2ea Misc hashing updates / upgrades (#5909)
Summary:
- Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09).
- Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions.
- Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range.
- Use preview version of XXH3 instead of MurmurHash64A for NPHash64
-- Had to update cache_test to increase probability of passing for any given hash function.
- Use fastrange64 instead of % with uses of NPHash64
-- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision.
- Set default seed for NPHash64 because specifying a seed rarely makes sense for it.
- Removed unnecessary include xxhash.h in a popular .h file
- Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated.

Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909

Differential Revision: D18125196

Pulled By: pdillinger

fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d
2019-10-24 17:16:46 -07:00
Levi Tamasi
f7e7b34ebe Propagate SST and blob file numbers through the EventListener interface (#5962)
Summary:
This patch adds a number of new information elements to the FlushJobInfo and
CompactionJobInfo structures that are passed to EventListeners via the
OnFlush{Begin, Completed} and OnCompaction{Begin, Completed} callbacks.
Namely, for flushes, the file numbers of the new SST and the oldest blob file it
references are propagated. For compactions, the new pieces of information are
the file number, level, and the oldest blob file referenced by each compaction
input and output file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5962

Test Plan:
Extended the EventListener unit tests with logic that checks that these information
elements are correctly propagated from the corresponding FileMetaData.

Differential Revision: D18095568

Pulled By: ltamasi

fbshipit-source-id: 6874359a6aadb53366b5fe87adcb2f9bd27a0a56
2019-10-24 14:44:15 -07:00
Dan Lambright
2509531123 Add test showing range tombstones can create excessively large compactions (#5956)
Summary:
For more information on the original problem see this [link](https://github.com/facebook/rocksdb/issues/3977).

This change adds two new tests. They are identical other than one uses range tombstones and the other does not. Each test generates sub files at L2 which overlap with keys L3. The test that uses range tombstones generates a single file at L2. This single file will generate a very large range overlap that will in turn create excessively large compaction.

1: T001 - T005
2:  000 -  005

In contrast, the test that uses key ranges generates 3 files at L2. As a single file is compacted at a time, those 3 files will generate less work per compaction iteration.

1:  001 - 002
1:  003 - 004
1:  005
2:  000 - 005
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5956

Differential Revision: D18071631

Pulled By: dlambrig

fbshipit-source-id: 12abae75fb3e0b022d228c6371698aa5e53385df
2019-10-24 11:08:44 -07:00
Peter Dillinger
27a124571f Fix memory leak on error opening PlainTable (#5951)
Summary:
Several error paths in opening of a plain table would leak memory. PR https://github.com/facebook/rocksdb/issues/5940 opened the leak to one more error path, which happens to have been (mistakenly) exercised by CuckooTableDBTest.AdaptiveTable. That test has been fixed, and the exercising of
plain table error cases (more than before) has been added as BadOptions1 and BadOptions2
to PlainTableDBTest. This effectively moved the memory leak to plain_table_db_test.

Also here is a cheap fix for the memory leak, without (yet?) changing the signature of
ReadTableProperties. This fixes ASAN on unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5951

Test Plan: make COMPILE_WITH_ASAN=1 check

Differential Revision: D18051940

Pulled By: pdillinger

fbshipit-source-id: e2952930c09a2b46c4f1ff09818c5090426929de
2019-10-21 16:53:06 -07:00
sdong
a0cd920026 LevelIterator to avoid gap after prefix bloom filters out a file (#5861)
Summary:
Right now, when LevelIterator::Seek() is called, when a file is filtered out by prefix bloom filter, the position is put to the beginning of the next file. This is a confusing internal interface because many keys in the levels are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidate the whole iterator if the prefix doesn't match.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861

Test Plan: Add a new unit test to validate the behavior; run all exsiting tests; run crash_test

Differential Revision: D17918213

fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac
2019-10-21 11:40:57 -07:00
sdong
30e2dc02f0 Fix VerifyChecksum readahead with mmap mode (#5945)
Summary:
A recent change introduced readahead inside VerifyChecksum(). However it is not compatible with mmap mode and generated wrong checksum verification failure. Fix it by not enabling readahead in mmap
 mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5945

Test Plan: Add a unit test that used to fail.

Differential Revision: D18021443

fbshipit-source-id: 6f2eb600f81b26edb02222563a4006869d576bff
2019-10-21 11:38:30 -07:00
Peter Dillinger
fe464bca5c Fix PlainTableReader not to crash sst_dump (#5940)
Summary:
Plain table SSTs could crash sst_dump because of a bug in
PlainTableReader that can leave table_properties_ as null. Even if it
was intended not to keep the table properties in some cases, they were
leaked on the offending code path.

Steps to reproduce:

    $ db_bench --benchmarks=fillrandom --num=2000000 --use_plain_table --prefix-size=12
    $ sst_dump --file=0000xx.sst --show_properties
    from [] to []
    Process /dev/shm/dbbench/000014.sst
    Sst file format: plain table
    Raw user collected properties
    ------------------------------
    Segmentation fault (core dumped)

Also added missing unit testing of plain table full_scan_mode, and
an assertion in NewIterator to check for regression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5940

Test Plan: new unit test, manual, make check

Differential Revision: D18018145

Pulled By: pdillinger

fbshipit-source-id: 4310c755e824c4cd6f3f86a3abc20dfa417c5e07
2019-10-18 14:44:42 -07:00
Levi Tamasi
fdc1cb43a6 Support decoding blob indexes in sst_dump (#5926)
Summary:
The patch adds a new command line parameter --decode_blob_index to sst_dump.
If this switch is specified, sst_dump prints blob indexes in a human readable format,
printing the blob file number, offset, size, and expiration (if applicable) for blob
references, and the blob value (and expiration) for inlined blobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5926

Test Plan:
Used db_bench's BlobDB mode to generate SST files containing blob references with
and without expiration, as well as inlined blobs with and without expiration (note: the
latter are stored as plain values), and confirmed sst_dump correctly prints all four types
of records.

Differential Revision: D17939077

Pulled By: ltamasi

fbshipit-source-id: edc5f58fee94ba35f6699c6a042d5758f5b3963d
2019-10-17 19:36:54 -07:00
Yi Wu
1f9d7c0f54 Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908)
Summary:
When there are concurrent flush job on the same CF, `OnFlushCompleted` can be called before the flush result being install to LSM. Fixing the issue by passing `FlushJobInfo` through `MemTable`, and the thread who commit the flush result can fetch the `FlushJobInfo` and fire `OnFlushCompleted` on behave of the thread actually writing the SST.

Fix https://github.com/facebook/rocksdb/issues/5892
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5908

Test Plan: Add new test. The test will fail without the fix.

Differential Revision: D17916144

Pulled By: riversand963

fbshipit-source-id: e18df67d9533b5baee52ae3605026cdeb05cbe10
2019-10-16 10:40:23 -07:00
Yanqin Jin
5ef27dea33 Fix clang analyzer error (#5924)
Summary:
Without this PR, clang analyzer complains.
```
$USE_CLANG=1 make analyze
db/compaction/compaction_job_test.cc:161:20: warning: The left operand of '==' is a garbage value
      if (key.type == kTypeBlobIndex) {
                ~~~~~~~~ ^
                1 warning generated.
```

Test Plan (on devserver)
```
$USE_CLANG=1 make analyze
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5924

Differential Revision: D17923226

Pulled By: riversand963

fbshipit-source-id: 9d1eb769b5e0de7cb3d89dc90d1cfa895db7fdc8
2019-10-14 22:14:24 -07:00
Levi Tamasi
5f025ea832 BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903)
Summary:
This is groundwork for adding garbage collection support to BlobDB. The
patch adds logic that keeps track of the oldest blob file referred to by
each SST file. The oldest blob file is identified during flush/
compaction (similarly to how the range of keys covered by the SST is
identified), and persisted in the manifest as a custom field of the new
file edit record. Blob indexes with TTL are ignored for the purposes of
identifying the oldest blob file (since such blob files are cleaned up by the
TTL logic in BlobDB).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903

Test Plan:
Added new unit tests; also ran db_bench in BlobDB mode, inspected the
manifest using ldb, and confirmed (by scanning the SST files using
sst_dump) that the value of the oldest blob file number field matches
the contents of the file for each SST.

Differential Revision: D17859997

Pulled By: ltamasi

fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90
2019-10-14 15:21:01 -07:00
Levi Tamasi
a59dc843a4 Move blob_index.h to db/ (#5919)
Summary:
Extracted from PR https://github.com/facebook/rocksdb/issues/5903 for technical reasons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5919

Test Plan: make check

Differential Revision: D17910132

Pulled By: ltamasi

fbshipit-source-id: 6ecbb8d6e84b2a1d1f28575ad48ac3cc65833eb5
2019-10-14 12:54:05 -07:00
Yanqin Jin
6febfd8451 OnTableFileCreationCompleted use "(nil)" for empty file during flush (#5905)
Summary:
Compaction can call OnTableFileCreationCompleted(). If file is empty, "(nil)"
is used as the file name.
Do the same for flush.

Test plan (dev server):
```
make all
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5905

Differential Revision: D17883285

Pulled By: riversand963

fbshipit-source-id: 6565884adbb00e8023d88b17dfb3b6eb92220b59
2019-10-14 09:54:10 -07:00
Andrew Kryczka
b00761eea6 Fix block cache ID uniqueness for Windows builds (#5844)
Summary:
Since we do not evict a file's blocks from block cache before that file
is deleted, we require a file's cache ID prefix is both unique and
non-reusable. However, the Windows functionality we were relying on only
guaranteed uniqueness. That meant a newly created file could be assigned
the same cache ID prefix as a deleted file. If the newly created file
had block offsets matching the deleted file, full cache keys could be
exactly the same, resulting in obsolete data blocks returned from cache
when trying to read from the new file.

We noticed this when running on FAT32 where compaction was writing out
of order keys due to reading obsolete blocks from its input files. The
functionality is documented as behaving the same on NTFS, although I
wasn't able to repro it there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844

Test Plan:
we had a reliable repro of out-of-order keys on FAT32 that
was fixed by this change

Differential Revision: D17752442

fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00
2019-10-11 18:19:31 -07:00
Vijay Nadimpalli
4c49e38f15 MultiGet batching in memtable (#5818)
Summary:
RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818

Test Plan:
Existing tests

Performance Test:
Ran the below command which fills up the memtable and makes sure there are no flushes and then call multiget. Ran it on master and on the new change and see atleast 1% performance improvement across all the test runs I did. Sometimes the improvement was upto 5%.

TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10

Differential Revision: D17578869

Pulled By: vjnadimpalli

fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192
2019-10-10 09:39:39 -07:00
Tomas Kolda
e3a93c9ee1 Fix crash when background task fails (#5879)
Summary:
Fixing crash. Full story in issue: https://github.com/facebook/rocksdb/issues/5878
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5879

Differential Revision: D17812299

Pulled By: anand1976

fbshipit-source-id: 14e5a4fc502ade974583da9692d0ed6e5014613a
2019-10-08 14:20:01 -07:00
Yanqin Jin
457bcfde02 Let TestEnv and FaultInjectEnv use Env of choice (#5886)
Summary:
Instead of hard coding Env::Default in TestEnv and a few other places, use the
DBTestBase::env_ that has been deduced from the constructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5886

Test Plan:
```
make check
```

Differential Revision: D17773029

Pulled By: riversand963

fbshipit-source-id: 7ce4e5175a487e9d281ea2c3aae3c41bffd44629
2019-10-07 17:48:50 -07:00
jsteemann
da3b2840cb save a few redundant container lookups (#5875)
Summary:
This PR eliminates repeated lookups in associative or ordered containers when a single lookup suffices.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5875

Differential Revision: D17753172

Pulled By: anand1976

fbshipit-source-id: 796b02b760082521d8c42a1cb65a76bf0e6c1b8e
2019-10-07 12:28:09 -07:00
anand76
19a97dd139 Fix data block upper bound checking for iterator reseek case (#5883)
Summary:
When an iterator reseek happens with the user specifying a new iterate_upper_bound in ReadOptions, and the new seek position is at the end of the same data block, the Seek() ends up using a stale value of data_block_within_upper_bound_ and may return incorrect results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5883

Test Plan: Added a new test case DBIteratorTest.IterReseekNewUpperBound. Verified that it failed due to the assertion failure without the fix, and passes with the fix.

Differential Revision: D17752740

Pulled By: anand1976

fbshipit-source-id: f9b635ff5d6aeb0e1bef102cf8b2f900efd378e3
2019-10-03 20:53:29 -07:00
sdong
846e05005d Revert "Merging iterator to avoid child iterator reseek for some cases (#5286)" (#5871)
Summary:
This reverts commit 9fad3e21eb.

Iterator verification in stress tests sometimes fail for assertion
table/block_based/block_based_table_reader.cc:2973: void rocksdb::BlockBasedTableIterator<TBlockIter, TValue>::FindBlockForward() [with TBlockIter = rocksdb::DataBlockIter; TValue = rocksdb::Slice]: Assertion `!next_block_is_out_of_bound || user_comparator_.Compare(*read_options_.iterate_upper_bound, index_iter_->user_key()) <= 0' failed.

It is likely to be linked to https://github.com/facebook/rocksdb/pull/5286 together with https://github.com/facebook/rocksdb/pull/5468 as the former PR makes some child iterator's seek being avoided, so that upper bound condition fails to be updated there. Strictly speaking, the former PR was merged before the latter one, but the latter one feels a more important improvement so I choose to revert the former one for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5871

Differential Revision: D17689196

fbshipit-source-id: 4ded5be68f67bee2782d31a29cb72ea68f59dd8c
2019-10-01 11:22:41 -07:00
Yanqin Jin
643df920d8 Explicitly declare atomic flush incompatible with pipelined write (#5860)
Summary:
Atomic flush is incompatible with pipelined write. At least now.
If pipelined write is enabled, a thread performing write can exit the write
thread and start inserting into memtables. Consequently a thread performing
flush will enter write thread and race with memtable insertion by the former.
This will cause undefined result in terms of data persistence.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5860

Test Plan:
```
$make all && make check
```

Differential Revision: D17638944

Pulled By: riversand963

fbshipit-source-id: abc578dc49a5dbe41bc5adcecf448f8e042a6d49
2019-09-27 17:17:37 -07:00
sdong
76e951dbb1 Add a unit test to reproduce a corruption bug (#5851)
Summary:
This is a bug occaionally shows up in crash test, and this unit test is to reproduce it. The bug is following:
1. Database has multiple CFs.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushes between two CFs.
The DB will fail to be opened again with error "SST file is ahead of WALs"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5851

Test Plan: Run the test itself.

Differential Revision: D17614721

fbshipit-source-id: 1b0abce49b203a76a039e38e76bc940429975f20
2019-09-26 16:18:42 -07:00
sdong
e8263dbdaa Apply formatter to recent 200+ commits. (#5830)
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830

Test Plan: Run all existing tests.

Differential Revision: D17488031

fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
2019-09-20 12:04:26 -07:00
Vijay Nadimpalli
a5fa8735e9 Code comment for Version Edit (#5829)
Summary:
Added comment for Version Edit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5829

Test Plan: Run existing tests

Differential Revision: D17486229

Pulled By: vjnadimpalli

fbshipit-source-id: b4b31104fadd667356b64bd2dc409b3376ee46ca
2019-09-20 10:07:42 -07:00
sdong
c06b54d0c6 Apply formatter on recent 45 commits. (#5827)
Summary:
Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827

Test Plan: Run all existing tests.

Differential Revision: D17483727

fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f
2019-09-19 12:34:17 -07:00
Maysam Yabandeh
6ec6a4a9a4 Remove snap_refresh_nanos option (#5826)
Summary:
The snap_refresh_nanos option didn't bring much benefit. Remove the feature to simplify the code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5826

Differential Revision: D17467147

Pulled By: maysamyabandeh

fbshipit-source-id: 4f950b046990d0d1292d7fc04c2ccafaf751c7f0
2019-09-18 20:26:04 -07:00
Yanqin Jin
a9c5e8e944 Refactor deletefile_test.cc (#5822)
Summary:
Make DeleteFileTest inherit DBTestBase to avoid code duplication.

Test Plan (on devserver)
```
$make deletefile_test
$./deletefile_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5822

Differential Revision: D17456750

Pulled By: riversand963

fbshipit-source-id: 224e97967da7b98838a98981cd5095d3230a814f
2019-09-18 16:58:21 -07:00
Yanqin Jin
6a279037cf Refactor ObsoleteFilesTest to inherit from DBTestBase (#5820)
Summary:
Make class ObsoleteFilesTest inherit from DBTestBase.

Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make obsolete_files_test
$./obsolete_files_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5820

Differential Revision: D17452348

Pulled By: riversand963

fbshipit-source-id: b09f4581a18022ca2bfd79f2836c0bf7083f5f25
2019-09-18 11:52:17 -07:00
Yanqin Jin
6d072f2a03 Move WAL-closing loop out of original loop (#5804)
Summary:
Originally the loop of closing WAL in PurgeObsoleteFiles resides inside a loop
iterating over the candidate files. It should be moved out.

Test plan (devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5804

Differential Revision: D17374350

Pulled By: riversand963

fbshipit-source-id: 2bee7343fc0481d9a385a87c7676491522285c96
2019-09-17 17:17:19 -07:00
sdong
43a340bebe Merging iterator to disble reseek optimization in prefix seek (#5815)
Summary:
We are seeing a bug of wrong results with merging iterator's reseek avoidence feature and prefix extractor. Disable this optimization for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5815

Test Plan: Validated the same MyRocks case was fixed; run all existing tests.

Differential Revision: D17430776

fbshipit-source-id: aef664277ba0ab8a2e68331ff0db6ae682535371
2019-09-17 17:10:29 -07:00
Yi Wu
a68d814570 fast look up purge_queue (#5796)
Summary:
purge_queue_ maybe contains thousands sst files, for example manual compact a range. If full scan is triggered at the same time and the total sst files number is large, RocksDB will be blocked at https://github.com/facebook/rocksdb/blob/master/db/db_impl_files.cc#L150 for several seconds. In our environment we have 140,000 sst files and the manual compaction delete about 1000 sst files, it blocked about 2 minutes.

Commandeering https://github.com/facebook/rocksdb/issues/5290.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5796

Differential Revision: D17357775

Pulled By: riversand963

fbshipit-source-id: 20eacca917355b8de975ccc7b1c9a3e7bd5b201a
2019-09-17 16:47:55 -07:00
sdong
6287f0d73b Improve readability of DBIter's two seek functions (#5794)
Summary:
Doing some code reordering in DBIter::Seek() and DBIter::SeekForPrev().
The logic largely remains the same, except slight difference when handling some stats when valid_ = false, where they are not supposed to be used anyway.
Also remove prefix_start_key_, which sometimes point a part of seek target, some times prefix_start_buf_, which is confusing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5794

Test Plan: Run all tests.

Differential Revision: D17375257

fbshipit-source-id: 7339a23898cecd3a8475bf72340fcd6f82b933c5
2019-09-16 21:05:07 -07:00
andrew
622683000c Allow users to stop manual compactions (#3971)
Summary:
Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971

Test Plan: add tests at more spots.

Differential Revision: D17369043

fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c
2019-09-16 21:01:47 -07:00
Peter Dillinger
68626249c3 Refactor/consolidate legacy Bloom implementation details (#5784)
Summary:
Refactoring to consolidate implementation details of legacy
Bloom filters. This helps to organize and document some related,
obscure code.

Also added make/cpp var TEST_CACHE_LINE_SIZE so that it's easy to
compile and run unit tests for non-native cache line size. (Fixed a
related test failure in db_properties_test.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5784

Test Plan:
make check, including Recently added Bloom schema unit tests
(in ./plain_table_db_test && ./bloom_test), and including with
TEST_CACHE_LINE_SIZE=128U and TEST_CACHE_LINE_SIZE=256U. Tested the
schema tests with temporary fault injection into new implementations.

Some performance testing with modified unit tests suggest a small to moderate
improvement in speed.

Differential Revision: D17381384

Pulled By: pdillinger

fbshipit-source-id: ee42586da996798910fc45ac0b6289147f16d8df
2019-09-16 16:17:09 -07:00
Maysam Yabandeh
638d239507 Charge block cache for cache internal usage (#5797)
Summary:
For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
This PR builds up on https://github.com/facebook/rocksdb/issues/4258
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797

Test Plan:
- Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
- The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.

Differential Revision: D17396833

Pulled By: maysamyabandeh

fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
2019-09-16 15:26:21 -07:00
Peter Dillinger
d3a6726f02 Revert changes from PR#5784 accidentally in PR#5780 (#5810)
Summary:
This will allow us to fix history by having the code changes for PR#5784 properly attributed to it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5810

Differential Revision: D17400231

Pulled By: pdillinger

fbshipit-source-id: 2da8b1cdf2533cfedb35b5526eadefb38c291f09
2019-09-16 11:38:53 -07:00
sdong
9bd5fce6e8 Refactor UniversalCompactionPicker code a little bit (#5639)
Summary:
Several functions of UniversalCompactionPicker share most of the parameters. Move these functions to a class with those shared arguments as class members. Hopefully this will make code slightly easier to maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5639

Test Plan: Run all existing test.

Differential Revision: D16996403

fbshipit-source-id: fffafd1897ab132b420b1dec073542cffb5c44de
2019-09-16 10:51:11 -07:00
sdong
b931f84e56 Divide file_reader_writer.h and .cc (#5803)
Summary:
file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803

Test Plan: Build whole project using make and cmake.

Differential Revision: D17374550

fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
2019-09-16 10:33:51 -07:00
Igor Canadi
97631357aa Allow ingesting overlapping files (#5539)
Summary:
Currently IngestExternalFile() fails when its input files' ranges overlap. This condition doesn't need to hold for files that are to be ingested in L0, though.

This commit allows overlapping files and forces their target level to L0.

Additionally, ingest job's completion is logged to EventLogger, analogous to flush and compaction jobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5539

Differential Revision: D17370660

Pulled By: riversand963

fbshipit-source-id: 749a3899b17d1be267a5afd5b0a99d96b38ab2f3
2019-09-13 14:49:47 -07:00
anand76
83a6a614e9 Refactor ArenaWrappedDBIter into separate files (#5801)
Summary:
Move definition and implementation for ArenaWrappedDBIter into its own .h/.cc files. Also, change inlining of functions to better comply with the Google C++ style guide.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5801

Test Plan: make check

Differential Revision: D17371012

Pulled By: anand1976

fbshipit-source-id: c1361abc2851575111e357a63d88be3b3d6cb341
2019-09-13 13:50:43 -07:00
Peter Dillinger
aa2486b23c Refactor some confusing logic in PlainTableReader
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5780

Test Plan: existing plain table unit test

Differential Revision: D17368629

Pulled By: pdillinger

fbshipit-source-id: f25409cdc2f39ebe8d5cbb599cf820270e6b5d26
2019-09-13 10:26:36 -07:00
Lingjing You
1a928c22a0 Add insert hints for each writebatch (#5728)
Summary:
Add insert hints for each writebatch so that they can be used in concurrent write, and add write option to enable it.

Bench result (qps):

`./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch-size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4`

master:

| batch size \ thread num | 1       | 2       | 4       | 8       |
| ----------------------- | ------- | ------- | ------- | ------- |
| 1                       | 387883  | 220790  | 308294  | 490998  |
| 10                      | 1397208 | 978911  | 1275684 | 1733395 |
| 100                     | 2045414 | 1589927 | 1798782 | 2681039 |
| 1000                    | 2228038 | 1698252 | 1839877 | 2863490 |

fillseq with writebatch hint:

| batch size \ thread num | 1       | 2       | 4       | 8       |
| ----------------------- | ------- | ------- | ------- | ------- |
| 1                       | 286005  | 223570  | 300024  | 466981  |
| 10                      | 970374  | 813308  | 1399299 | 1753588 |
| 100                     | 1962768 | 1983023 | 2676577 | 3086426 |
| 1000                    | 2195853 | 2676782 | 3231048 | 3638143 |
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728

Differential Revision: D17297240

fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c
2019-09-12 17:15:18 -07:00
Ronak Sisodia
d05c0fe4d1 Option to make write group size configurable (#5759)
Summary:
The max batch size that we can write to the WAL is controlled by a static manner. So if the leader write is less than 128 KB we will have the batch size as leader write size + 128 KB else the limit will be 1 MB. Both of them are statically defined.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5759

Differential Revision: D17329298

fbshipit-source-id: a3d910629d8d8ca84ea39ad89c2b2d284571ded5
2019-09-11 18:28:33 -07:00
Shylock Hg
9eb3e1f77d Use delete to disable automatic generated methods. (#5009)
Summary:
Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009

Differential Revision: D17288733

fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
2019-09-11 18:09:00 -07:00
anand76
2becafdb43 Fix Appveyor build due to signed/unsigned comparison
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5788

Test Plan: Travis CI and Appveyor should complete successfully.

Differential Revision: D17287422

Pulled By: anand1976

fbshipit-source-id: d9408b692f78be95d0088b29b33f6a8ff40ec97b
2019-09-10 14:34:37 -07:00
jsteemann
4d945c57ac do a bit less work in the normal case (#5695)
Summary:
i.e. if alive logfile is not being moved to archive while we are in GetSortedWalsOfType()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5695

Differential Revision: D17279489

Pulled By: vjnadimpalli

fbshipit-source-id: 02bcf920a75b812edba8b87c6079b4e6fd5e683c
2019-09-10 09:41:45 -07:00
Peter Dillinger
108c619acb Add regression test for serialized Bloom filters (#5778)
Summary:
Check that we don't accidentally change the on-disk format of
existing Bloom filter implementations, including for various
CACHE_LINE_SIZE (by changing temporarily).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5778

Test Plan: thisisthetest

Differential Revision: D17269630

Pulled By: pdillinger

fbshipit-source-id: c77017662f010a77603b7d475892b1f0d5563d8b
2019-09-09 14:51:30 -07:00
Wilfried Goesgens
fbab9913e2 upgrade gtest 1.7.0 => 1.8.1 for json result writing
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332

Differential Revision: D17242232

fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64
2019-09-09 11:24:11 -07:00
sdong
adbc25a4c8 Rename InternalDBStatsType enum names (#5779)
Summary:
When building with clang 9, warning is reported for InternalDBStatsType type names shadowed the one for statistics. Rename them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5779

Test Plan: Build with clang 9 and see it passes.

Differential Revision: D17239378

fbshipit-source-id: af28fb42066c738cd1b841f9fe21ab4671dafd18
2019-09-06 17:31:10 -07:00
Jeffrey Xiao
eae9f040eb Initialized pinned_pos_ and pinned_seq_pos_ in FragmentedRangeTombstoneIterator (#5720)
Summary:
These uninitialized member variables can cause a key to not be pinned when it should be, causing erroneous behavior. For example ingesting a file with range deletion tombstones will yield an "external file have corrupted keys" on a Mac.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5720

Differential Revision: D17217673

fbshipit-source-id: cd7df7ce3ad9cf69c841c4d3dc6fd144eff9e212
2019-09-05 17:30:29 -07:00
Yi Wu
83b991922e Fix EncryptedEnv assert (#5735)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/5734. By reading the code the assert don't quite make sense to me, since `dataSize` and `fileOffset` has no correlation. But my knowledge about `EncryptedEnv` is very limited.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5735

Test Plan:
run `ENCRYPTED_ENV=1 ./db_encryption_test`

Signed-off-by: Yi Wu <yiwu@pingcap.com>

Differential Revision: D17133849

fbshipit-source-id: bb7262d308e5b2503c400b180edc252668df0ef0
2019-09-05 17:21:42 -07:00
Peter Dillinger
b55b2f45d0 Faster new DynamicBloom implementation (for memtable) (#5762)
Summary:
Since DynamicBloom is now only used in-memory, we're free to
change it without schema compatibility issues. The new implementation
is drawn from (with manifest permission)
303542a767/bloom_simulation_tests/foo.cc (L613)

This has several speed advantages over the prior implementation:
* Uses fastrange instead of %
* Minimum logic to determine first (and all) probed memory addresses
* (Major) Two probes per 64-bit memory fetch/write.
* Very fast and effective (murmur-like) hash expansion/re-mixing. (At
least on recent CPUs, integer multiplication is very cheap.)

While a Bloom filter with 512-bit cache locality has about a 1.15x FP
rate penalty (e.g. 0.84% to 0.97%), further restricting to two probes
per 64 bits incurs an additional 1.12x FP rate penalty (e.g. 0.97% to
1.09%). Nevertheless, the unit tests show no "mediocre" FP rate samples,
unlike the old implementation with more erratic FP rates.

Especially for the memtable, we expect speed to outweigh somewhat higher
FP rates. For example, a negative table query would have to be 1000x
slower than a BF query to justify doubling BF query time to shave 10% off
FP rate (working assumption around 1% FP rate). While that seems likely
for SSTs, my data suggests a speed factor of roughly 50x for the memtable
(vs. BF; ~1.5% lower write throughput when enabling memtable Bloom
filter, after this change).  Thus, it's probably not worth even 5% more
time in the Bloom filter to shave off 1/10th of the Bloom FP rate, or 0.1%
in absolute terms, and it's probably at least 20% slower to recoup that
much FP rate from this new implementation. Because of this, we do not see
a need for a 'locality' option that affects the MemTable Bloom filter
and have decoupled the MemTable Bloom filter from Options::bloom_locality.

Note that just 3% more memory to the Bloom filter (10.3 bits per key vs.
just 10) is able to make up for the ~12% FP rate drop in the new
implementation:

[] # Nearly "ideal" FP-wise but reasonably fast cache-local implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out 10000000 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out time: 3.29372 sampled_fp_rate: 0.00985956 ...

[] # Close match to this new implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out 10000000 6 10.3 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.10072 sampled_fp_rate: 0.00985655 ...

[] # Old locality=1 implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out 10000000 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out time: 3.95472 sampled_fp_rate: 0.00988943 ...

Also note the dramatic speed improvement vs. alternatives.

--

Performance unit test: DynamicBloomTest.concurrent_with_perf is updated
to report more precise timing data. (Measure running time of each
thread, not just longest running thread, etc.) Results averaged over
various sizes enabled with --enable_perf and 20 runs each; old dynamic
bloom refers to locality=1, the faster of the old:

old dynamic bloom, avg add latency = 65.6468
new dynamic bloom, avg add latency = 44.3809
old dynamic bloom, avg query latency = 50.6485
new dynamic bloom, avg query latency = 43.2186
old avg parallel add latency = 41.678
new avg parallel add latency = 24.5238
old avg parallel hit latency = 14.6322
new avg parallel hit latency = 12.3939
old avg parallel miss latency = 16.7289
new avg parallel miss latency = 12.2134

Tested on a dedicated 64-bit production machine at Facebook. Significant
improvement all around.

Despite now using std::atomic<uint64_t>, quick before-and-after test on
a 32-bit machine (Intel Atom N270, released 2008) shows no regression in
performance, in some cases modest improvement.

--

Performance integration test (synthetic): with DEBUG_LEVEL=0, used
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom,readmissing,readrandom,stats --num=2000000
and optionally with -memtable_whole_key_filtering -memtable_bloom_size_ratio=0.01
300 runs each configuration.

Write throughput change by enabling memtable bloom:
Old locality=0: -3.06%
Old locality=1: -2.37%
New:            -1.50%
conclusion -> seems to substantially close the gap

Readmissing throughput change by enabling memtable bloom:
Old locality=0: +34.47%
Old locality=1: +34.80%
New:            +33.25%
conclusion -> maybe a small new penalty from FP rate

Readrandom throughput change by enabling memtable bloom:
Old locality=0: +31.54%
Old locality=1: +31.13%
New:            +30.60%
conclusion -> maybe also from FP rate (after memtable flush)

--

Another conclusion we can draw from this new implementation is that the
existing 32-bit hash function is not inherently crippling the Bloom
filter speed or accuracy, below about 5 million keys. For speed, the
implementation is essentially the same whether starting with 32-bits or
64-bits of hash; it just determines whether the first multiplication
after fastrange is a pseudorandom expansion or needed re-mix. Note that
this multiplication can occur while memory is fetching.

For accuracy, in a standard configuration, you need about 5 million
keys before you have about a 1.1x FP penalty due to using a
32-bit hash vs. 64-bit:

[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.52069 sampled_fp_rate: 0.0118267 ...
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out time: 2.43871 sampled_fp_rate: 0.0109059
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5762

Differential Revision: D17214194

Pulled By: pdillinger

fbshipit-source-id: ad9da031772e985fd6b62a0e1db8e81892520595
2019-09-05 14:59:25 -07:00
Peter Dillinger
20dec1401f Copy/split PlainTableBloomV1 from DynamicBloom (refactor) (#5767)
Summary:
DynamicBloom was being used both for memory-only and for on-disk filters, as part of the PlainTable format. To set up enhancements to the memtable Bloom filter, this splits the code into two copies and removes unused features from each copy. Adds test PlainTableDBTest.BloomSchema to ensure no accidental change to that format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5767

Differential Revision: D17206963

Pulled By: pdillinger

fbshipit-source-id: 6cce8d55305ed0df051b4c58bdc98c8ad81d0553
2019-09-05 10:05:20 -07:00
Affan Dar
229e6fbe0e Adding DB::GetCurrentWalFile() API as a repliction/backup helper (#5765)
Summary:
Adding a light weight API to get last live WAL file name and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.

Specifically within MySQL's backup/restore mechanism, this call can be made with a write lock on the mysql db to get a transactionally consistent snapshot of the current WAL file position along with other non-rocksdb log/data files.

Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765

Differential Revision: D17172717

Pulled By: affandar

fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37
2019-09-04 12:10:17 -07:00
Yanqin Jin
38b17ecd0e Replace named comparator struct with lambda (#5768)
Summary:
Tiny code mod: replace a named comparator struct with anonymous lambda.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5768

Differential Revision: D17185141

Pulled By: riversand963

fbshipit-source-id: fabe367649931c33a39ad035dc707d2efc3ad5fc
2019-09-04 11:38:34 -07:00
Vijay Nadimpalli
979fbdc696 Persistent globally unique DB ID in manifest (#5725)
Summary:
Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong.
The goal of this PR is solve this problem by moving the  DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725

Test Plan: Added unit tests.

Differential Revision: D16963840

Pulled By: vjnadimpalli

fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
2019-09-03 08:52:24 -07:00
Yanqin Jin
44eca41add Fix a bug in file ingestion (#5760)
Summary:
Before this PR, when the number of column families involved in a file ingestion exceeds 2, a bug in the looping logic prevents correct file number being assigned to each ingestion job.
Also skip deleting non-existing hard links during cleanup-after-failure.

Test plan (devserver)
```
$COMPILE_WITH_ASAN=1 make all
$./external_sst_file_test --gtest_filter=ExternalSSTFileTest/ExternalSSTFileTest.IngestFilesIntoMultipleColumnFamilies_*/*
$makke check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5760

Differential Revision: D17142982

Pulled By: riversand963

fbshipit-source-id: 06c1847a4e7a402647bcf28d124e70f2a0f9daf6
2019-08-30 18:29:07 -07:00
Yanqin Jin
672befea2a Fix assertion failure in FIFO compaction with TTL (#5754)
Summary:
Before this PR, the following sequence of events can cause assertion failure as shown below.
Stack trace (partial):
```
(gdb) bt
2  0x00007f59b350ad15 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:92
3  0x00007f59b350adc3 in __GI___assert_fail (assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:101
4  0x0000000000492ccd in rocksdb::Compaction::MarkFilesBeingCompacted (this=<optimized out>, mark_as_compacted=<optimized out>) at db/compaction/compaction.cc:394
5  0x000000000049467a in rocksdb::Compaction::Compaction (this=0x7f59af013000, vstorage=0x7f581af53030, _immutable_cf_options=..., _mutable_cf_options=..., _inputs=..., _output_level=<optimized out>, _target_file_size=0, _max_compaction_bytes=0, _output_path_id=0, _compression=<incomplete type>, _compression_opts=..., _max_subcompactions=0, _grandparents=..., _manual_compaction=false, _score=4, _deletion_compaction=true, _compaction_reason=rocksdb::CompactionReason::kFIFOTtl) at db/compaction/compaction.cc:241
6  0x00000000004af9bc in rocksdb::FIFOCompactionPicker::PickTTLCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:101
7  0x00000000004b0771 in rocksdb::FIFOCompactionPicker::PickCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:201
8  0x00000000004838cc in rocksdb::ColumnFamilyData::PickCompaction (this=this@entry=0x7f59b31b3700, mutable_options=..., log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/column_family.cc:933
9  0x00000000004f3645 in rocksdb::DBImpl::BackgroundCompaction (this=this@entry=0x7f59b3176000, made_progress=made_progress@entry=0x7f59b1bfa6bf, job_context=job_context@entry=0x7f59b1bfa760, log_buffer=log_buffer@entry=0x7f59b1bfa930, prepicked_compaction=prepicked_compaction@entry=0x0, thread_pri=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2541
10 0x00000000004f5e2a in rocksdb::DBImpl::BackgroundCallCompaction (this=this@entry=0x7f59b3176000, prepicked_compaction=prepicked_compaction@entry=0x0, bg_thread_pri=bg_thread_pri@entry=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2312
11 0x00000000004f648e in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>) at db/db_impl/db_impl_compaction_flush.cc:2087
```
This can be caused by the following sequence of events.
```
Time
|      thr          bg_compact_thr1                     bg_compact_thr2
|      write
|      flush
|                   mark all l0 as being compacted
|      write
|      flush
|                   add cf to queue again
|                                                       mark all l0 as being
|                                                       compacted, fail the
|                                                       assertion
V
```
Test plan (on devserver)
Since bg_compact_thr1 and bg_compact_thr2 are two threads executing the same
code, it is difficult to use sync point dependency to
coordinate their execution. Therefore, I choose to use db_stress.
```
$TEST_TMPDIR=/dev/shm/rocksdb ./db_stress --periodic_compaction_seconds=1 --max_background_compactions=20 --format_version=2 --memtablerep=skip_list --max_write_buffer_number=3 --cache_index_and_filter_blocks=1 --reopen=20 --recycle_log_file_num=0 --acquire_snapshot_one_in=10000 --delpercent=4 --log2_keys_per_lock=22 --compaction_ttl=1 --block_size=16384 --use_multiget=1 --compact_files_one_in=1000000 --target_file_size_multiplier=2 --clear_column_family_one_in=0 --max_bytes_for_level_base=10485760 --use_full_merge_v1=1 --target_file_size_base=2097152 --checkpoint_one_in=1000000 --mmap_read=0 --compression_type=zstd --writepercent=35 --readpercent=45 --subcompactions=4 --use_merge=0 --write_buffer_size=4194304 --test_batches_snapshots=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --use_direct_reads=0 --compact_range_one_in=1000000 --open_files=-1 --destroy_db_initially=0 --progress_reports=0 --compression_zstd_max_train_bytes=0 --snapshot_hold_ops=100000 --enable_pipelined_write=0 --nooverwritepercent=1 --compression_max_dict_bytes=0 --max_key=1000000 --prefixpercent=5 --flush_one_in=1000000 --ops_per_thread=40000 --index_block_restart_interval=7 --cache_size=1048576 --compaction_style=2 --verify_checksum=1 --delrangepercent=1 --use_direct_io_for_flush_and_compaction=0
```
This should see no assertion failure.
Last but not least,
```
$COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5754

Differential Revision: D17109791

Pulled By: riversand963

fbshipit-source-id: 25fc46101235add158554e096540b72c324be078
2019-08-30 12:42:01 -07:00
Pratik Dhandharia
a281822331 Lower the risk for users to run options.force_consistency_checks = true (#5744)
Summary:
Open-source users recently reported two occurrences of LSM-tree corruption (https://github.com/facebook/rocksdb/issues/5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be.

Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5744

Differential Revision: D17096940

Pulled By: pdhandharia

fbshipit-source-id: b6780039044e265f26ed2ad03c51f4abbe8b603c
2019-08-29 14:07:37 -07:00
anand76
1729779b85 Disable MultiGet row cache test in LITE mode (#5756)
Summary:
Row cache is not supported in LITE mode. So disable the test in that mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5756

Test Plan: make LITE=1 all check

Differential Revision: D17115684

Pulled By: anand1976

fbshipit-source-id: e6433c2e528674645cea76cdfc80ddc473708fc2
2019-08-29 12:13:28 -07:00
anand76
e10570331d Support row cache with batched MultiGet (#5706)
Summary:
This PR adds support for row cache in ```rocksdb::TableCache::MultiGet```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5706

Test Plan:
1. Unit tests in db_basic_test
2. db_bench results with batch size of 2 (```Get``` is faster than ```MultiGet``` for single key) -
Get -
readrandom   :       3.935 micros/op 254116 ops/sec;   28.1 MB/s (22870998 of 22870999 found)
MultiGet -
multireadrandom :       3.743 micros/op 267190 ops/sec; (24047998 of 24047998 found)

Command used -
TEST_TMPDIR=/dev/shm/multiget numactl -C 10  ./db_bench -use_existing_db=true -use_existing_keys=false -benchmarks="readtorowcache,[read|multiread]random" -write_buffer_size=16777216 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -row_cache_size=4194304000 -batch_size=2 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=131072

Differential Revision: D17086297

Pulled By: anand1976

fbshipit-source-id: 85784378da913e05f1baf31ec1b4e7c9345e7f57
2019-08-28 16:11:56 -07:00
Pratik Dhandharia
1b4c104a67 replace some reinterpret_cast with static_cast_with_check (#5740)
Summary:
This PR focuses on replacing some of the reinterpret_cast<DBImpl*> to static_cast_with_check<DBImpl, DB>.

Files impacted:

./db/db_impl/db_impl_compaction_flush.cc
./db/write_batch.cc
./utilities/blob_db/blob_db_impl.cc
./utilities/transactions/pessimistic_transaction_db.cc
./utilities/transactions/transaction_base.cc
./utilities/transactions/write_prepared_txn_db.cc
./utilities/transactions/write_unprepared_txn_db.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5740

Differential Revision: D17055691

Pulled By: pdhandharia

fbshipit-source-id: 0f8034d1b32eade56e37d59c04b7bf236a81d8e8
2019-08-27 10:59:11 -07:00
Andrew Kryczka
ba0967b567 Reduce severity of too many levels log message (#5742)
Summary:
This condition is now a normal occurrence during write burst so there is
no need to warn the user about it. Here is a scenario where it happens
under completely normal conditions.

* Initially we have a DB of three levels (L0, L1, and L2) that is stable, i.e., compaction scores are all less than one.
* Now a write burst comes along. At first L0 blows up a bit in size as compaction hasn't had a chance to catch up.
* As a result of the above, `base_bytes_min` also increases since it is based on L0 size as of https://github.com/facebook/rocksdb/issues/4338
* If `base_bytes_min` increased enough (i.e., to be larger than L1), then we are shown the warning that the DB has more levels than necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5742

Differential Revision: D17059221

fbshipit-source-id: e4a31d6eea42089a8d273095f19653991bd91bea
2019-08-26 15:00:43 -07:00
jsteemann
62829ff751 reuse scratch buffer in transaction_log_reader (#5702)
Summary:
in order to avoid reallocations for a scratch std::string on every call to Next().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5702

Differential Revision: D16867803

fbshipit-source-id: 1391220a1b172b23336bbc71dc0c79ccf3b1c701
2019-08-26 11:26:29 -07:00
Zhongyi Xie
2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
anand76
9046bdc5d3 Fix MultiGet() bug when whole_key_filtering is disabled (#5665)
Summary:
The batched MultiGet() implementation was not correctly handling bloom filter lookups when whole_key_filtering is disabled. It was incorrectly skipping keys not in the prefix_extractor domain, and not calling transform for keys in domain. This PR fixes both problems by moving the domain check and transformation to the FilterBlockReader.

Tests:
Unit test (confirmed failed before the fix)
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5665

Differential Revision: D16902380

Pulled By: anand1976

fbshipit-source-id: a6be81ad68a6e37134a65246aec7a2c590eccf00
2019-08-21 10:23:23 -07:00
Maysam Yabandeh
7bc18e2727 Disable snapshot refresh feature when snap_refresh_nanos is 0 (#5724)
Summary:
The comments of snap_refresh_nanos advertise that the snapshot refresh feature will be disabled when the option is set to 0. This contract is however not honored in the code: https://github.com/facebook/rocksdb/pull/5278
The patch fixes that and also adds an assert to ensure that the feature is not used when the option  is zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5724

Differential Revision: D16918185

Pulled By: maysamyabandeh

fbshipit-source-id: fec167287df7d85093e087fc39c0eb243e3bbd7e
2019-08-20 11:40:07 -07:00
sdong
3552473668 Introduce IngestExternalFileOptions.verify_checksums_readahead_size (#5721)
Summary:
Recently readahead is introduced for checksum verifying. However, users cannot override the setting for the checksum verifying before external SST file ingestion. Introduce a new option for the purpose.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5721

Test Plan: Add a new unit test for it.

Differential Revision: D16906896

fbshipit-source-id: 218ec37001ddcc05411cefddbe233d15ab308476
2019-08-20 10:43:39 -07:00
sdong
4c74dba5fa Bump up memory order of ref counting of ColumnFamilyData (#5723)
Summary:
We see this TSAN warning:

WARNING: ThreadSanitizer: data race (pid=282806)
  Write of size 8 at 0x7b6c00000e38 by thread T16 (mutexes: write M1023578822185846136):
    #0 operator delete(void*) <null> (libtsan.so.0+0x0000000795f8)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2202 (db_flush_test+0x00000060b462)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2226 (db_flush_test+0x00000060cbd8)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BGWorkFlush(void*) db/db_impl/db_impl_compaction_flush.cc:2073 (db_flush_test+0x00000060d5ac)
    ......

Previous atomic write of size 4 at 0x7b6c00000e38 by main thread:
    #0 __tsan_atomic32_fetch_sub <null> (libtsan.so.0+0x00000006d721)
    https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<int>::fetch_sub(int, std::memory_order) /mnt/gvfs/third-party2/libgcc/c67031f0f739ac61575a061518d6ef5038f99f90/7.x/platform007/5620abc/include/c++/7.3.0/bits/atomic_base.h:524 (db_flush_test+0x0000005f9e38)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::ColumnFamilyData::Unref() db/column_family.h:286 (db_flush_test+0x0000005f9e38)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&, rocksdb::FlushReason, bool) db/db_impl/db_impl_compaction_flush.cc:1624 (db_flush_test+0x0000005f9e38)
    https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::TEST_FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&) db/db_impl/db_impl_debug.cc:127 (db_flush_test+0x00000061ace9)
    https://github.com/facebook/rocksdb/issues/5 rocksdb::DBFlushTest_CFDropRaceWithWaitForFlushMemTables_Test::TestBody() db/db_flush_test.cc:320 (db_flush_test+0x0000004b44e5)
    https://github.com/facebook/rocksdb/issues/6 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.7.0/fused-src/gtest/gtest-all.cc:3824 (db_flush_test+0x000000be2988)
    ......

It's still very clear the cause of the warning is because that TSAN treats results from relaxed atomic::fetch_sub() as non-atomic with the operation itself. We can make it more explicit by bumping up the order to CS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5723

Test Plan: Run all existing test.

Differential Revision: D16908250

fbshipit-source-id: bf17d39ed19058372bdf97f6440a743f88153021
2019-08-20 10:34:33 -07:00
sdong
e1c468d16f Do readahead in VerifyChecksum() (#5713)
Summary:
Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713

Test Plan: Add a new unit test.

Differential Revision: D16860874

fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
2019-08-16 16:42:56 -07:00
Eli Pozniansky
c2404d9928 Optimizing ApproximateSize to create index iterator just once (#5693)
Summary:
VersionSet::ApproximateSize doesn't need to create two separate index iterators and do binary search for each in BlockBasedTable. So BlockBasedTable::ApproximateSize was added that creates the iterator once and uses it to calculate the data size between start and end keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5693

Differential Revision: D16774056

Pulled By: elipoz

fbshipit-source-id: 53ce262e1a057788243bf30cd9b8aa6581df1a18
2019-08-16 14:18:28 -07:00
sdong
bd2c753dd0 Add command "list_file_range_deletes" in ldb (#5615)
Summary:
Add a command in ldb so that users can print out tombstones in SST files.
In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't return from the program, but return the status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615

Test Plan: Add a new unit test

Differential Revision: D16550326

fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
2019-08-15 17:01:03 -07:00
Jeffrey Xiao
d61d4507c0 Fix IngestExternalFile overlapping check (#5649)
Summary:
Previously, the end key of a range deletion tombstone was considered exclusive for the purposes of deletion, but considered inclusive when checking if two SSTables overlap. For example, an SSTable with a range deletion tombstone [a, b) would be considered overlapping with an SSTable with a range deletion tombstone [b, c). This commit fixes this check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5649

Differential Revision: D16808765

Pulled By: anand1976

fbshipit-source-id: 5c7ad1c027e4f778d35070e5dae1b8e6037e0d68
2019-08-14 21:02:28 -07:00
Zhongyi Xie
de3fb9a6ff exclude TEST_ENV_URI from rocksdb lite (#5686)
Summary:
PR https://github.com/facebook/rocksdb/pull/5676 added some test coverage for `TEST_ENV_URI`, which unfortunately isn't supported in lite mode, causing some test failures for rocksdb lite. For example,
```
db/db_test_util.cc: In constructor ‘rocksdb::DBTestBase::DBTestBase(std::__cxx11::string)’:
db/db_test_util.cc:57:16: error: ‘ObjectRegistry’ has not been declared
     Status s = ObjectRegistry::NewInstance()->NewSharedObject(test_env_uri,
                ^
```
This PR fixes these errors by excluding the new code from test functions for lite mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5686

Differential Revision: D16749000

Pulled By: miasantreble

fbshipit-source-id: e8b3088c31a78b3dffc5fe7814261909d2c3e369
2019-08-10 19:15:05 -07:00
Yanqin Jin
5d9a67e718 Support loading custom objects in unit tests (#5676)
Summary:
Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc.
This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. RocksDB unit test can call `RegisterCustomObjects()` at the beginning.
By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop.
Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
All unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676

Differential Revision: D16679157

Pulled By: riversand963

fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d
2019-08-09 15:12:08 -07:00
Vijay Nadimpalli
d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
Zhongyi Xie
d1c9ede195 Fix duplicated file names in PurgeObsoleteFiles (#5603)
Summary:
Currently in `DBImpl::PurgeObsoleteFiles`, the list of candidate files is create through a combination of calling LogFileName using `log_delete_files` and `full_scan_candidate_files`.

In full_scan_candidate_files, the filenames look like this
{file_name = "074715.log", file_path = "/txlogs/3306"},
but LogFileName produces filenames like this that prepends a slash:
{file_name = "/074715.log", file_path = "/txlogs/3306"},

This confuses the dedup step here: bb4178066d/db/db_impl/db_impl_files.cc (L339-L345)

Because duplicates still exist, DeleteFile is called on the same file twice, and hits an error on the second try. Error message: Failed to mark /txlogs/3302/764418.log as trash.

The root cause is the use of `kDumbDbName` when generating file names, it creates file names like /074715.log. This PR removes the use of `kDumbDbName` and create paths without leading '/' when dbname can be ignored.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5603

Test Plan: make check

Differential Revision: D16413203

Pulled By: miasantreble

fbshipit-source-id: 6ba8288382c55f7d5e3892d722fc94b57d2e4491
2019-08-01 15:50:05 -07:00
Levi Tamasi
1dfc5eaab0 Test the various configurations in parallel in MergeOperatorPinningTest (#5659)
Summary:
MergeOperatorPinningTest.Randomized frequently times out under TSAN
because it tests ~40 option configurations sequentially in a loop. The
patch parallelizes the tests of the various configurations to make the
test complete faster.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5659

Test Plan: Tested using buck test mode/dev-tsan ...

Differential Revision: D16587518

Pulled By: ltamasi

fbshipit-source-id: 65bd25c0ad9a23587fed5592e69c1a0097fa27f6
2019-07-31 15:20:26 -07:00
Manuel Ung
f622ca2c7c WriteUnPrepared: savepoint support (#5627)
Summary:
Add savepoint support when the current transaction has flushed unprepared batches.

Rolling back to savepoint is similar to rolling back a transaction. It requires the set of keys that have changed since the savepoint, re-reading the keys at the snapshot at that savepoint, and the restoring the old keys by writing out another unprepared batch.

For this strategy to work though, we must be capable of reading keys at a savepoint. This does not work if keys were written out using the same sequence number before and after a savepoint. Therefore, when we flush out unprepared batches, we must split the batch by savepoint if any savepoints exist.

eg. If we have the following:
```
Put(A)
Put(B)
Put(C)
SetSavePoint()
Put(D)
Put(E)
SetSavePoint()
Put(F)
```

Then we will write out 3 separate unprepared batches:
```
Put(A) 1
Put(B) 1
Put(C) 1
Put(D) 2
Put(E) 2
Put(F) 3
```

This is so that when we rollback to eg. the first savepoint, we can just read keys at snapshot_seq = 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5627

Differential Revision: D16584130

Pulled By: lth

fbshipit-source-id: 6d100dd548fb20c4b76661bd0f8a2647e64477fa
2019-07-31 13:39:39 -07:00
Eli Pozniansky
4834dab578 Improve CPU Efficiency of ApproximateSize (part 2) (#5609)
Summary:
In some cases, we don't have to get really accurate number. Something like 10% off is fine, we can create a new option for that use case. In this case, we can calculate size for full files first, and avoid estimation inside SST files if full files got us a huge number. For example, if we already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609

Differential Revision: D16433481

Pulled By: elipoz

fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b
2019-07-31 08:50:00 -07:00
Manuel Ung
399f477818 WriteUnPrepared: Use WriteUnpreparedTxnReadCallback for MultiGet (#5634)
Summary:
The `TransactionTest.MultiGetBatchedTest` were failing with unprepared batches because we were not using the correct callbacks. Override MultiGet to pass down the correct ReadCallback. A similar problem is also fixed in WritePrepared.

This PR also fixes an issue similar to (https://github.com/facebook/rocksdb/pull/5147), but for MultiGet instead of Get.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5634

Differential Revision: D16552674

Pulled By: lth

fbshipit-source-id: 736eaf8e919c6b13d5f5655b1c0d36b57ad04804
2019-07-29 17:56:13 -07:00
Eli Pozniansky
9625a2bc2b Added SizeApproximationOptions to DB::GetApproximateSizes (#5626)
Summary:
The new DB::GetApproximateSizes with SizeApproximationOptions argument, which allows to add more options/knobs to the DB::GetApproximateSizes call (beyond only the include_flags)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5626

Differential Revision: D16496913

Pulled By: elipoz

fbshipit-source-id: ee8c6c182330a285fa056ecfc3905a592b451720
2019-07-25 22:42:30 -07:00
Yanqin Jin
ae152ee666 Avoid user key copying for Get/Put/Write with user-timestamp (#5502)
Summary:
In previous https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`.
We address these issues in this PR by doing the following.
1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc.
2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`.
3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`.

Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
If the API extension looks good, I will add more unit tests.

Some simple benchmark using db_bench.
```
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true
```
Master is at a78503bd6c.
```
|        | readrandom | fillrandom |
| master | 15.53 MB/s | 25.97 MB/s |
| PR5502 | 16.70 MB/s | 25.80 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502

Differential Revision: D16340894

Pulled By: riversand963

fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
2019-07-25 15:27:39 -07:00
sdong
f5b951f7b6 Fix wrong info log printing for num_range_deletions (#5617)
Summary:
num_range_deletions printing is wrong in this log line:

2019/07/18-12:59:15.309271 7f869f9ff700 EVENT_LOG_v1 {"time_micros": 1563479955309228, "cf_name": "5", "job": 955, "event": "table_file_creation", "file_number": 34579, "file_size": 2239842, "table_properties": {"data_size": 1988792, "index_size": 3067, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 1, "filter_size": 170821, "raw_key_size": 1951792, "raw_average_key_size": 16, "raw_value_size": 1731720, "raw_average_value_size": 14, "num_data_blocks": 199, "num_entries": 121987, "num_deletions": 15184, "num_merge_operands": 86512, "num_range_deletions": 86512, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "5", "column_family_id": 5, "comparator": "leveldb.BytewiseComparator", "merge_operator": "PutOperator", "prefix_extractor_name": "rocksdb.FixedPrefix.7", "property_collectors": "[]", "compression": "ZSTD", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1563479951, "oldest_key_time": 0, "file_creation_time": 1563479954}}

It actually prints "num_merge_operands" number. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5617

Test Plan: Just build.

Differential Revision: D16453110

fbshipit-source-id: fc1024b3cd5650312ed47a1379f0d2cf8b2d8a8f
2019-07-23 19:38:16 -07:00