Commit Graph

205 Commits

Author SHA1 Message Date
Cheng Chang
40497a875a Reduce memory copies when fetching and uncompressing blocks from SST files (#6689)
Summary:
In https://github.com/facebook/rocksdb/pull/6455, we modified the interface of `RandomAccessFileReader::Read` to be able to get rid of memcpy in direct IO mode.
This PR applies the new interface to `BlockFetcher` when reading blocks from SST files in direct IO mode.

Without this PR, in direct IO mode, when fetching and uncompressing compressed blocks, `BlockFetcher` will first copy the raw compressed block into `BlockFetcher::compressed_buf_` or `BlockFetcher::stack_buf_` inside `RandomAccessFileReader::Read` depending on the block size. then during uncompressing, it will copy the uncompressed block into `BlockFetcher::heap_buf_`.

In this PR, we get rid of the first memcpy and directly uncompress the block from `direct_io_buf_` to `heap_buf_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6689

Test Plan: A new unit test `block_fetcher_test` is added.

Reviewed By: anand1976

Differential Revision: D21006729

Pulled By: cheng-chang

fbshipit-source-id: 2370b92c24075692423b81277415feb2aed5d980
2020-04-24 15:32:56 -07:00
anand76
5c19a441c4 Fault injection in db_stress (#6538)
Summary:
This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterator as well once its proven stable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538

Test Plan:
crash_test
make check

Reviewed By: riversand963

Differential Revision: D20714347

Pulled By: anand1976

fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7
2020-04-10 17:21:26 -07:00
Yanqin Jin
0c05624d50 Compaction with timestamp: input boundaries (#6645)
Summary:
Towards making compaction logic compatible with user timestamp.
When computing boundaries and overlapping ranges for inputs of compaction, We need to compare SSTs by user key without timestamp.

Test plan (devserver):
```
make check
```
Several individual tests:
```
./version_set_test --gtest_filter=VersionStorageInfoTimestampTest.GetOverlappingInputs
./db_with_timestamp_compaction_test
./db_with_timestamp_basic_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6645

Reviewed By: ltamasi

Differential Revision: D20960012

Pulled By: riversand963

fbshipit-source-id: ad377fa9eb481bf7a8a3e1824aaade48cdc653a4
2020-04-10 16:05:49 -07:00
Cheng Chang
d648a0e17f Add unit test for TransactionLockMgr (#6599)
Summary:
Although there are tests related to locking in transaction_test, this new test directly tests against TransactionLockMgr.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6599

Test Plan: make transaction_lock_mgr_test && ./transaction_lock_mgr_test

Reviewed By: lth

Differential Revision: D20673749

Pulled By: cheng-chang

fbshipit-source-id: 1fa4a13218e68d785f5a99924556751a8c5c0f31
2020-04-08 13:51:51 -07:00
Sagar Vemuri
0355d14dd9 Add a simple timer support to schedule work at fixed times/intervals (#6543)
Summary:
Adding a simple timer support to schedule work at a fixed time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6543

Test Plan: TODO: clean up the unit tests, and make them better.

Reviewed By: siying

Differential Revision: D20465390

Pulled By: sagar0

fbshipit-source-id: cba143f70b6339863e1d0f8b8bf92e51c2b3d678
2020-04-07 11:55:27 -07:00
Ziyue Yang
03a781a90c Add pipelined & parallel compression optimization (#6262)
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262

Reviewed By: ajkr

Differential Revision: D20651306

fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
2020-04-01 16:40:18 -07:00
Levi Tamasi
6f62322fe4 Add blob files to VersionStorageInfo/VersionBuilder (#6597)
Summary:
The patch adds a couple of classes to represent metadata about
blob files: `SharedBlobFileMetaData` contains the information elements
that are immutable (once the blob file is closed), e.g. blob file number,
total number and size of blob files, checksum method/value, while
`BlobFileMetaData` contains attributes that can vary across versions like
the amount of garbage in the file. There is a single `SharedBlobFileMetaData`
for each blob file, which is jointly owned by the `BlobFileMetaData` objects
that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s
and can also be shared if the (immutable _and_ mutable) state of the blob file
is the same in two versions.

In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends
`VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those
containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata
to a new `VersionStorageInfo`. Consistency checks are also extended to ensure
that table files point to blob files that are part of the `Version`, and that all blob files
that are part of any given `Version` have at least some _non_-garbage data in them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D20656803

Pulled By: ltamasi

fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
2020-03-26 18:51:53 -07:00
Yanqin Jin
fb09ef05dc Attempt to recover from db with missing table files (#6334)
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.

Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334

Reviewed By: anand1976

Differential Revision: D19778960

Pulled By: riversand963

fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
2020-03-20 19:30:48 -07:00
Cheng Chang
4fc216649d Support direct IO in RandomAccessFileReader::MultiRead (#6446)
Summary:
By supporting direct IO in RandomAccessFileReader::MultiRead, the benefits of parallel IO (IO uring) and direct IO can be combined.

In direct IO mode, read requests are aligned and merged together before being issued to RandomAccessFile::MultiRead, so blocks in the original requests might share the same underlying buffer, the shared buffers are returned in `aligned_bufs`, which is a new parameter of the `MultiRead` API.

For example, suppose alignment requirement for direct IO is 4KB, one request is (offset: 1KB, len: 1KB), another request is (offset: 3KB, len: 1KB), then since they all belong to page (offset: 0, len: 4KB), `MultiRead` only reads the page with direct IO into a buffer on heap, and returns 2 Slices referencing regions in that same buffer. See `random_access_file_reader_test.cc` for more examples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6446

Test Plan: Added a new test `random_access_file_reader_test.cc`.

Reviewed By: anand1976

Differential Revision: D20097518

Pulled By: cheng-chang

fbshipit-source-id: ca48a8faf9c3af146465c102ef6b266a363e78d1
2020-03-20 16:33:26 -07:00
sdong
d66908091d De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.

This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531

Test Plan: build using make and cmake. And build release

Differential Revision: D20473108

fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 12:20:50 -07:00
Zhichao Cao
5c30e6c088 Separate timestamp related test from db_basic_test (#6516)
Summary:
In some of the test, db_basic_test may cause time out due to its long running time. Separate the timestamp related test from db_basic_test to avoid the potential issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6516

Test Plan: pass make asan_check

Differential Revision: D20423922

Pulled By: zhichao-cao

fbshipit-source-id: d6306f89a8de55b07bf57233e4554c09ef1fe23a
2020-03-13 11:37:15 -07:00
sdong
674cf41732 Divide block_based_table_reader.cc (#6527)
Summary:
block_based_table_reader.cc is a giant file, which makes it hard for users to navigate the code. Divide the files to multiple files.
Some class templates cannot be moved to .cc file. They are moved to .h files. It is still better than including them all in block_based_table_reader.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6527

Test Plan: "make all check" and "make release". Also build using cmake.

Differential Revision: D20428455

fbshipit-source-id: ca713c698469f07f35bc0c271358c0874ed4eb28
2020-03-12 21:41:50 -07:00
Levi Tamasi
c15e85bdcb Move BlobDB related files under db/ to db/blob/ (#6519)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6519

Test Plan:
```
make all
make check
```

Differential Revision: D20400691

Pulled By: ltamasi

fbshipit-source-id: 20ef911cf1c2c92c7f71ef0b493f9be64f2eef94
2020-03-12 11:00:56 -07:00
Cheng Chang
2d9efc9ab2 Cache result of GetLogicalBufferSize in Linux (#6457)
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.

This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.

Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457

Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.

`make check`

Differential Revision: D20131243

Pulled By: cheng-chang

fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
2020-03-11 18:40:05 -07:00
Levi Tamasi
f5bc3b99d5 Split BlobFileState into an immutable and a mutable part (#6502)
Summary:
It's never too soon to refactor something. The patch splits the recently
introduced (`VersionEdit` related) `BlobFileState` into two classes
`BlobFileAddition` and `BlobFileGarbage`. The idea is that once blob files
are closed, they are immutable, and the only thing that changes is the
amount of garbage in them. In the new design, `BlobFileAddition` contains
the immutable attributes (currently, the count and total size of all blobs, checksum
method, and checksum value), while `BlobFileGarbage` contains the mutable
GC-related information elements (count and total size of garbage blobs). This is a
better fit for the GC logic and is more consistent with how SST files are handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6502

Test Plan: `make check`

Differential Revision: D20348352

Pulled By: ltamasi

fbshipit-source-id: ff93f0121e80ab15e0e0a6525ba0d6af16a0e008
2020-03-10 17:27:26 -07:00
Zhichao Cao
e62fe50634 Introduce FaultInjectionTestFS to test fault File system instead of Env (#6414)
Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.

A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414

Test Plan: pass make asan_check, pass error_handler_fs_test.

Differential Revision: D20252421

Pulled By: zhichao-cao

fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
2020-03-04 12:35:05 -08:00
Levi Tamasi
d87c10c6ab Add blob file state to VersionEdit (#6416)
Summary:
BlobDB currently does not keep track of blob files: no records are written to
the manifest when a blob file is added or removed, and upon opening a database,
the list of blob files is populated simply based on the contents of the blob directory.
This means that lost blob files cannot be detected at the moment. We plan to solve
this issue by making blob files a part of `Version`; as a first step, this patch makes
it possible to store information about blob files in `VersionEdit`. Currently, this information
includes blob file number, total number and size of all blobs, and total number and size
of garbage blobs. However, the format is extensible: new fields can be added in
both a forward compatible and a forward incompatible manner if needed (similarly
to `kNewFile4`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6416

Test Plan: `make check`

Differential Revision: D19894234

Pulled By: ltamasi

fbshipit-source-id: f9753e1f2aedf6dadb70c09b345207cb9c58c329
2020-02-24 18:39:53 -08:00
sdong
eb367d45c0 Buck config: Re-enable liburing under Linux (#6451)
Summary:
The known bug of liburing has been fixed. Now we can re-enable liburing under Linux
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6451

Test Plan: Watch internal CI

Differential Revision: D20079009

fbshipit-source-id: 04a6f53a900ff721f9a62a188cf906771b5d68d2
2020-02-24 15:47:34 -08:00
sdong
acfee40af5 Remove IO URING compiler flags (#6415)
Summary:
Since IO Uring feature is not stable. Remove it from buck configuration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6415

Test Plan: See internal build pass

Differential Revision: D19892988

fbshipit-source-id: 7fc01efc2af5ed707fb8e4e4674223aeb83cd5ea
2020-02-13 16:51:10 -08:00
Cheng Chang
dafb568052 Add utility class Defer (#6382)
Summary:
Add a utility class `Defer` to defer the execution of a function until the Defer object goes out of scope.
Used in VersionSet:: ProcessManifestWrites as an example.
The inline comments for class `Defer` have more details.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6382

Test Plan: `make defer_test version_set_test && ./defer_test && ./version_set_test`

Differential Revision: D19797538

Pulled By: cheng-chang

fbshipit-source-id: b1a9b7306e4fd4f48ec2ab55783caa561a315f0f
2020-02-10 17:59:47 -08:00
Zhichao Cao
4369f2c7bb Checksum for each SST file and stores in MANIFEST (#6216)
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.

Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string).  A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216

Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.

Differential Revision: D19171461

Pulled By: zhichao-cao

fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
2020-02-10 15:52:52 -08:00
Cheng Chang
b42fa1497f Support move semantics for PinnableSlice (#6374)
Summary:
It's logically correct for PinnableSlice to support move semantics to transfer ownership of the pinned memory region. This PR adds both move constructor and move assignment to PinnableSlice.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6374

Test Plan:
A set of unit tests for the move semantics are added in slice_test.
So `make slice_test && ./slice_test`.

Differential Revision: D19739254

Pulled By: cheng-chang

fbshipit-source-id: f898bd811bb05b2d87384ec58b645e9915e8e0b1
2020-02-07 14:26:26 -08:00
Chad Austin
25fbdc5a31 Fix Buck build on macOS (#6378)
Summary:
liburing is a Linux-specific dependency, so make sure it's configured in the Linux-only Buck rules.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6378

Test Plan:
```
~/fbcode $ cp internal_repo_rocksdb/repo/TARGETS rocksdb/src
~/fbcode $ buck build mode/mac eden
```

Reviewed By: chadaustin

Differential Revision: D19760039

Pulled By: riversand963

fbshipit-source-id: 2abfce81c8b17965ef76012262cd117708e0294f
2020-02-07 14:20:12 -08:00
anand76
afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00
Peter Dillinger
58d46d1915 Add useful idioms to Random API (OneInOpt, PercentTrue) (#6154)
Summary:
And clean up related code, especially in stress test.

(More clean up of db_stress_test_base.cc coming after this.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6154

Test Plan: make check, make blackbox_crash_test for a bit

Differential Revision: D18938180

Pulled By: pdillinger

fbshipit-source-id: 524d27621b8dbb25f6dff40f1081e7c00630357e
2019-12-13 14:30:14 -08:00
sdong
7d79b32618 Break db_stress_tool.cc to a list of source files (#6134)
Summary:
db_stress_tool.cc now is a giant file. In order to main it easier to improve and maintain, break it down to multiple source files.
Most classes are turned into their own files. Separate .h and .cc files are created for gflag definiations. Another .h and .cc files are created for some common functions. Some test execution logic that is only loosely related to class StressTest is moved to db_stress_driver.h and db_stress_driver.cc. All the files are located under db_stress_tool/. The directory name is created as such because if we end it with either stress or test, .gitignore will ignore any file under it and makes it prone to issues in developements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6134

Test Plan: Build under GCC7 with and without LITE on using GNU Make. Build with GCC 4.8. Build with cmake with -DWITH_TOOL=1

Differential Revision: D18876064

fbshipit-source-id: b25d0a7451840f31ac0f5ebb0068785f783fdf7d
2019-12-08 23:51:01 -08:00
sdong
e3a82bb934 PosixRandomAccessFile::MultiRead() to use I/O uring if supported (#5881)
Summary:
Right now, PosixRandomAccessFile::MultiRead() executes read requests in parallel. In this PR, it leverages I/O Uring library to run it in parallel, even when page cache is enabled. This function will fall back if the kernel version doesn't support it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5881

Test Plan: Run the unit test on a kernel version supporting it and make sure all tests pass, and run a unit test on kernel version supporting it and see it pass. Before merging, will also run stress test and see it passes.

Differential Revision: D17742266

fbshipit-source-id: e05699c925ac04fdb42379456a4e23e4ebcb803a
2019-12-07 20:55:52 -08:00
Peter Dillinger
00d58a370e Abandon use of folly::Optional (#6036)
Summary:
Had complications with LITE build and valgrind test.
Reverts/fixes small parts of PR https://github.com/facebook/rocksdb/issues/6007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6036

Test Plan:
make LITE=1 all check
and
ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 make -j24 db_bloom_filter_test && ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 ./db_bloom_filter_test

Differential Revision: D18512238

Pulled By: pdillinger

fbshipit-source-id: 37213cf0d309edf11c483fb4b2fb6c02c2cf2b28
2019-11-14 14:04:15 -08:00
Peter Dillinger
f059c7d9b9 New Bloom filter implementation for full and partitioned filters (#6007)
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.

Speed

The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.

Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):

$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
  Single filter net ns/op: 26.2825
  Random filter net ns/op: 150.459
    Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
  Single filter net ns/op: 63.2978
  Random filter net ns/op: 188.038
    Average FP rate %: 1.13823

Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.

The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.

Accuracy

The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.

Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)

Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120

Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.

Compatibility

Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007

Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).

Differential Revision: D18294749

Pulled By: pdillinger

fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
2019-11-13 16:44:01 -08:00
Peter Dillinger
834feaff05 Add clarifying/instructive header to TARGETS and defs.bzl
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6008

Differential Revision: D18343273

Pulled By: pdillinger

fbshipit-source-id: f7d1c78d711bbfb0deea9ec88212c19ab2ec91b8
2019-11-05 20:20:33 -08:00
Peter Dillinger
ec11eff3bc FilterPolicy consolidation, part 2/2 (#5966)
Summary:
The parts that are used to implement FilterPolicy /
NewBloomFilterPolicy and not used other than for the block-based table
should be consolidated under table/block_based/filter_policy*.

This change is step 2 of 2:
mv util/bloom.cc table/block_based/filter_policy.cc
This gets its own PR so that git has the best chance of following the
rename for blame purposes. Note that low-level shared implementation
details of Bloom filters remain in util/bloom_impl.h, and
util/bloom_test.cc remains where it is for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5966

Test Plan: make check

Differential Revision: D18124930

Pulled By: pdillinger

fbshipit-source-id: 823bc09025b3395f092ef46a46aa5ba92a914d84
2019-10-24 15:44:51 -07:00
Peter Dillinger
dd19014a7a FilterPolicy consolidation, part 1/2 (#5963)
Summary:
The parts that are used to implement FilterPolicy /
NewBloomFilterPolicy and not used other than for the block-based table
should be consolidated under table/block_based/filter_policy*. I don't
foresee sharing these APIs with e.g. the Plain Table because they don't
expose hashes for reuse in indexing.

This change is step 1 of 2:
(a) mv table/full_filter_bits_builder.h to
table/block_based/filter_policy_internal.h which I expect to expand
soon to internally reveal more implementation details for testing.
(b) consolidate eventual contents of table/block_based/filter_policy.cc
in util/bloom.cc, which has the most elaborate revision history
(see step 2 ...)

Step 2 soon to follow:
mv util/bloom.cc table/block_based/filter_policy.cc
This gets its own PR so that git has the best chance of following the
rename for blame purposes. Note that low-level shared implementation
details of Bloom filters are in util/bloom_impl.h.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5963

Test Plan: make check

Differential Revision: D18121199

Pulled By: pdillinger

fbshipit-source-id: 8f21732c3d8909777e3240e4ac3123d73140326a
2019-10-24 13:20:35 -07:00
Yanqin Jin
925250f42f Include db_stress_tool in rocksdb tools lib (#5950)
Summary:
include db_stress_tool in rocksdb tools lib

Test Plan (on devserver):
```
$make db_stress
$./db_stress
$make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5950

Differential Revision: D18044399

Pulled By: riversand963

fbshipit-source-id: 895585abbbdfd8b954965921dba4b1400b7af1b1
2019-10-21 19:40:35 -07:00
Levi Tamasi
29ccf2075c Store the filter bits reader alongside the filter block contents (#5936)
Summary:
Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that
only the filter block contents are stored in the block cache (as opposed to the
earlier design where the cache stored the filter block reader itself, leading to
potentially dangling pointers and concurrency bugs). However, this change
introduced a performance hit since with the new code, the metadata fields are
re-parsed upon every access. This patch reunites the block contents with the
filter bits reader to eliminate this overhead; since this is still a self-contained
pure data object, it is safe to store it in the cache. (Note: this is similar to how
the zstd digest is handled.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936

Test Plan:
make asan_check

filter_bench results for the old code:

```
$ ./filter_bench -quick
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 26.7153
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
  Dry run (46b) ns/op: 33.4258
  Single filter ns/op: 42.5974
  Random filter ns/op: 217.861
----------------------------
Outside queries...
  Dry run (25d) ns/op: 32.4217
  Single filter ns/op: 50.9855
  Random filter ns/op: 219.167
    Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)

$ ./filter_bench -quick -use_full_block_reader
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 26.5172
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
  Dry run (46b) ns/op: 32.3556
  Single filter ns/op: 83.2239
  Random filter ns/op: 370.676
----------------------------
Outside queries...
  Dry run (25d) ns/op: 32.2265
  Single filter ns/op: 93.5651
  Random filter ns/op: 408.393
    Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)
```

With the new code:

```
$ ./filter_bench -quick
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 25.4285
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
  Dry run (46b) ns/op: 31.0594
  Single filter ns/op: 43.8974
  Random filter ns/op: 226.075
----------------------------
Outside queries...
  Dry run (25d) ns/op: 31.0295
  Single filter ns/op: 50.3824
  Random filter ns/op: 226.805
    Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)

$ ./filter_bench -quick -use_full_block_reader
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 26.5308
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
  Dry run (46b) ns/op: 33.2968
  Single filter ns/op: 58.6163
  Random filter ns/op: 291.434
----------------------------
Outside queries...
  Dry run (25d) ns/op: 32.1839
  Single filter ns/op: 66.9039
  Random filter ns/op: 292.828
    Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)
```

Differential Revision: D17991712

Pulled By: ltamasi

fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29
2019-10-18 19:32:59 -07:00
sdong
b931f84e56 Divide file_reader_writer.h and .cc (#5803)
Summary:
file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803

Test Plan: Build whole project using make and cmake.

Differential Revision: D17374550

fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
2019-09-16 10:33:51 -07:00
anand76
83a6a614e9 Refactor ArenaWrappedDBIter into separate files (#5801)
Summary:
Move definition and implementation for ArenaWrappedDBIter into its own .h/.cc files. Also, change inlining of functions to better comply with the Google C++ style guide.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5801

Test Plan: make check

Differential Revision: D17371012

Pulled By: anand1976

fbshipit-source-id: c1361abc2851575111e357a63d88be3b3d6cb341
2019-09-13 13:50:43 -07:00
Peter Dillinger
20dec1401f Copy/split PlainTableBloomV1 from DynamicBloom (refactor) (#5767)
Summary:
DynamicBloom was being used both for memory-only and for on-disk filters, as part of the PlainTable format. To set up enhancements to the memtable Bloom filter, this splits the code into two copies and removes unused features from each copy. Adds test PlainTableDBTest.BloomSchema to ensure no accidental change to that format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5767

Differential Revision: D17206963

Pulled By: pdillinger

fbshipit-source-id: 6cce8d55305ed0df051b4c58bdc98c8ad81d0553
2019-09-05 10:05:20 -07:00
Zhongyi Xie
2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
Vijay Nadimpalli
d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
Yanqin Jin
30edf1874c Change buckifier to support parameterized dependencies (#5648)
Summary:
Users may desire to specify extra dependencies via buck. This PR allows users to pass additional dependencies as a JSON object so that the buckifier script can generate TARGETS file with desired extra dependencies.

Test plan (on dev server)
```
$python buckifier/buckify_rocksdb.py '{"fake": {"extra_deps": [":test_dep", "//fakes/module:mock1"], "extra_compiler_flags": ["-DROCKSDB_LITE", "-Os"]}}'
Generating TARGETS
Extra dependencies:
{'': {'extra_compiler_flags': [], 'extra_deps': []}, 'test_dep1': {'extra_compiler_flags': ['-O2', '-DROCKSDB_LITE'], 'extra_deps': [':fake', '//dep1/mock']}}
Generated TARGETS Summary:
- 5 libs
- 0 binarys
- 296 tests
```
Verify the TARGETS file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5648

Differential Revision: D16565043

Pulled By: riversand963

fbshipit-source-id: a6ef02274174fcf159692d7b846e828454d01e89
2019-08-02 10:55:17 -07:00
haoyuhuang
70c7302fb5 Block cache simulator: Add pysim to simulate caches using reinforcement learning. (#5610)
Summary:
This PR implements cache eviction using reinforcement learning. It includes two implementations:
1. An implementation of Thompson Sampling for the Bernoulli Bandit [1].
2. An implementation of LinUCB with disjoint linear models [2].

The idea is that a cache uses multiple eviction policies, e.g., MRU, LRU, and LFU. The cache learns which eviction policy is the best and uses it upon a cache miss.
Thompson Sampling is contextless and does not include any features.
LinUCB includes features such as level, block type, caller, column family id to decide which eviction policy to use.

[1] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018. A Tutorial on Thompson Sampling. Found. Trends Mach. Learn. 11, 1 (July 2018), 1-96. DOI: https://doi.org/10.1561/2200000070
[2] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web (WWW '10). ACM, New York, NY, USA, 661-670. DOI=http://dx.doi.org/10.1145/1772690.1772758
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5610

Differential Revision: D16435067

Pulled By: HaoyuHuang

fbshipit-source-id: 6549239ae14115c01cb1e70548af9e46d8dc21bb
2019-07-26 14:41:13 -07:00
Levi Tamasi
3617287e0e Parallelize db_bloom_filter_test (#5632)
Summary:
This test frequently times out under TSAN; parallelizing it should fix
this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5632

Test Plan:
make check
buck test mode/dev-tsan internal_repo_rocksdb/repo:db_bloom_filter_test

Differential Revision: D16519399

Pulled By: ltamasi

fbshipit-source-id: 66e05a644d6f79c6d544255ffcf6de195d2d62fe
2019-07-26 11:48:17 -07:00
Chad Austin
0d16fad51b rocksdb: build on macosx
Summary:
Make rocksdb build on macos:
1) Reorganize OS-specific flags and deps in rocksdb/src/TARGETS
2) Sandbox fbcode apple platform builds from repo root include path (which conflicts
    with layout of rocksdb headers).
3) Fix dep-translation for bzip2.

Reviewed By: andrewjcg

Differential Revision: D15125826

fbshipit-source-id: 8e143c689b88b5727e54881a5e80500f879a320b
2019-07-25 11:45:54 -07:00
Mark Rambacher
cfcf045acc The ObjectRegistry class replaces the Registrar and NewCustomObjects.… (#5293)
Summary:
The ObjectRegistry class replaces the Registrar and NewCustomObjects.  Objects are registered with the registry by Type (the class must implement the static const char *Type() method).

This change is necessary for a few reasons:
- By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
- By having a class with instances, different units could have different objects registered.  This could be useful if, for example, one Option allowed for a dynamic library and one did not.

When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.

Test plan (on riversand963's  devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
```
All tests pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293

Differential Revision: D16363396

Pulled By: riversand963

fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
2019-07-23 17:13:05 -07:00
Levi Tamasi
092f417037 Move the uncompression dictionary object out of the block cache (#5584)
Summary:
RocksDB has historically stored uncompression dictionary objects in the block
cache as opposed to storing just the block contents. This neccesitated
evicting the object upon table close. With the new code, only the raw blocks
are stored in the cache, eliminating the need for eviction.

In addition, the patch makes the following improvements:

1) Compression dictionary blocks are now prefetched/pinned similarly to
index/filter blocks.
2) A copy operation got eliminated when the uncompression dictionary is
retrieved.
3) Errors related to retrieving the uncompression dictionary are propagated as
opposed to silently ignored.

Note: the patch temporarily breaks the compression dictionary evicition stats.
They will be fixed in a separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584

Test Plan: make asan_check

Differential Revision: D16344151

Pulled By: ltamasi

fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b
2019-07-23 16:01:44 -07:00
anand76
112702ac6c Parallelize file_reader_writer_test in order to reduce timeouts
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5608

Test Plan:
make check
buck test mode/dev-tsan internal_repo_rocksdb/repo:file_reader_writer_test -- --run-disabled

Differential Revision: D16441796

Pulled By: anand1976

fbshipit-source-id: afbb88a9fcb1c0ba22215118767e8eab3d1d6a4a
2019-07-23 11:50:10 -07:00
Venki Pallipadi
22ce462450 Export Import sst files (#5495)
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135

This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"

We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.

(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
//   is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                  const std::string& export_dir,
                                  ExportImportFilesMetaData** metadata);

Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.

(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
//     an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
    const ColumnFamilyOptions& options, const std::string& column_family_name,
    const ImportColumnFamilyOptions& import_options,
    const ExportImportFilesMetaData& metadata,
    ColumnFamilyHandle** handle);

Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.

If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.

Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495

Differential Revision: D16018881

fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
2019-07-17 12:27:14 -07:00
Levi Tamasi
3bde41b5a3 Move the filter readers out of the block cache (#5504)
Summary:
Currently, when the block cache is used for the filter block, it is not
really the block itself that is stored in the cache but a FilterBlockReader
object. Since this object is not pure data (it has, for instance, pointers that
might dangle, including in one case a back pointer to the TableReader), it's not
really sharable. To avoid the issues around this, the current code erases the
cache entries when the TableReader is closed (which, BTW, is not sufficient
since a concurrent TableReader might have picked up the object in the meantime).
Instead of doing this, the patch moves the FilterBlockReader out of the cache
altogether, and decouples the filter reader object from the filter block.
In particular, instead of the TableReader owning, or caching/pinning the
FilterBlockReader (based on the customer's settings), with the change the
TableReader unconditionally owns the FilterBlockReader, which in turn
owns/caches/pins the filter block. This change also enables us to reuse the code
paths historically used for data blocks for filters as well.

Note:
Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504

Test Plan: make asan_check

Differential Revision: D16036974

Pulled By: ltamasi

fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
2019-07-16 13:14:58 -07:00
haoyuhuang
1a59b6e2a9 Cache simulator: Add a ghost cache for admission control and a hybrid row-block cache. (#5534)
Summary:
This PR adds a ghost cache for admission control. Specifically, it admits an entry on its second access.
It also adds a hybrid row-block cache that caches the referenced key-value pairs of a Get/MultiGet request instead of its blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5534

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16101124

Pulled By: HaoyuHuang

fbshipit-source-id: b99edda6418a888e94eb40f71ece45d375e234b1
2019-07-11 12:43:29 -07:00
ggaurav28
60d8b19836 Implemented a file logger that uses WritableFileWriter (#5491)
Summary:
Current PosixLogger performs IO operations using posix calls. Thus the
current implementation will not work for non-posix env. Created a new
logger class EnvLogger that uses env specific WritableFileWriter for IO operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5491

Test Plan: make check

Differential Revision: D15909002

Pulled By: ggaurav28

fbshipit-source-id: 13a8105176e8e42db0c59798d48cb6a0dbccc965
2019-07-09 16:27:22 -07:00
haoyuhuang
9f0bd56889 Cache simulator: Refactor the cache simulator so that we can add alternative policies easily (#5517)
Summary:
This PR creates cache_simulator.h file. It contains a CacheSimulator that runs against a block cache trace record. We can add alternative cache simulators derived from CacheSimulator later. For example, this PR adds a PrioritizedCacheSimulator that inserts filter/index/uncompressed dictionary blocks with high priority.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5517

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16043689

Pulled By: HaoyuHuang

fbshipit-source-id: 65f28ed52b866ffb0e6eceffd7f9ca7c45bb680d
2019-07-01 12:46:32 -07:00
Zhongyi Xie
671d15cbdd Persistent Stats: persist stats history to disk (#5046)
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and  both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046

Differential Revision: D15863138

Pulled By: miasantreble

fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
2019-06-17 15:21:50 -07:00
Maysam Yabandeh
ee294c24ed Make db_bloom_filter_test parallel (#5467)
Summary:
When run under TSAN it sometimes goes over 10m and times out. The slowest ones are `DBBloomFilterTestWithParam.BloomFilter` which we have 6 of them. Making the tests run in parallel should take care of the timeout issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5467

Differential Revision: D15856912

Pulled By: maysamyabandeh

fbshipit-source-id: 26c43c55312974c1b809c070342dee037d0219f4
2019-06-17 11:13:45 -07:00
haoyuhuang
bb4178066d Integrate block cache tracer into db_impl (#5433)
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433

Differential Revision: D15728016

Pulled By: HaoyuHuang

fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
2019-06-13 15:43:10 -07:00
Mark Rambacher
c8267120d8 Add support for loading dynamic libraries into the RocksDB environment (#5281)
Summary:
This change adds a Dynamic Library class to the RocksDB Env.  Dynamic libraries are populated via the  Env::LoadLibrary method.

The addition of dynamic library support allows for a few different features to be developed:
1.  The compression code can be changed to use dynamic library support.  This would allow RocksDB to determine at run-time what compression packages were installed.  This change would eliminate the need to make sure the build-time and run-time environment had the same library set.  It would also simplify some of the Java build issues (where it attempts to build and include various packages inside the RocksDB jars).

2.  Along with other features (to be provided in a subsequent PR), this change would allow code/configurations to be added to RocksDB at run-time.  For example, the build system includes code for building an "rados" environment and adding "Cassandra" features.  Instead of these extensions being built into the base RocksDB code, these extensions could be loaded at run-time as required/appropriate, either by configuration or explicitly.

We intend to push out other changes in support of the extending RocksDB at run-time via configurations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5281

Differential Revision: D15447613

Pulled By: riversand963

fbshipit-source-id: 452cd4f54511c0bceee18f6d9d919aae9fd25fef
2019-06-03 23:02:56 -07:00
Siying Dong
5851cb7fdb Move util/trace_replay.* to trace_replay/ (#5376)
Summary:
util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376

Differential Revision: D15550938

Pulled By: siying

fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207
2019-06-03 13:25:26 -07:00
Siying Dong
000b9ec217 Move some logging related files to logging/ (#5387)
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387

Differential Revision: D15579036

Pulled By: siying

fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
2019-05-31 17:23:59 -07:00
Vijay Nadimpalli
49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Zhongyi Xie
ab8f6c01a6 move LevelCompactionPicker to a separate file (#5369)
Summary:
In order to improve code readability, this PR moves LevelCompactionBuilder and LevelCompactionPicker to compaction_picker_level.h and .cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5369

Differential Revision: D15540172

Pulled By: miasantreble

fbshipit-source-id: c1a578b93f127cd63661b53f32b356e6edd349af
2019-05-30 21:38:24 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Vijay Nadimpalli
50e470791d Organizing rocksdb/table directory by format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373

Differential Revision: D15559425

Pulled By: vjnadimpalli

fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
2019-05-30 14:51:11 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Siying Dong
545d206040 Move some file related files outside util/ (#5375)
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375

Differential Revision: D15550935

Pulled By: siying

fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
2019-05-29 20:47:06 -07:00
Yanqin Jin
9358178edc Support for single-primary, multi-secondary instances (#4899)
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.

This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.

2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.

3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.

4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899

Differential Revision: D14510945

Pulled By: riversand963

fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
2019-03-26 16:45:31 -07:00
Zhongyi Xie
c4f5d0aa15 add GetStatsHistory to retrieve stats snapshots (#4748)
Summary:
This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.

(This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748

Differential Revision: D13961471

Pulled By: miasantreble

fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
2019-02-20 15:52:54 -08:00
Yanqin Jin
5af9446ee6 Remove Lua compaction filter from RocksDB main repo (#4971)
Summary:
as title. For people who continue to need Lua compaction filter, you
can copy the include/rocksdb/utilities/rocks_lua/lua_compaction_filter.h and
utilities/lua/rocks_lua_compaction_filter.cc to your own codebase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4971

Differential Revision: D14047468

Pulled By: riversand963

fbshipit-source-id: 9ad1a6484a7c94e478f1e108127a3184e4069f70
2019-02-13 12:42:44 -08:00
Siying Dong
cf3a671733 Remove cuckoo hash memtable (#4953)
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953

Differential Revision: D13979264

Pulled By: siying

fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
2019-02-07 16:15:27 -08:00
Andrew Kryczka
8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
Philip Jameson
b2ba06856a Add load() statements to TARGETS files
Reviewed By: luciang

Differential Revision: D13733578

fbshipit-source-id: 556c115935aa42c1da85ec0e91199b9f198fc467
2019-01-22 15:24:51 -08:00
Siying Dong
1fb2e274c5 Remove some components (#4101)
Summary:
Remove some components that we never heard people using them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4101

Differential Revision: D8825431

Pulled By: siying

fbshipit-source-id: 97a12ad3cad4ab12c82741a5ba49669aaa854180
2019-01-10 13:30:09 -08:00
Philip Jameson
a3ab59d030 Fix skylark incompatible build files in rocksdb
Reviewed By: ttsugriy

Differential Revision: D13583867

fbshipit-source-id: 8f218a9ffd9807d386ba0adc966af2a9a48ac64c
2019-01-07 13:37:40 -08:00
Yi Wu
77a8d4d476 Detect if Jemalloc is linked with the binary (#4844)
Summary:
Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detect jemalloc, but we assume the main program use jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844

Differential Revision: D13574934

Pulled By: yiwu-arbug

fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db
2019-01-03 16:30:12 -08:00
Abhishek Madan
81b6b09f6b Remove v1 RangeDelAggregator (#4778)
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778

Differential Revision: D13495930

Pulled By: abhimadan

fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
2018-12-17 17:33:46 -08:00
Burton Li
a8b9891f95 Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918

We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.

With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.

ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.

The usage is straight forward:
e.g.:

//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...

//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0);  // full throttle (0 task)

//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);

// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;

// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;

...

//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332

Differential Revision: D13226590

Pulled By: siying

fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 13:18:28 -08:00
anand76
e58d76955a Fix buck dev mode fbcode builds (#4747)
Summary:
Don't enable ROCKSDB_JEMALLOC unless the build mode is opt and default
allocator is jemalloc. In dev mode, this is causing compile/link errors such as -
```
stderr: buck-out/dev/gen/rocksdb/src/rocksdb_lib#compile-pic-malloc_stats.cc.o4768b59e,gcc-5-glibc-2.23-clang/db/malloc_stats.cc.o:malloc_stats.cc:function rocksdb::DumpMallocStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*): error: undefined reference to 'malloc_stats_print'
clang-7.0: error: linker command failed with exit code 1
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4747

Differential Revision: D13324840

Pulled By: anand1976

fbshipit-source-id: 45ffbd4f63fe4d9e8a0473d8f066155e4ef64a14
2018-12-05 10:40:31 -08:00
Sagar Vemuri
70645355ad Move FIFOCompactionPicker to a separate file (#4724)
Summary:
**Summary:**
Simplified the code layout by moving FIFOCompactionPicker to a separate file.
**Why?:**
While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724

Differential Revision: D13227914

Pulled By: sagar0

fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3
2018-11-29 16:04:52 -08:00
Huachao Huang
5e72bc113a Add SstFileReader to read sst files (#4717)
Summary:
A user friendly sst file reader is useful when we want to access sst
files outside of RocksDB. For example, we can generate an sst file
with SstFileWriter and send it to other places, then use SstFileReader
to read the file and process the entries in other ways.

Also rename the original SstFileReader to SstFileDumper because of
name conflict, and seems SstFileDumper is more appropriate for tools.

TODO: there is only a very simple test now, because I want to get some feedback first.
If the changes look good, I will add more tests soon.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717

Differential Revision: D13212686

Pulled By: ajkr

fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
2018-11-27 13:02:23 -08:00
Abhishek Madan
457f77b9ff Introduce RangeDelAggregatorV2 (#4649)
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649

Differential Revision: D13146964

Pulled By: abhimadan

fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
2018-11-21 10:56:45 -08:00
Yi Wu
5f5fddabc7 port folly::JemallocNodumpAllocator (#4534)
Summary:
Introduce `JemallocNodumpAllocator`, which allow exclusion of block cache usage from core dump. It utilize custom hook of jemalloc arena, and when jemalloc arena request memory from system, the allocator use the hook to set `MADV_DONTDUMP ` to the memory. The implementation is basically the same as `folly::JemallocNodumpAllocator`, except for some minor difference:
1. It only support jemalloc >= 5.0
2. When the allocator destruct, it explicitly destruct the corresponding arena via `arena.<i>.destroy` via `mallctl`.

Depending on #4502.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4534

Differential Revision: D10435474

Pulled By: yiwu-arbug

fbshipit-source-id: e80edea755d3853182485d2be710376384ce0bb4
2018-10-26 17:29:18 -07:00
Abhishek Madan
8c78348c77 Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:31:12 -07:00
Philip Jameson
56e129da01 Remove usages of headers attribute as a string
Reviewed By: andrewjcg

Differential Revision: D10409082

fbshipit-source-id: a1432270f79c2baf2e52e3351b5f481c7398c58d
2018-10-18 13:59:08 -07:00
Yi Wu
f60c4e5a58 Set -DROCKSDB_JEMALLOC for buck build if jemalloc presents (#4489)
Summary:
Set the macro if default allocator is jemalloc. It doesn't handle the case when allocator is specified, e.g.
```
cpp_binary(
    name="xxx"
    allocator="jemalloc", # or "malloc" or something else
    deps=["//rocksdb:rocksdb"],
)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4489

Differential Revision: D10363683

Pulled By: yiwu-arbug

fbshipit-source-id: 5da490336a8e78e0feb0900c29e8036e7ec6f12b
2018-10-15 11:41:47 -07:00
Yi Wu
d6f2ecf49c Utility to run task periodically in a thread (#4423)
Summary:
Introduce `RepeatableThread` utility to run task periodically in a separate thread. It is basically the same as the the same class in fbcode, and in addition provide a helper method to let tests mock time and trigger execution one at a time.

We can use this class to replace `TimerQueue` in #4382 and `BlobDB`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4423

Differential Revision: D10020932

Pulled By: yiwu-arbug

fbshipit-source-id: 3616bef108c39a33c92eedb1256de424b7c04087
2018-09-27 15:28:00 -07:00
Yi Wu
31d46993cc Update TARGETS file template (#4426)
Summary:
Update template of TARGETS file according to recent changes in #4371 , #4363 and dbf44c314b.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4426

Differential Revision: D10025053

Pulled By: yiwu-arbug

fbshipit-source-id: e6a0a702bfd401fc1af240ee446f5690f0bcd85d
2018-09-25 14:14:01 -07:00
Abhishek Madan
1626f6ab6b Add RangeDelAggregator microbenchmarks (#4363)
Summary:
To measure the results of upcoming DeleteRange v2 work, this commit adds
simple benchmarks for RangeDelAggregator. It measures the average time
for AddTombstones and ShouldDelete calls.

Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results:

Before #4014:
```
=======================
Results:
=======================
AddTombstones:          1356.28 us
ShouldDelete:           0.401732 us
```

Latest master:
```
=======================
Results:
=======================
AddTombstones:          740.82 us
ShouldDelete:           0.383271 us
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363

Differential Revision: D9881676

Pulled By: abhimadan

fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442
2018-09-17 14:58:31 -07:00
Anand Ananthabhotla
dfda91027b Remove trace_analyzer_tool.cc from rocksdb_lib buck target (#4371)
Summary:
Including tools/trace_analyzer_tool.cc in rocksdb_lib was causing conflicts in dependent binaries due to duplicate gflag (other_prefix).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4371

Differential Revision: D9846953

Pulled By: anand1976

fbshipit-source-id: 80b4aa36ab8428b8f6dceb896c45532684102709
2018-09-15 19:58:13 -07:00
Philip Jameson
dbf44c314b Lint TARGETS files with buildifier
Summary: Build file formatting

Reviewed By: mzlee

Differential Revision: D9728238

fbshipit-source-id: 99a266d5d2260eabfd63a200b2994c6850b59cf4
2018-09-11 14:58:19 -07:00
Philip Jameson
a876995ed4 Grab straggler files to explicitly import AutoHeaders
Summary: There were a few files that were missed when AutoHeaders were moved to their own file. Add explicit loads

Reviewed By: yfeldblum

Differential Revision: D9499942

fbshipit-source-id: 942bf3a683b8961e1b6244136f6337477dcc45af
2018-08-28 21:28:55 -07:00
Yi Wu
a6d3de4e7a BlobDB: Implement DisableFileDeletions (#4314)
Summary:
`DB::DiableFileDeletions` and `DB::EnableFileDeletions` are used for applications to stop RocksDB background jobs to delete files while they are doing replication. Implement these methods for BlobDB. `DeleteObsolteFiles` now needs to check `disable_file_deletions_` before starting, and will hold `delete_file_mutex_` the whole time while it is running. `DisableFileDeletions` needs to wait on `delete_file_mutex_` for running `DeleteObsolteFiles` job and set `disable_file_deletions_` flag.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4314

Differential Revision: D9501373

Pulled By: yiwu-arbug

fbshipit-source-id: 81064c1228f1724eff46da22b50ff765b16292cd
2018-08-27 10:58:29 -07:00
Fenggang Wu
19ec44fd39 Improve point-lookup performance using a data block hash index (#4174)
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`

The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174

Differential Revision: D8965914

Pulled By: fgwu

fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
2018-08-15 14:30:03 -07:00
Zhichao Cao
999d955e4f RocksDB Trace Analyzer (#4091)
Summary:
A framework of trace analyzing for RocksDB

After collecting the trace by using the tool of [PR #3837](https://github.com/facebook/rocksdb/pull/3837). User can use the Trace Analyzer to interpret, analyze, and characterize the collected workload.
**Input:**
1. trace file
2. Whole keys space file

**Statistics:**
1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family.
2. Key hotness (access count) of each one
3. Key space separation based on given prefix
4. Key size distribution
5. Value size distribution if appliable
6. Top K accessed keys
7. QPS statistics including the average QPS and peak QPS
8. Top K accessed prefix
9. The query correlation analyzing, output the number of X after Y and the corresponding average time
    intervals

**Output:**
1. key access heat map (either in the accessed key space or whole key space)
2. trace sequence file (interpret the raw trace file to line base text file for future use)
3. Time serial (The key space ID and its access time)
4. Key access count distritbution
5. Key size distribution
6. Value size distribution (in each intervals)
7. whole key space separation by the prefix
8. Accessed key space separation by the prefix
9. QPS of each operation and each column family
10. Top K QPS and their accessed prefix range

**Test:**
1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge
2. Generated the trace and analyze the trace

**Implemented but not tested (due to the limitation of trace_replay):**
1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing
2. Analyzing the number of Key found by Get

**Future Work:**
1.  Support execution time analyzing of each requests
2.  Support cache hit situation and block read situation of Get
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091

Differential Revision: D9256157

Pulled By: zhichao-cao

fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6
2018-08-13 11:44:02 -07:00
Yi Wu
140f256da2 BlobDB: Cleanup TTLExtractor interface (#4229)
Summary:
Cleanup TTLExtractor interface. The original purpose of it is to allow our users keep using existing `Write()` interface but allow it to accept TTL via `TTLExtractor`. However the interface is confusing. Will replace it with something like `WriteWithTTL(batch, ttl)` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4229

Differential Revision: D9174390

Pulled By: yiwu-arbug

fbshipit-source-id: 68201703d784408b851336ab4dd9b84188245b2d
2018-08-06 11:58:05 -07:00
Sagar Vemuri
12b6cdeed3 Trace and Replay for RocksDB (#3837)
Summary:
A framework for tracing and replaying RocksDB operations.

A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.

- Column-families are supported
- Multi-threaded tracing is supported.
- TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
- This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.

Currently supported DB operations:
- Writes:
-- Put
-- Merge
-- Delete
-- SingleDelete
-- DeleteRange
-- Write
- Reads:
-- Get (point lookups)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837

Differential Revision: D7974837

Pulled By: sagar0

fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
2018-08-01 00:27:08 -07:00
Fenggang Wu
8805ec2f49 DataBlockHashIndex: Standalone Implementation with Unit Test (#4139)
Summary:
The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested.

`DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called.
`DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4139

Reviewed By: sagar0

Differential Revision: D8866764

Pulled By: fgwu

fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10
2018-07-24 11:43:37 -07:00
Chang Su
374c37da5b move static msgs out of Status class (#4144)
Summary:
The member msgs of class Status contains all types of status messages.
When users dump a Status object, msgs will confuse users. So move it out
of class Status by making it as file-local static variable.

Closes #3831 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4144

Differential Revision: D8941419

Pulled By: sagar0

fbshipit-source-id: 56b0510258465ff26db15aa6b04e01532e053e3d
2018-07-23 15:44:16 -07:00
Siying Dong
ddc07b40fc Remove managed iterator
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4124

Differential Revision: D8829910

Pulled By: siying

fbshipit-source-id: f3e952ccf3a631071a5d77c48e327046f8abb560
2018-07-17 14:43:18 -07:00
Siying Dong
a61ff876a1 Remove two CI tests (#4110)
Summary:
Two CI tests never pass because of the environment problem. Delete them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4110

Differential Revision: D8805713

Pulled By: siying

fbshipit-source-id: 6eb4813dc2094ee2045ec8ede7fe8967d546d6e8
2018-07-12 11:43:25 -07:00
Anand Ananthabhotla
52d4c9b7f6 Allow DB resume after background errors (#3997)
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance

This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
Closes https://github.com/facebook/rocksdb/pull/3997

Differential Revision: D8653831

Pulled By: anand1976

fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
2018-06-28 12:34:40 -07:00
Yi Wu
58c221440c Update TARGETS file (#4028)
Summary:
-Wshorten-64-to-32 is invalid flag in fbcode. Changing it to -Warrowing.
Closes https://github.com/facebook/rocksdb/pull/4028

Differential Revision: D8553694

Pulled By: yiwu-arbug

fbshipit-source-id: 1523cbcb4c76cf1d2b10a4d28b5f58c78e6cb876
2018-06-21 14:42:39 -07:00
Dmitri Smirnov
f4b72d7056 Provide a way to override windows memory allocator with jemalloc for ZSTD
Summary:
Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.

For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.

The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838

Differential Revision: D8229794

Pulled By: miasantreble

fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
2018-06-04 12:12:48 -07:00