A library that provides an embeddable, persistent key-value store for fast storage.
Go to file
Mike Kolupaev b4d7209428 Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.

This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289

Differential Revision: D15256423

Pulled By: al13n321

fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:54:04 -07:00
buckifier Add support for loading dynamic libraries into the RocksDB environment (#5281) 2019-06-03 23:02:56 -07:00
build_tools build on ARM64 (#5450) 2019-06-18 11:27:45 -07:00
cache Export Cache::GetCharge (#5476) 2019-06-18 17:35:41 -07:00
cmake Make FindZLIB consistent with official definitions (#4823) 2019-01-02 12:49:57 -08:00
coverage Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
db Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
docs Text lint all .gitignore files 2019-05-15 11:37:27 -07:00
env Fix AlignedBuffer's usage in Encryption Env (#5396) 2019-06-19 16:46:20 -07:00
examples simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
file simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
hdfs Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
include/rocksdb Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
java Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
logging simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
memory Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
memtable simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
monitoring fix rocksdb lite and clang contrun test failures (#5477) 2019-06-17 21:16:29 -07:00
options Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
port fix compilation error on MSVC (#5458) 2019-06-14 11:28:13 -07:00
table Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
test_util Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
third-party/gtest-1.7.0/fused-src/gtest remove bundled but unused fbson library (#5108) 2019-03-26 16:37:52 -07:00
tools Block cache trace analysis: Write time series graphs in csv files (#5490) 2019-06-24 20:42:12 -07:00
trace_replay Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
util Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
utilities Fix segfalut in ~DBWithTTLImpl() when called after Close() (#5485) 2019-06-20 13:08:17 -07:00
.clang-format A script that automatically reformat affected lines 2014-01-14 12:21:24 -08:00
.gitignore Add support for loading dynamic libraries into the RocksDB environment (#5281) 2019-06-03 23:02:56 -07:00
.lgtm.yml Create lgtm.yml for LGTM.com C/C++ analysis (#4058) 2018-06-26 12:43:04 -07:00
.travis.yml Switch Travis to Xenial build (#4789) 2019-06-17 10:20:02 -07:00
appveyor.yml Also build compression libraries on AppVeyor CI (#5226) 2019-06-24 10:41:07 -07:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Persistent Stats: persist stats history to disk (#5046) 2019-06-17 15:21:50 -07:00
CODE_OF_CONDUCT.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md options.delayed_write_rate use the rate of rate_limiter by default. 2017-05-24 09:58:24 -07:00
defs.bzl Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
INSTALL.md Update the version of the dependencies used by the RocksJava static build (#4761) 2018-12-18 20:25:43 -08:00
issue_template.md Add a template for issues 2017-09-29 11:41:28 -07:00
LANGUAGE-BINDINGS.md LANGUAGE-BINDINGS.md: mention python-rocksdb 2019-03-20 11:10:48 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Update the version of ZStd for the Rocks Java static build 2019-06-18 11:57:01 -07:00
README.md Add LevelDB repository link in the Readme 2019-04-01 18:19:09 -07:00
ROCKSDB_LITE.md Fix some typos in comments and docs. 2018-03-08 10:27:25 -08:00
src.mk Support computing miss ratio curves using sim_cache. (#5449) 2019-06-17 16:41:12 -07:00
TARGETS Persistent Stats: persist stats history to disk (#5046) 2019-06-17 15:21:50 -07:00
thirdparty.inc Fix build jemalloc api (#5470) 2019-06-24 17:40:32 -07:00
USERS.md Add Alluxio to USERS.md (#5434) 2019-06-13 12:25:26 -07:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md #5145 , rename port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output (#5152) 2019-04-04 11:38:19 -07:00

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

Linux/Mac Build Status Windows Build status PPC64le Build Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

This code is a library that forms the core building block for a fast key value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it specially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.