A library that provides an embeddable, persistent key-value store for fast storage.
Sagar Vemuri df23e80e5f Improve performance of long range scans with readahead
Summary:
This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead to prefetch additional data on each disk IO. Prefetching is enabled automatically once more than 2 IOs are noticed for the same table file during iteration. The readahead size starts at 8 KB and is increased exponentially on each additional sequential IO, up to a maximum of 256 KB. This cuts down the number of IOs needed to complete the range scan.
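The growth schedule described above can be sketched as a small simulation. This is not the RocksDB implementation: it assumes that "exponentially increased" means doubling and that the third IO on a file is the first to use readahead, and all names in it are hypothetical.

```python
KB = 1024
INITIAL_READAHEAD = 8 * KB    # starting size stated in the summary
MAX_READAHEAD = 256 * KB      # cap stated in the summary
IOS_BEFORE_READAHEAD = 2      # readahead starts after more than 2 IOs on a file


def readahead_schedule(num_ios):
    """Return the readahead size in bytes for each of num_ios sequential IOs.

    The first IOS_BEFORE_READAHEAD IOs use no readahead (size 0).
    Assumption: the size doubles on every later IO until it hits the cap.
    """
    sizes = []
    next_size = INITIAL_READAHEAD
    for io_index in range(1, num_ios + 1):
        if io_index <= IOS_BEFORE_READAHEAD:
            sizes.append(0)
        else:
            sizes.append(next_size)
            next_size = min(next_size * 2, MAX_READAHEAD)
    return sizes


if __name__ == "__main__":
    for i, size in enumerate(readahead_schedule(10), start=1):
        print(f"IO {i:2d}: readahead {size // KB:3d} KB")
```

Under these assumptions the cap is reached on the eighth IO, after which every further sequential IO prefetches 256 KB.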

Constraints:
- The prefetched data is stored by the OS in the page cache, so this currently works only for use cases that do not use direct reads, i.e. applications that rely on the page cache. (Direct-I/O support will be enabled in a later PR.)
- This is currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).

Thanks to siying for the original idea and implementation.

**Benchmarks:**
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: seekrandom with a large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

Page cache was cleared before each experiment with the command:
```
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
```
```
Before:
seekrandom   :   34020.945 micros/op 29 ops/sec;   32.5 MB/s (1636 of 1999 found)
With this change:
seekrandom   :    8726.912 micros/op 114 ops/sec;  126.8 MB/s (5702 of 6999 found)
```
~3.9X performance improvement.
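The ~3.9X figure can be checked against the two result lines above. The short calculation below uses only the numbers reported there; the variable names are illustrative.

```python
# Figures copied from the benchmark output above.
before_micros_per_op = 34020.945
after_micros_per_op = 8726.912
before_mb_per_sec = 32.5
after_mb_per_sec = 126.8

# Lower latency per op and higher throughput should give about the same ratio.
latency_speedup = before_micros_per_op / after_micros_per_op
throughput_speedup = after_mb_per_sec / before_mb_per_sec

print(f"latency speedup:    {latency_speedup:.2f}x")
print(f"throughput speedup: {throughput_speedup:.2f}x")
```

Both ratios round to 3.9, which matches the stated improvement.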

Also verified with strace and gdb that the readahead size is increasing as expected.
```
strace -e readahead -f -T -t -p <db_bench process pid>
```
Closes https://github.com/facebook/rocksdb/pull/3282

Differential Revision: D6586477

Pulled By: sagar0

fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c
2018-02-05 15:00:23 -08:00
| Name | Last commit | Date |
| --- | --- | --- |
| buckifier | Remove import use from TARGETS | 2017-11-30 15:27:34 -08:00 |
| build_tools | FreeBSD build support for RocksDB and RocksJava | 2018-01-26 11:22:51 -08:00 |
| cache | fix gflags namespace | 2017-12-01 10:42:05 -08:00 |
| cmake | add missing config checks to CMakeLists.txt | 2017-11-30 22:57:00 -08:00 |
| coverage | Fix /bin/bash shebangs | 2017-08-03 15:56:46 -07:00 |
| db | Fix Flush() keep waiting after flush finish | 2018-01-18 17:50:07 -08:00 |
| docs | Blog post for WritePrepared Txn | 2017-12-20 11:42:15 -08:00 |
| env | Suppress valgrind "unimplemented functionality" error | 2017-11-15 14:28:34 -08:00 |
| examples | Pinnableslice examples and blog post | 2017-08-24 12:26:07 -07:00 |
| hdfs | Revert "comment out unused parameters" | 2017-07-21 18:26:26 -07:00 |
| include/rocksdb | StackableDB optionally take shared ownership of the underlying DB | 2018-01-31 11:07:20 -08:00 |
| java | FreeBSD build support for RocksDB and RocksJava | 2018-01-26 11:22:51 -08:00 |
| memtable | fix gflags namespace | 2017-12-01 10:42:05 -08:00 |
| monitoring | fix ThreadStatus for bottom-pri compaction threads | 2017-12-14 14:57:49 -08:00 |
| options | Make Universal compaction options dynamic | 2017-12-11 13:27:06 -08:00 |
| port | FreeBSD build support for RocksDB and RocksJava | 2018-01-26 11:22:51 -08:00 |
| table | Improve performance of long range scans with readahead | 2018-02-05 15:00:23 -08:00 |
| third-party | Enable MSVC W4 with a few exceptions. Fix warnings and bugs | 2017-10-19 10:57:12 -07:00 |
| tools | Fix db_bench write being disabled in lite build | 2018-01-09 10:57:29 -08:00 |
| util | FIXED: string buffers potentially too small to fit formatted write | 2017-12-20 08:12:22 -08:00 |
| utilities | WritePrepared Txn: address some pending TODOs | 2018-01-09 08:57:20 -08:00 |
| .clang-format | A script that automatically reformat affected lines | 2014-01-14 12:21:24 -08:00 |
| .gitignore | Remove leftover references to phutil_module_cache | 2017-08-23 12:12:21 -07:00 |
| .travis.yml | CMake cross platform Java support and add JNI to travis | 2017-11-28 12:27:53 -08:00 |
| appveyor.yml | Make Windows dep switches compatible with other builds | 2018-01-05 14:56:54 -08:00 |
| AUTHORS | Update RocksDB Authors File | 2017-10-18 14:42:10 -07:00 |
| CMakeLists.txt | Make Windows dep switches compatible with other builds | 2018-01-05 14:56:54 -08:00 |
| CODE_OF_CONDUCT.md | Add Code of Conduct | 2017-12-05 18:42:35 -08:00 |
| CONTRIBUTING.md | Add Code of Conduct | 2017-12-05 18:42:35 -08:00 |
| COPYING | Add GPLv2 as an alternative license. | 2017-04-27 18:06:12 -07:00 |
| DEFAULT_OPTIONS_HISTORY.md | options.delayed_write_rate use the rate of rate_limiter by default. | 2017-05-24 09:58:24 -07:00 |
| DUMP_FORMAT.md | First version of rocksdb_dump and rocksdb_undump. | 2015-06-19 16:24:36 -07:00 |
| HISTORY.md | Improve performance of long range scans with readahead | 2018-02-05 15:00:23 -08:00 |
| INSTALL.md | FreeBSD build support for RocksDB and RocksJava | 2018-01-26 11:22:51 -08:00 |
| issue_template.md | Add a template for issues | 2017-09-29 11:41:28 -07:00 |
| LANGUAGE-BINDINGS.md | Add Elixir to the list of language bindings | 2017-11-21 10:13:14 -08:00 |
| LICENSE.Apache | Change RocksDB License | 2017-07-15 16:11:23 -07:00 |
| LICENSE.leveldb | Add back the LevelDB license file | 2017-07-16 18:42:18 -07:00 |
| Makefile | Fix PowerPC dynamic java build | 2018-01-26 11:23:18 -08:00 |
| README.md | Appveyor badge to show master branch | 2016-07-26 13:54:08 -07:00 |
| ROCKSDB_LITE.md | Optimistic Transactions | 2015-05-29 14:36:35 -07:00 |
| src.mk | Refactor ReadBlockContents() | 2017-12-11 15:27:32 -08:00 |
| TARGETS | WritePrepared Txn: make buck tests parallel | 2017-12-18 14:42:09 -08:00 |
| thirdparty.inc | Make Windows dep switches compatible with other builds | 2018-01-05 14:56:54 -08:00 |
| USERS.md | Added ProfaneDB | 2017-11-19 10:11:44 -08:00 |
| Vagrantfile | Update Vagrant file (test internal phabricator workflow) | 2016-10-28 15:39:19 -07:00 |
| WINDOWS_PORT.md | Commit both PR and internal code review changes | 2015-07-07 16:58:20 -07:00 |

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage


RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/