A library that provides an embeddable, persistent key-value store for fast storage.
Sagar Vemuri 7103559f49 Improve direct IO range scan performance with readahead (#3884)
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.

**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead: each disk IO prefetches additional data and stores it in a local buffer. Prefetching is enabled automatically once more than 2 IOs are noticed for the same table file during iteration. The readahead size starts at 8 KB and is increased exponentially on each additional sequential IO, up to a maximum of 256 KB. This cuts down the number of IOs needed to complete the range scan.
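
The readahead schedule described above can be sketched roughly as follows. This is an illustrative model, not the code from this PR: `ReadaheadSchedule` and `OnSequentialIo` are hypothetical names, and doubling is assumed as the form of "exponential" growth.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical model of the readahead schedule; not RocksDB code.
struct ReadaheadSchedule {
  static constexpr std::size_t kInitialSize = 8 * 1024;   // 8 KB starting size
  static constexpr std::size_t kMaxSize = 256 * 1024;     // 256 KB cap
  static constexpr int kIosBeforeReadahead = 2;           // enable after >2 IOs

  int num_file_reads = 0;
  std::size_t readahead_size = kInitialSize;

  // Call once per sequential IO on the same table file. Returns how many
  // extra bytes to prefetch with this IO (0 while readahead is not enabled).
  std::size_t OnSequentialIo() {
    ++num_file_reads;
    if (num_file_reads <= kIosBeforeReadahead) {
      return 0;
    }
    const std::size_t current = readahead_size;
    // Assumption: "exponentially increased" is modeled as doubling.
    readahead_size = std::min<std::size_t>(kMaxSize, readahead_size * 2);
    return current;
  }
};
```

Under these assumptions the first two IOs prefetch nothing, the third prefetches 8 KB, and the amount doubles on each subsequent IO until it stays at 256 KB.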

**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure that partial chunks of data already available in the buffer are not re-read from the device.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.
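
To make the interaction between the cache lookup and the prefetch concrete, here is a toy sketch of the idea in the list above. It is not RocksDB's `FilePrefetchBuffer`: the class `ToyPrefetchBuffer`, its signatures, the in-memory `std::string` standing in for the file, and the doubling growth are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Toy model only; names and signatures are hypothetical, not RocksDB's.
class ToyPrefetchBuffer {
 public:
  ToyPrefetchBuffer(const std::string* file, std::size_t readahead_size,
                    std::size_t max_readahead_size)
      : file_(file),
        readahead_size_(readahead_size),
        max_readahead_size_(max_readahead_size) {}

  // Makes the buffer cover [offset, offset + n), clamped to the file size.
  // Bytes already buffered at or after `offset` are copied over rather than
  // read from the "device" again.
  void Prefetch(std::size_t offset, std::size_t n) {
    const std::size_t end = std::min(offset + n, file_->size());
    if (offset >= end) return;  // nothing to read (at or past end of file)
    const std::size_t buf_end = buffer_offset_ + buffer_.size();
    std::string next;
    std::size_t read_from = offset;
    if (offset >= buffer_offset_ && offset < buf_end) {
      next = buffer_.substr(offset - buffer_offset_);  // reuse overlap
      read_from = buf_end;
    }
    if (read_from < end) {
      next.append(*file_, read_from, end - read_from);
      bytes_read_from_file_ += end - read_from;
    }
    buffer_ = std::move(next);
    buffer_offset_ = offset;
  }

  // Serves the read from the buffer when possible. On a miss, prefetches the
  // requested range plus the current readahead, then grows the readahead.
  bool TryReadFromCache(std::size_t offset, std::size_t n,
                        std::string* result) {
    if (offset < buffer_offset_) return false;
    if (offset + n > buffer_offset_ + buffer_.size()) {
      if (readahead_size_ == 0) return false;  // readahead disabled
      Prefetch(offset, n + readahead_size_);
      readahead_size_ = std::min(max_readahead_size_, readahead_size_ * 2);
      if (offset + n > buffer_offset_ + buffer_.size()) return false;
    }
    *result = buffer_.substr(offset - buffer_offset_, n);
    return true;
  }

  std::size_t bytes_read_from_file() const { return bytes_read_from_file_; }

 private:
  const std::string* file_;
  std::size_t readahead_size_;
  std::size_t max_readahead_size_;
  std::string buffer_;
  std::size_t buffer_offset_ = 0;
  std::size_t bytes_read_from_file_ = 0;
};
```

In this sketch, a run of small sequential reads touches the backing file only when a read runs past the end of the buffer, and each such miss fetches a progressively larger window while reusing the bytes that were already buffered.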

**Constraints:**
- Similar to #3282, this is currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on the heap, this could increase memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.

**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```

Do a long range scan: seekrandom with a large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

```
Before:
seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement, averaged over 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884

Differential Revision: D8082143

Pulled By: sagar0

fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
2018-06-21 11:13:08 -07:00
buckifier Update buckifier and TARGETS 2018-03-30 14:26:53 -07:00
build_tools Pass -latomic to linker when using clang 2018-04-25 12:13:41 -07:00
cache Fix LRUCache missing null check on destruct 2018-05-29 15:13:09 -07:00
cmake Search paths provided by intel's "tbbvars.sh". 2018-05-07 14:28:36 -07:00
coverage Suppress lint in old files 2018-01-29 12:56:42 -08:00
db Add file name info to SequentialFileReader. (#4026) 2018-06-21 08:42:24 -07:00
docs Adding blog post for 5.10.2 release 2018-02-13 11:56:59 -08:00
env Build and tests fixes for Solaris Sparc (#4000) 2018-06-15 12:42:53 -07:00
examples Pinnableslice examples and blog post 2017-08-24 12:26:07 -07:00
hdfs Comment out unused variables 2018-03-05 13:13:41 -08:00
include/rocksdb Add kOptionsStatistics to GetProperty() (#3966) 2018-06-15 17:28:01 -07:00
java zLinux build error with gcc and IBM Java headers (#4013) 2018-06-18 13:58:28 -07:00
memtable Remove tests from ROCKSDB_VALGRIND_RUN 2018-05-30 16:15:16 -07:00
monitoring Build and tests fixes for Solaris Sparc (#4000) 2018-06-15 12:42:53 -07:00
options PersistRocksDBOptions() to use WritableFileWriter 2018-05-21 16:42:22 -07:00
port Add file name info to SequentialFileReader. (#4026) 2018-06-21 08:42:24 -07:00
table Improve direct IO range scan performance with readahead (#3884) 2018-06-21 11:13:08 -07:00
third-party fix some text in comments. 2018-04-10 15:59:24 -07:00
tools Add file name info to SequentialFileReader. (#4026) 2018-06-21 08:42:24 -07:00
util Improve direct IO range scan performance with readahead (#3884) 2018-06-21 11:13:08 -07:00
utilities Add file name info to SequentialFileReader. (#4026) 2018-06-21 08:42:24 -07:00
.clang-format A script that automatically reformat affected lines 2014-01-14 12:21:24 -08:00
.gitignore Remove leftover references to phutil_module_cache 2017-08-23 12:12:21 -07:00
.travis.yml travis: osx install zstd lz4 snappy xz (#3893) 2018-06-15 16:57:30 -07:00
appveyor.yml Upgrade Appveyor to VS2017 2018-02-01 13:57:01 -08:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Provide a way to override windows memory allocator with jemalloc for ZSTD 2018-06-04 12:12:48 -07:00
CODE_OF_CONDUCT.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md options.delayed_write_rate use the rate of rate_limiter by default. 2017-05-24 09:58:24 -07:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Improve direct IO range scan performance with readahead (#3884) 2018-06-21 11:13:08 -07:00
INSTALL.md Enable compilation on OpenBSD 2018-03-19 12:30:05 -07:00
issue_template.md Add a template for issues 2017-09-29 11:41:28 -07:00
LANGUAGE-BINDINGS.md Add Nim to the list of language bindings 2018-01-29 09:57:46 -08:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Extend existing unit tests to run with WriteUnprepared as well 2018-06-01 14:58:41 -07:00
README.md Add dual-license info to README.md 2018-03-06 12:43:51 -08:00
ROCKSDB_LITE.md Fix some typos in comments and docs. 2018-03-08 10:27:25 -08:00
src.mk Provide a way to override windows memory allocator with jemalloc for ZSTD 2018-06-04 12:12:48 -07:00
TARGETS Provide a way to override windows memory allocator with jemalloc for ZSTD 2018-06-04 12:12:48 -07:00
thirdparty.inc Provide a way to override windows memory allocator with jemalloc for ZSTD 2018-06-04 12:12:48 -07:00
USERS.md Added ProfaneDB 2017-11-19 10:11:44 -08:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md Commit both PR and internal code review changes 2015-07-07 16:58:20 -07:00

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage


RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.