A library that provides an embeddable, persistent key-value store for fast storage.
Haobo Xu 778e179046 [RocksDB] Sync file to disk incrementally
Summary:
During compaction, we sync the output files after they are fully written out. This blocks the compaction thread unnecessarily and makes the write traffic bursty.
This diff simply asks the OS to sync data incrementally, in the background, as it is written. The hope is that, by the final sync, most of the data is already on disk, so we block less on the sync call. Each compaction then runs faster, and we can use fewer compaction threads to saturate IO.
In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.
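
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this diff: the function name WriteWithIncrementalSync and its chunk_size and sync_interval parameters are hypothetical, and it assumes Linux, where sync_file_range() with SYNC_FILE_RANGE_WRITE starts write-back of a byte range without waiting for it to complete. On other platforms the sketch simply skips the hint and relies on the final fsync().

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for sync_file_range() on Linux
#endif
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper (not RocksDB code): writes `data` to `fd` in chunks of
// `chunk_size` bytes. Each time at least `sync_interval` new bytes have been
// written, it asks the kernel to start writing that range back to disk in the
// background. A final blocking fsync() makes the whole file durable.
bool WriteWithIncrementalSync(int fd, const std::string& data,
                              size_t chunk_size, size_t sync_interval) {
  if (chunk_size == 0) return false;
  size_t written = 0;      // bytes written so far
  size_t last_synced = 0;  // offset up to which write-back was requested
  while (written < data.size()) {
    size_t n = std::min(chunk_size, data.size() - written);
    ssize_t r = write(fd, data.data() + written, n);
    if (r < 0) return false;
    written += static_cast<size_t>(r);
    if (written - last_synced >= sync_interval) {
#if defined(__linux__)
      // Advisory only: start asynchronous write-back of the new range.
      // Errors are ignored because the final fsync() provides durability.
      (void)sync_file_range(fd, static_cast<off_t>(last_synced),
                            static_cast<off_t>(written - last_synced),
                            SYNC_FILE_RANGE_WRITE);
#endif
      last_synced = written;
    }
  }
  // By now most of the data should already be on its way to disk, so this
  // call is expected to block for less time than a single sync at the end.
  return fsync(fd) == 0;
}
```

A caller would pick sync_interval as a tradeoff: a smaller interval smooths the write traffic more but issues more system calls.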

Some quick tests show a 10-20% improvement in per-thread compaction throughput. Combined with posix advice on compaction reads, just 5 threads are enough to almost saturate the udb flash bandwidth for the 800-byte write-only benchmark.
What's more promising is that, with saturated IO, iostat shows that the average wait time is smoother and much smaller.
For the 800-byte write-only test:
Before the change: await oscillates between 10ms and 3ms
After the change: await ranges from 1ms to 3ms

Will also test against a read-modify-write workload to see whether the high P99 read latency is resolved.

Will introduce a parameter to control the sync interval in a follow-up diff after cleaning up EnvOptions.

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11115
2013-06-12 12:53:59 -07:00
db [Rocksdb] [Multiget] Introduced multiget into db_bench 2013-06-12 12:42:21 -07:00
doc merge 1.5 2012-08-28 11:43:33 -07:00
hdfs Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. 2013-03-20 23:14:03 -07:00
helpers/memenv [RocksDB] cleanup EnvOptions 2013-06-12 11:17:19 -07:00
include [RocksDB] Sync file to disk incrementally 2013-06-12 12:53:59 -07:00
java Pom changes to make relase 1.5.7 for java. 2013-01-10 10:43:43 -08:00
linters/src fixing linters. 2012-12-14 14:05:27 -08:00
port [RocksDB] Introduce Fast Mutex option 2013-06-01 23:11:34 -07:00
scribe fix db_test error with scribe logger turned on 2012-08-28 11:22:58 -07:00
snappy Build with gcc-4.7.1-glibc-2.14.1. 2012-09-17 10:56:26 -07:00
table [RocksDB] cleanup EnvOptions 2013-06-12 11:17:19 -07:00
thrift Implement RowLocks for assoc schema 2012-10-03 23:19:01 -07:00
tools [RocksDB] cleanup EnvOptions 2013-06-12 11:17:19 -07:00
util [RocksDB] Sync file to disk incrementally 2013-06-12 12:53:59 -07:00
utilities Completed the implementation and test cases for Redis API. 2013-06-11 11:19:49 -07:00
VALGRIND_LOGS Use version 3.8.1 for valgrind in third_party and do away with log files 2013-03-06 17:47:31 -08:00
.arcconfig Enable linting in arc. 2013-02-01 11:34:25 -08:00
.gitignore Various build cleanups/improvements 2013-01-14 18:40:22 -08:00
build_detect_platform Modify build_detect_platform to run fbcode.*.* irrespective of $PATH 2013-05-14 22:09:01 -07:00
build_detect_version Make the build-time show up in the leveldb library. 2013-03-11 10:33:15 -07:00
build_java.sh Release 1.5.6 for Java code + Script to automate it. 2012-12-17 12:11:11 -08:00
e Enhance db_bench 2013-03-14 16:00:23 -07:00
fbcode.clang31.sh Cleanup TODO/NEWS/AUTHORS files 2013-01-25 09:11:26 -08:00
fbcode.gcc471.sh Updating fbcode.gcc471.sh to use jemalloc 3.3.1 2013-03-13 15:34:50 -07:00
LICENSE reverting disastrous MOE commit, returning to r21 2011-04-19 23:11:15 +00:00
Makefile Completed the implementation and test cases for Redis API. 2013-06-11 11:19:49 -07:00
README Use posix_fallocate as default. 2013-03-13 13:50:26 -07:00
README.fb Release 1.5.9.fb to third party 2013-04-10 17:23:58 -07:00
regression_build_test.sh Minor improvements to the regression testing 2013-01-16 14:47:20 -08:00
valgrind_test.sh make clean in valgrind_test.sh first 2013-04-23 14:25:19 -07:00

rocksdb: A persistent key-value store for flash storage
Authors: The Facebook Database Engineering Team

This code is a library that forms the core building block for a fast
key value server, especially suited for storing data on flash drives.
It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs
between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF)
and Space-Amplification-Factor (SAF). It has multi-threaded compactions,
making it especially suitable for storing multiple terabytes of data in a
single database.

The core of this code has been derived from open-source leveldb.

The code under this directory implements a system for maintaining a
persistent key/value store.

See doc/index.html for more explanation.
See doc/impl.html for a brief overview of the implementation.

The public interface is in include/*.h.  Callers should not include or
rely on the details of any other header files in this package.  Those
internal APIs may be changed without warning.

Guide to header files:

include/db.h
    Main interface to the DB: Start here

include/options.h
    Control over the behavior of an entire database, and also
    control over the behavior of individual reads and writes.

include/comparator.h
    Abstraction for user-specified comparison function.  If you want
    just bytewise comparison of keys, you can use the default comparator,
    but clients can write their own comparator implementations if they
    want custom ordering (e.g. to handle different character
    encodings, etc.)

include/iterator.h
    Interface for iterating over data. You can get an iterator
    from a DB object.

include/write_batch.h
    Interface for atomically applying multiple updates to a database.

include/slice.h
    A simple module for maintaining a pointer and a length into some
    other byte array.

include/status.h
    Status is returned from many of the public interfaces and is used
    to report success and various kinds of errors.

include/env.h
    Abstraction of the OS environment.  A POSIX implementation of
    this interface is in util/env_posix.cc.

include/table.h
include/table_builder.h
    Lower-level modules that most clients probably won't use directly
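
To make two of the concepts in the guide above more concrete (a slice as a pointer plus a length, and a user-specified key ordering), here is a small standalone sketch. It does not use the library's headers, and the names SliceSketch, MakeSlice, BytewiseCompare and CaseInsensitiveCompare are hypothetical, chosen only for illustration; the real interfaces are declared in include/slice.h and include/comparator.h.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a slice: a pointer and a length into a byte
// array owned by someone else. The slice does not copy or free the bytes.
struct SliceSketch {
  const char* data;
  size_t size;
};

// The returned slice is only valid while `s` is alive and unmodified.
SliceSketch MakeSlice(const std::string& s) {
  return SliceSketch{s.data(), s.size()};
}

// Bytewise ordering: compare the common prefix with memcmp, then let the
// shorter key sort first. Returns <0, 0 or >0, like memcmp.
int BytewiseCompare(const SliceSketch& a, const SliceSketch& b) {
  size_t min_len = a.size < b.size ? a.size : b.size;
  int r = std::memcmp(a.data, b.data, min_len);
  if (r == 0) {
    if (a.size < b.size) r = -1;
    else if (a.size > b.size) r = 1;
  }
  return r;
}

// Example of a custom ordering: ASCII case-insensitive comparison.
int CaseInsensitiveCompare(const SliceSketch& a, const SliceSketch& b) {
  size_t min_len = a.size < b.size ? a.size : b.size;
  for (size_t i = 0; i < min_len; ++i) {
    int ca = std::tolower(static_cast<unsigned char>(a.data[i]));
    int cb = std::tolower(static_cast<unsigned char>(b.data[i]));
    if (ca != cb) return ca < cb ? -1 : 1;
  }
  if (a.size < b.size) return -1;
  if (a.size > b.size) return 1;
  return 0;
}
```

Under bytewise ordering "Banana" sorts before "apple" because uppercase ASCII letters have smaller byte values, while the case-insensitive ordering puts "apple" first. A database must always be opened with the same ordering it was created with, since the ordering determines how keys are laid out on disk.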