A library that provides an embeddable, persistent key-value store for fast storage.
778e179046
Summary: During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic. This diff simply asks the OS to sync data incrementally as they are written, in the background. The hope is that, at the final sync, most of the data are already on disk and we would block less on the sync call. Thus, each compaction runs faster and we could use fewer compaction threads to saturate IO. In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too. Some quick tests show 10~20% improvement in per-thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for the 800-byte write-only benchmark. What's more promising is that, with saturated IO, iostat shows the average wait time is actually smoother and much smaller. For the 800-byte write-only test: Before the change: await oscillates between 10ms and 3ms. After the change: await ranges 1-3ms. Will test against a read-modify-write workload too, to see whether the high read latency P99 could be resolved. Will introduce a parameter to control the sync interval in a follow-up diff after cleaning up EnvOptions. Test Plan: make check; db_bench; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11115 |
||
---|---|---|
db | ||
doc | ||
hdfs | ||
helpers/memenv | ||
include | ||
java | ||
linters/src | ||
port | ||
scribe | ||
snappy | ||
table | ||
thrift | ||
tools | ||
util | ||
utilities | ||
VALGRIND_LOGS | ||
.arcconfig | ||
.gitignore | ||
build_detect_platform | ||
build_detect_version | ||
build_java.sh | ||
e | ||
fbcode.clang31.sh | ||
fbcode.gcc471.sh | ||
LICENSE | ||
Makefile | ||
README | ||
README.fb | ||
regression_build_test.sh | ||
valgrind_test.sh |
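
The incremental-sync idea in the commit summary above can be sketched roughly as follows. This is not the code from the diff: the function name `WriteWithIncrementalSync`, the `chunk_size` and `sync_interval` parameters, and the choice of `sync_file_range` with `SYNC_FILE_RANGE_WRITE` on Linux are all assumptions made for illustration. The sketch also assumes the file descriptor points at a freshly created file written from offset 0.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper, not RocksDB's implementation.
// Writes `data` to `fd` in chunks. Every `sync_interval` bytes it asks the OS
// to start writing back the newly written range without waiting for it, so
// that the final blocking sync has less dirty data left to flush.
// Assumes `fd` refers to a new file whose current offset is 0.
bool WriteWithIncrementalSync(int fd, const std::string& data,
                              size_t chunk_size, size_t sync_interval) {
  if (chunk_size == 0) return false;  // avoid looping forever
  size_t written = 0;      // total bytes handed to write() so far
  size_t last_synced = 0;  // offset up to which writeback was requested
  while (written < data.size()) {
    const size_t n = std::min(chunk_size, data.size() - written);
    const ssize_t r = write(fd, data.data() + written, n);
    if (r < 0) return false;
    written += static_cast<size_t>(r);
    if (written - last_synced >= sync_interval) {
#ifdef __linux__
      // Linux-specific hint: initiate asynchronous writeback of the range.
      // The result is ignored because this is only an optimization.
      (void)sync_file_range(fd, static_cast<off_t>(last_synced),
                            static_cast<off_t>(written - last_synced),
                            SYNC_FILE_RANGE_WRITE);
#endif
      last_synced = written;
    }
  }
  // Final blocking sync; ideally most data is already on disk by now.
  return fdatasync(fd) == 0;
}
```

A caller would open the output file, pass its descriptor together with the buffer to write, and pick a sync interval that is a multiple of the chunk size. The interval trades smoother write traffic against more frequent writeback requests; the commit summary notes that making it configurable was left to a follow-up diff.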
rocksdb: A persistent key-value store for flash storage
Authors: The Facebook Database Engineering Team

This code is a library that forms the core building block for a fast key value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

The core of this code has been derived from open-source leveldb.

The code under this directory implements a system for maintaining a persistent key/value store.

See doc/index.html for more explanation.
See doc/impl.html for a brief overview of the implementation.

The public interface is in include/*.h. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Guide to header files:

include/db.h
    Main interface to the DB: Start here

include/options.h
    Control over the behavior of an entire database, and also control over the behavior of individual reads and writes.

include/comparator.h
    Abstraction for user-specified comparison function. If you want just bytewise comparison of keys, you can use the default comparator, but clients can write their own comparator implementations if they want custom ordering (e.g. to handle different character encodings, etc.)

include/iterator.h
    Interface for iterating over data. You can get an iterator from a DB object.

include/write_batch.h
    Interface for atomically applying multiple updates to a database.

include/slice.h
    A simple module for maintaining a pointer and a length into some other byte array.

include/status.h
    Status is returned from many of the public interfaces and is used to report success and various kinds of errors.

include/env.h
    Abstraction of the OS environment. A posix implementation of this interface is in util/env_posix.cc

include/table.h
include/table_builder.h
    Lower-level modules that most clients probably won't use directly
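
To make the slice and comparator descriptions above more concrete, here is a minimal, self-contained sketch of a (pointer, length) view and a bytewise three-way comparison over it. The names `ByteView`, `MakeView` and `BytewiseCompare` are hypothetical and are not the declarations in include/slice.h or include/comparator.h; the sketch only illustrates the idea described in the guide.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the idea behind include/slice.h: a non-owning
// view (pointer plus length) into bytes that live elsewhere. The viewed
// storage must outlive the view.
struct ByteView {
  const char* data;
  size_t size;
};

// Builds a view over the bytes of an existing std::string.
inline ByteView MakeView(const std::string& s) {
  return ByteView{s.data(), s.size()};
}

// Bytewise three-way comparison: returns <0, 0, or >0 when `a` sorts
// before, equal to, or after `b`. A shorter key that is a prefix of a
// longer one sorts first.
inline int BytewiseCompare(const ByteView& a, const ByteView& b) {
  const size_t min_len = a.size < b.size ? a.size : b.size;
  int r = min_len == 0 ? 0 : std::memcmp(a.data, b.data, min_len);
  if (r == 0) {
    if (a.size < b.size) {
      r = -1;
    } else if (a.size > b.size) {
      r = 1;
    }
  }
  return r;
}
```

A custom ordering, as mentioned for include/comparator.h, would replace `BytewiseCompare` with a function of the same shape that decodes the keys before comparing them.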