A library that provides an embeddable, persistent key-value store for fast storage.
Nathan Bronson 7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
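
To illustrate the allocation scheme described above, here is a minimal sketch of a spinlock-protected arena fronted by small per-shard buffers. It is not the code from this diff: `SpinLock`, `ShardedArena`, the `shard_hint` parameter (standing in for a per-core index), and the "one quarter of a shard" cutoff for large requests are all assumptions made for illustration, and alignment is ignored.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical test-and-set spinlock; usable with std::lock_guard.
class SpinLock {
 public:
  void lock() {
    while (locked_.exchange(true, std::memory_order_acquire)) {
      std::this_thread::yield();  // back off while another thread holds it
    }
  }
  void unlock() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};

// Hypothetical arena: one spinlock guards the block list, and each shard
// hands out small pieces of its own buffer so threads rarely contend.
class ShardedArena {
 public:
  ShardedArena(std::size_t num_shards, std::size_t shard_bytes)
      : shards_(num_shards), shard_bytes_(shard_bytes) {
    assert(num_shards > 0 && shard_bytes > 0);
  }

  // shard_hint stands in for "which core am I on" (an assumption).
  char* Allocate(std::size_t bytes, std::size_t shard_hint) {
    assert(bytes > 0);
    // Large requests bypass the shards and take the main lock directly.
    if (bytes > shard_bytes_ / 4) return AllocateBlock(bytes);

    Shard& shard = shards_[shard_hint % shards_.size()];
    std::lock_guard<SpinLock> guard(shard.lock);
    if (shard.remaining < bytes) {
      // Refill from the main arena; any leftover bytes are simply wasted.
      shard.cursor = AllocateBlock(shard_bytes_);
      shard.remaining = shard_bytes_;
    }
    char* result = shard.cursor;
    shard.cursor += bytes;
    shard.remaining -= bytes;
    return result;
  }

 private:
  struct Shard {
    SpinLock lock;
    char* cursor = nullptr;
    std::size_t remaining = 0;
  };

  char* AllocateBlock(std::size_t bytes) {
    std::lock_guard<SpinLock> guard(main_lock_);
    blocks_.push_back(std::unique_ptr<char[]>(new char[bytes]));
    return blocks_.back().get();
  }

  std::vector<Shard> shards_;
  const std::size_t shard_bytes_;
  SpinLock main_lock_;
  std::vector<std::unique_ptr<char[]>> blocks_;  // freed with the arena
};
```

In this sketch, threads that pass different hints only meet on the main lock when a shard runs dry, which is the contention-avoidance idea the summary describes.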

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
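
The spin-then-sleep idea can be sketched roughly as follows. This is an assumption-laden illustration rather than the write-thread code from this diff: `WaitSlot`, `AwaitAdaptive`, `Signal`, and the `slow_yield` threshold parameter are hypothetical names, and `std::this_thread::yield()` stands in for `sched_yield`.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical per-waiter state. It must outlive the call to Signal().
struct WaitSlot {
  std::atomic<bool> done{false};
  std::mutex mu;
  std::condition_variable cv;
};

// Waits until slot.done is set. Returns true if the wait finished without
// blocking on the condition variable, false if the thread had to block.
bool AwaitAdaptive(WaitSlot& slot, std::chrono::microseconds max_spin,
                   std::chrono::microseconds slow_yield) {
  using Clock = std::chrono::steady_clock;
  if (slot.done.load(std::memory_order_acquire)) return true;

  const auto deadline = Clock::now() + max_spin;
  auto before = Clock::now();
  while (before < deadline) {
    std::this_thread::yield();
    if (slot.done.load(std::memory_order_acquire)) return true;
    const auto after = Clock::now();
    // A slow yield suggests another thread actually ran on this core, so
    // further spinning would only burn CPU; fall back to blocking.
    if (after - before >= slow_yield) break;
    before = after;
  }

  std::unique_lock<std::mutex> lock(slot.mu);
  slot.cv.wait(lock,
               [&slot] { return slot.done.load(std::memory_order_acquire); });
  return false;
}

// Marks the slot complete and wakes a blocked waiter, if there is one.
void Signal(WaitSlot& slot) {
  {
    std::lock_guard<std::mutex> lock(slot.mu);
    slot.done.store(true, std::memory_order_release);
  }
  slot.cv.notify_one();
}
```

A caller might use `AwaitAdaptive(slot, std::chrono::microseconds(100), std::chrono::microseconds(3))`; the 100 usec budget matches the default quoted above, while the 3 usec threshold is only an illustrative value.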

Parallel writes are not currently compatible with
in-place updates, update callbacks, or delete filtering.
Enable them with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
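
The ordering constraint in the paragraph above (insert in parallel, then advance the sequence number only after every insertion has finished) can be sketched with a toy model. Everything here is hypothetical: `ToyMemtable` is a mutex-protected `std::map` standing in for the concurrent skiplist, and `ToyDB::WriteGroup` spawns helper threads purely for illustration, whereas in the design described above the writer threads themselves perform the insertions.

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in for a memtable: entries are keyed by (user key, sequence).
class ToyMemtable {
 public:
  void Add(uint64_t seq, const std::string& key, const std::string& value) {
    std::lock_guard<std::mutex> guard(mu_);
    entries_[std::make_pair(key, seq)] = value;
  }

  // Finds the newest value for key whose sequence is <= snapshot.
  bool Get(const std::string& key, uint64_t snapshot,
           std::string* value) const {
    std::lock_guard<std::mutex> guard(mu_);
    auto it = entries_.upper_bound(std::make_pair(key, snapshot));
    if (it == entries_.begin()) return false;
    --it;
    if (it->first.first != key) return false;
    *value = it->second;
    return true;
  }

 private:
  mutable std::mutex mu_;
  std::map<std::pair<std::string, uint64_t>, std::string> entries_;
};

using Batch = std::vector<std::pair<std::string, std::string>>;

class ToyDB {
 public:
  // The "leader" assigns each batch a sequence range, lets the batches be
  // inserted in parallel, and publishes the new sequence number only after
  // all insertions have completed.
  void WriteGroup(const std::vector<Batch>& group) {
    std::lock_guard<std::mutex> leader(write_mu_);  // one group at a time
    uint64_t next = visible_seq_.load(std::memory_order_relaxed) + 1;
    std::vector<std::thread> writers;
    for (const Batch& batch : group) {
      const uint64_t first = next;
      next += batch.size();
      writers.emplace_back([this, &batch, first] {
        uint64_t seq = first;
        for (const auto& kv : batch) mem_.Add(seq++, kv.first, kv.second);
      });
    }
    for (auto& w : writers) w.join();  // block until every insert is done
    visible_seq_.store(next - 1, std::memory_order_release);
  }

  // Readers only see entries at or below the published sequence number.
  bool Get(const std::string& key, std::string* value) const {
    return mem_.Get(key, visible_seq_.load(std::memory_order_acquire), value);
  }

  uint64_t visible_seq() const {
    return visible_seq_.load(std::memory_order_acquire);
  }

 private:
  std::mutex write_mu_;
  std::atomic<uint64_t> visible_seq_{0};
  ToyMemtable mem_;
};
```

Because a reader's snapshot is taken from the published sequence number, it never observes a group that is only partly inserted, which is the property the blocking step is meant to preserve.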

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2 GHz with lots of memory and an SSD.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
arcanist_util Don't spew warnings when flint doesn't exist 2015-10-19 18:47:59 -07:00
build_tools Update liblz4 to r131 2015-12-23 17:13:31 -08:00
coverage Fix coverage script 2014-11-03 14:53:00 -08:00
db support for concurrent adds to memtable 2015-12-25 11:03:40 -08:00
doc Lint everything 2015-11-16 12:56:21 -08:00
examples Fix examples 2015-12-16 17:04:46 +01:00
hdfs Merge pull request #863 from zhangyybuaa/fix_hdfs_error 2015-12-22 09:27:51 +01:00
include/rocksdb support for concurrent adds to memtable 2015-12-25 11:03:40 -08:00
java fix typos in comments 2015-12-11 01:54:48 +09:00
memtable Enable MS compiler warning c4244. 2015-12-11 16:47:34 -08:00
port support for concurrent adds to memtable 2015-12-25 11:03:40 -08:00
table support for concurrent adds to memtable 2015-12-25 11:03:40 -08:00
third-party Enable MS compiler warning c4244. 2015-12-11 16:47:34 -08:00
tools Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata 2015-12-23 22:59:42 -08:00
util support for concurrent adds to memtable 2015-12-25 11:03:40 -08:00
utilities Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata 2015-12-23 22:59:42 -08:00
.arcconfig Integrate Jenkins with Phabricator 2015-04-07 11:56:29 -07:00
.clang-format A script that automatically reformat affected lines 2014-01-14 12:21:24 -08:00
.gitignore New amalgamation target 2015-10-01 08:29:31 +13:00
.travis.yml Run ROCKSDB_LITE tests in travis 2015-10-16 10:47:37 -07:00
appveyor.yml Exclude DBTest.FileCreationRandomFailure as a long running test 2015-11-17 13:54:13 -08:00
AUTHORS Add AUTHORS file. Fix #203 2014-09-29 10:52:18 -07:00
CMakeLists.txt support for concurrent adds to memtable 2015-12-25 11:03:40 -08:00
CONTRIBUTING.md facebook accounts are not required for CLA signers 2014-07-08 05:57:54 -04:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Change default options.delayed_write_rate 2015-12-23 14:51:55 -08:00
INSTALL.md Update 4 is required for building with MS Visual Studio 13 2015-10-15 11:06:02 -07:00
LICENSE Fix copyright year 2014-03-12 12:06:58 -07:00
Makefile Clean up listener_test (reuse db_test_util) 2015-12-14 13:36:32 -08:00
PATENTS Update Patent Grant. 2015-04-13 10:33:43 +01:00
README.md Replaced "built on on earlier work" by "built on earlier work" in README.md 2014-09-17 01:16:17 -07:00
ROCKSDB_LITE.md Optimistic Transactions 2015-05-29 14:36:35 -07:00
src.mk support for concurrent adds to memtable 2015-12-25 11:03:40 -08:00
thirdparty.inc Enable override to 3rd party linkage 2015-11-24 11:51:37 -08:00
USERS.md Add Cloudera's blog post to USERS.md 2015-09-02 14:04:51 -07:00
Vagrantfile RocksDB on FreeBSD support 2015-02-26 15:19:17 -08:00
WINDOWS_PORT.md Commit both PR and internal code review changes 2015-07-07 16:58:20 -07:00

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/