A library that provides an embeddable, persistent key-value store for fast storage.
Dhruba Borthakur 47c4191fe8 Reduce write amplification by merging files in L0 back into L0
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions.  Files from L0 are
compacted back into L0. The meat of this compaction algorithm
is in PickCompactionHybrid().

All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refer to the timeframe when
these keys were inserted.  Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.

The current compaction algorithm looks for candidate files
starting from the most recent file. It continues to add more
files to the same compaction run as long as the total size of
the files chosen so far is smaller than the size of the next
candidate file. This logic needs to be debated and validated.
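
The selection rule described above can be sketched roughly as follows.
This is a minimal illustration that follows the wording of this summary
literally; it is not the code in PickCompactionHybrid(), and the FileInfo
struct and PickCandidateCount() function are hypothetical names introduced
only for this example. Files are assumed to be ordered from most recent
to oldest.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for per-file metadata; not the library's own type.
struct FileInfo {
  uint64_t size;         // file size in bytes
  uint64_t start_seqno;  // first sequence number covered by the file
  uint64_t end_seqno;    // last sequence number covered by the file
};

// Returns how many files, counted from the front of `files` (the most
// recent file), would be placed in one compaction run under the rule
// stated in the summary: keep adding the next candidate while the total
// size chosen so far is smaller than that candidate's size.
std::size_t PickCandidateCount(const std::vector<FileInfo>& files) {
  if (files.empty()) return 0;
  uint64_t chosen_bytes = files[0].size;  // always start with the newest file
  std::size_t count = 1;
  while (count < files.size() && chosen_bytes < files[count].size) {
    chosen_bytes += files[count].size;
    ++count;
  }
  return count;
}
```

With file sizes 1, 2, 4, 8 (newest first) this sketch selects all four
files; with sizes 10, 5 it stops after the first file, because the chosen
total is not smaller than the next candidate.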

The above logic should reduce write amplification to a
large extent... will publish numbers shortly.

Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).

Differential Revision: https://reviews.facebook.net/D11289
2013-06-30 20:07:04 -07:00
db Reduce write amplification by merging files in L0 back into L0 2013-06-30 20:07:04 -07:00
doc merge 1.5 2012-08-28 11:43:33 -07:00
hdfs Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. 2013-03-20 23:14:03 -07:00
helpers/memenv [RocksDB] cleanup EnvOptions 2013-06-12 11:17:19 -07:00
include Reduce write amplification by merging files in L0 back into L0 2013-06-30 20:07:04 -07:00
java Pom changes to make relase 1.5.7 for java. 2013-01-10 10:43:43 -08:00
linters/src fixing linters. 2012-12-14 14:05:27 -08:00
port Fix Zlib_Compress and Zlib_Uncompress 2013-06-18 16:57:42 -07:00
scribe fix db_test error with scribe logger turned on 2012-08-28 11:22:58 -07:00
snappy Build with gcc-4.7.1-glibc-2.14.1. 2012-09-17 10:56:26 -07:00
table [Rocksdb] Record WriteBlock Times into a histogram 2013-06-17 10:11:10 -07:00
thrift Implement RowLocks for assoc schema 2012-10-03 23:19:01 -07:00
tools Reduce write amplification by merging files in L0 back into L0 2013-06-30 20:07:04 -07:00
util Reduce write amplification by merging files in L0 back into L0 2013-06-30 20:07:04 -07:00
utilities Added stringappend_test back into the unit tests. 2013-06-26 11:41:13 -07:00
VALGRIND_LOGS Use version 3.8.1 for valgrind in third_party and do away with log files 2013-03-06 17:47:31 -08:00
.arcconfig Enable linting in arc. 2013-02-01 11:34:25 -08:00
.gitignore Various build cleanups/improvements 2013-01-14 18:40:22 -08:00
build_detect_platform Modify build_detect_platform to run fbcode.*.* irrespective of $PATH 2013-05-14 22:09:01 -07:00
build_detect_version Make the build-time show up in the leveldb library. 2013-03-11 10:33:15 -07:00
build_java.sh Release 1.5.6 for Java code + Script to automate it. 2012-12-17 12:11:11 -08:00
e Enhance db_bench 2013-03-14 16:00:23 -07:00
fbcode.clang31.sh Cleanup TODO/NEWS/AUTHORS files 2013-01-25 09:11:26 -08:00
fbcode.gcc471.sh Updating fbcode.gcc471.sh to use jemalloc 3.3.1 2013-03-13 15:34:50 -07:00
LICENSE reverting disastrous MOE commit, returning to r21 2011-04-19 23:11:15 +00:00
Makefile Added stringappend_test back into the unit tests. 2013-06-26 11:41:13 -07:00
README Use posix_fallocate as default. 2013-03-13 13:50:26 -07:00
README.fb Release 1.5.9.fb to third party 2013-04-10 17:23:58 -07:00
regression_build_test.sh Minor improvements to the regression testing 2013-01-16 14:47:20 -08:00
valgrind_test.sh make clean in valgrind_test.sh first 2013-04-23 14:25:19 -07:00

rocksdb: A persistent key-value store for flash storage
Authors: The Facebook Database Engineering Team

This code is a library that forms the core building block for a fast
key-value server, especially suited for storing data on flash drives.
It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs
between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF)
and Space-Amplification-Factor (SAF). It has multi-threaded compactions,
making it especially suitable for storing multiple terabytes of data in a
single database.

The core of this code has been derived from open-source leveldb.

The code under this directory implements a system for maintaining a
persistent key/value store.

See doc/index.html for more explanation.
See doc/impl.html for a brief overview of the implementation.

The public interface is in include/*.h.  Callers should not include or
rely on the details of any other header files in this package.  Those
internal APIs may be changed without warning.

Guide to header files:

include/db.h
    Main interface to the DB: Start here

include/options.h
    Control over the behavior of an entire database, and also
    control over the behavior of individual reads and writes.

include/comparator.h
    Abstraction for user-specified comparison function.  If you want
    just bytewise comparison of keys, you can use the default comparator,
    but clients can write their own comparator implementations if they
    want custom ordering (e.g. to handle different character
    encodings, etc.)

include/iterator.h
    Interface for iterating over data. You can get an iterator
    from a DB object.

include/write_batch.h
    Interface for atomically applying multiple updates to a database.

include/slice.h
    A simple module for maintaining a pointer and a length into some
    other byte array.

include/status.h
    Status is returned from many of the public interfaces and is used
    to report success and various kinds of errors.

include/env.h
    Abstraction of the OS environment.  A POSIX implementation of
    this interface is in util/env_posix.cc.

include/table.h
include/table_builder.h
    Lower-level modules that most clients probably won't use directly.
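
To make two of the ideas in the header guide concrete, here is a small
standalone sketch of a pointer-plus-length view (the idea behind
include/slice.h) and of bytewise versus custom key ordering (the idea
behind include/comparator.h). It does not use the library's headers, and
ByteView, MakeView, BytewiseCompare and ReverseBytewiseCompare are
hypothetical names chosen for this illustration; the real classes may
differ in naming and detail.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical non-owning view: a pointer and a length into some other
// byte array. The referenced bytes must outlive the view.
struct ByteView {
  const char* data;
  std::size_t size;
};

// Builds a view over a string's bytes. Do not pass a temporary string,
// since the view would then point at freed memory.
inline ByteView MakeView(const std::string& s) {
  return ByteView{s.data(), s.size()};
}

// Bytewise three-way comparison: negative if a < b, zero if equal,
// positive if a > b. A shorter key that is a prefix of a longer key
// sorts first.
inline int BytewiseCompare(const ByteView& a, const ByteView& b) {
  const std::size_t min_len = std::min(a.size, b.size);
  int r = (min_len == 0) ? 0 : std::memcmp(a.data, b.data, min_len);
  if (r == 0) {
    if (a.size < b.size) r = -1;
    else if (a.size > b.size) r = 1;
  }
  return r;
}

// Example of a custom ordering: the reverse of the bytewise order.
inline int ReverseBytewiseCompare(const ByteView& a, const ByteView& b) {
  return BytewiseCompare(b, a);
}
```

A store that sorts keys with the reverse comparison would iterate from
the largest key to the smallest; the ordering is a property of the
database, so the same comparison has to be used every time it is opened.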