Make it easier to start using RocksDB

Summary:
This diff addresses multiple things with a single goal -- to make RocksDB easier to use:
* Add some functions to Options that make RocksDB easier to tune.
* Add example code for both simple RocksDB and RocksDB with Column Families.
* Rewrite our README.md

Regarding Options, I took a stab at something we talked about for a long time:
* https://www.facebook.com/groups/rocksdb.dev/permalink/563169950448190/

I added functions:
* IncreaseParallelism() -- easy, increases the thread pool and max_background_compactions
* OptimizeLevelStyleCompaction(memtable_memory_budget) -- the easiest way to tune RocksDB for fewer stalls with level style compaction. This is very likely not an ideal configuration; feel free to suggest improvements. I used some of Mark's suggestions from here: https://github.com/facebook/rocksdb/issues/54
* OptimizeUniversalStyleCompaction(memtable_memory_budget) -- optimize for universal compaction.

Test Plan: Compiled RocksDB and ran the examples.

Reviewers: dhruba, MarkCallaghan, haobo, sdong, yhchiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18621
This commit is contained in:
Igor Canadi 2014-05-10 10:49:33 -07:00
parent acd17fd002
commit 038a477b53
9 changed files with 246 additions and 82 deletions

README

@ -1,82 +0,0 @@
rocksdb: A persistent key-value store for flash storage
Authors: * The Facebook Database Engineering Team
* Build on earlier work on leveldb by Sanjay Ghemawat
(sanjay@google.com) and Jeff Dean (jeff@google.com)
This code is a library that forms the core building block for a fast
key value server, especially suited for storing data on flash drives.
It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs
between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF)
and Space-Amplification-Factor (SAF). It has multi-threaded compactions,
making it especially suitable for storing multiple terabytes of data in a
single database.
The core of this code has been derived from open-source leveldb.
The code under this directory implements a system for maintaining a
persistent key/value store.
See doc/index.html and github wiki (https://github.com/facebook/rocksdb/wiki)
for more explanation.
The public interface is in include/*. Callers should not include or
rely on the details of any other header files in this package. Those
internal APIs may be changed without warning.
Guide to header files:
include/rocksdb/db.h
Main interface to the DB: Start here
include/rocksdb/options.h
Control over the behavior of an entire database, and also
control over the behavior of individual reads and writes.
include/rocksdb/comparator.h
Abstraction for user-specified comparison function. If you want
just bytewise comparison of keys, you can use the default comparator,
but clients can write their own comparator implementations if they
want custom ordering (e.g. to handle different character
encodings, etc.)
include/rocksdb/iterator.h
Interface for iterating over data. You can get an iterator
from a DB object.
include/rocksdb/write_batch.h
Interface for atomically applying multiple updates to a database.
include/rocksdb/slice.h
A simple module for maintaining a pointer and a length into some
other byte array.
include/rocksdb/status.h
Status is returned from many of the public interfaces and is used
to report success and various kinds of errors.
include/rocksdb/env.h
Abstraction of the OS environment. A posix implementation of
this interface is in util/env_posix.cc
include/rocksdb/table_builder.h
Lower-level modules that most clients probably won't use directly
include/rocksdb/cache.h
An API for the block cache.
include/rocksdb/compaction_filter.h
An API for an application filter invoked on every compaction.
include/rocksdb/filter_policy.h
An API for configuring a bloom filter.
include/rocksdb/memtablerep.h
An API for implementing a memtable.
include/rocksdb/statistics.h
An API to retrieve various database statistics.
include/rocksdb/transaction_log.h
An API to retrieve transaction logs from a database.
Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/

README.md (new file)

@ -0,0 +1,24 @@
## RocksDB: A Persistent Key-Value Store for Flash and RAM Storage
RocksDB is developed and maintained by the Facebook Database Engineering Team.
It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com)
and Jeff Dean (jeff@google.com).
This code is a library that forms the core building block for a fast
key-value server, especially suited for storing data on flash drives.
It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs
between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF)
and Space-Amplification-Factor (SAF). It has multi-threaded compactions,
making it especially suitable for storing multiple terabytes of data in a
single database.
Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples
See [doc/index.html](https://github.com/facebook/rocksdb/blob/master/doc/index.html) and
[github wiki](https://github.com/facebook/rocksdb/wiki) for more explanation.
The public interface is in `include/`. Callers should not include or
rely on the details of any other header files in this package. Those
internal APIs may be changed without warning.
Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/

examples/.gitignore (new file)

@ -0,0 +1,2 @@
column_families_example
simple_example

examples/Makefile (new file)

@ -0,0 +1,9 @@
include ../build_config.mk

all: simple_example column_families_example

simple_example: simple_example.cc
	$(CXX) $(CXXFLAGS) $@.cc -o$@ ../librocksdb.a -I../include -O2 -std=c++11 $(PLATFORM_LDFLAGS) $(PLATFORM_CXXFLAGS) $(EXEC_LDFLAGS)

column_families_example: column_families_example.cc
	$(CXX) $(CXXFLAGS) $@.cc -o$@ ../librocksdb.a -I../include -O2 -std=c++11 $(PLATFORM_LDFLAGS) $(PLATFORM_CXXFLAGS) $(EXEC_LDFLAGS)

examples/README.md (new file)

@ -0,0 +1 @@
Compile RocksDB first by executing `make static_lib` in the parent directory.

examples/column_families_example.cc (new file)

@ -0,0 +1,72 @@
// Copyright (c) 2013, Facebook, Inc. All rights reserved.
// This source code is licensed under the BSD-style license found in the
// LICENSE file in the root directory of this source tree. An additional grant
// of patent rights can be found in the PATENTS file in the same directory.

#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/slice.h"
#include "rocksdb/options.h"

using namespace rocksdb;

std::string kDBPath = "/tmp/rocksdb_column_families_example";

int main() {
  // open DB
  Options options;
  options.create_if_missing = true;
  DB* db;
  Status s = DB::Open(options, kDBPath, &db);
  assert(s.ok());

  // create column family
  ColumnFamilyHandle* cf;
  s = db->CreateColumnFamily(ColumnFamilyOptions(), "new_cf", &cf);
  assert(s.ok());

  // close DB
  delete cf;
  delete db;

  // open DB with two column families
  std::vector<ColumnFamilyDescriptor> column_families;
  // have to open the default column family
  column_families.push_back(ColumnFamilyDescriptor(
      kDefaultColumnFamilyName, ColumnFamilyOptions()));
  // open the new one, too
  column_families.push_back(ColumnFamilyDescriptor(
      "new_cf", ColumnFamilyOptions()));
  std::vector<ColumnFamilyHandle*> handles;
  s = DB::Open(DBOptions(), kDBPath, column_families, &handles, &db);
  assert(s.ok());

  // put and get from non-default column family
  s = db->Put(WriteOptions(), handles[1], Slice("key"), Slice("value"));
  assert(s.ok());
  std::string value;
  s = db->Get(ReadOptions(), handles[1], Slice("key"), &value);
  assert(s.ok());

  // atomic write across column families
  WriteBatch batch;
  batch.Put(handles[0], Slice("key2"), Slice("value2"));
  batch.Put(handles[1], Slice("key3"), Slice("value3"));
  batch.Delete(handles[0], Slice("key"));
  s = db->Write(WriteOptions(), &batch);
  assert(s.ok());

  // drop column family
  s = db->DropColumnFamily(handles[1]);
  assert(s.ok());

  // close db
  for (auto handle : handles) {
    delete handle;
  }
  delete db;

  return 0;
}

examples/simple_example.cc (new file)

@ -0,0 +1,41 @@
// Copyright (c) 2013, Facebook, Inc. All rights reserved.
// This source code is licensed under the BSD-style license found in the
// LICENSE file in the root directory of this source tree. An additional grant
// of patent rights can be found in the PATENTS file in the same directory.

#include <cassert>
#include <cstdio>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/slice.h"
#include "rocksdb/options.h"

using namespace rocksdb;

std::string kDBPath = "/tmp/rocksdb_simple_example";

int main() {
  DB* db;
  Options options;
  // Optimize RocksDB. This is the easiest way to get RocksDB to perform well
  options.IncreaseParallelism();
  options.OptimizeLevelStyleCompaction();
  // create the DB if it's not already present
  options.create_if_missing = true;

  // open DB
  Status s = DB::Open(options, kDBPath, &db);
  assert(s.ok());

  // Put key-value
  s = db->Put(WriteOptions(), "key", "value");
  assert(s.ok());

  // get value
  std::string value;
  s = db->Get(ReadOptions(), "key", &value);
  assert(s.ok());
  assert(value == "value");

  delete db;
  return 0;
}

include/rocksdb/options.h

@ -76,6 +76,29 @@ enum UpdateStatus { // Return status For inplace update callback
struct Options;
struct ColumnFamilyOptions {
  // Some functions that make it easier to optimize RocksDB

  // Use this if you don't need to keep the data sorted, i.e. you'll never use
  // an iterator, only Put() and Get() API calls
  ColumnFamilyOptions* OptimizeForPointLookup();

  // Default values for some parameters in ColumnFamilyOptions are not
  // optimized for heavy workloads and big datasets, which means you might
  // observe write stalls under some conditions. As a starting point for tuning
  // RocksDB options, use the following two functions:
  // * OptimizeLevelStyleCompaction -- optimizes level style compaction
  // * OptimizeUniversalStyleCompaction -- optimizes universal style compaction
  // Universal style compaction is focused on reducing Write Amplification
  // Factor for big data sets, but increases Space Amplification. You can learn
  // more about the different styles here:
  // https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide
  // Note: we might use more memory than memtable_memory_budget during a
  // period of high write rate
  ColumnFamilyOptions* OptimizeLevelStyleCompaction(
      uint64_t memtable_memory_budget = 512 * 1024 * 1024);
  ColumnFamilyOptions* OptimizeUniversalStyleCompaction(
      uint64_t memtable_memory_budget = 512 * 1024 * 1024);

  // -------------------
  // Parameters that affect behavior
@ -336,6 +359,7 @@ struct ColumnFamilyOptions {
  // With bloomfilter and fast storage, a miss on one level
  // is very cheap if the file handle is cached in table cache
  // (which is true if max_open_files is large).
  // Default: true
  bool disable_seek_compaction;

  // Puts are delayed 0-1 ms when any level has a compaction score that exceeds
@ -546,6 +570,15 @@ struct ColumnFamilyOptions {
};

struct DBOptions {
  // Some functions that make it easier to optimize RocksDB

  // By default, RocksDB uses only one background thread for flush and
  // compaction. Calling this function will set it up such that a total of
  // `total_threads` threads is used. A good value for `total_threads` is the
  // number of cores. You almost definitely want to call this function if your
  // system is bottlenecked by RocksDB.
  DBOptions* IncreaseParallelism(int total_threads = 16);

  // If true, the database will be created if it is missing.
  // Default: false
  bool create_if_missing;

util/options.cc

@ -480,4 +480,68 @@ Options::PrepareForBulkLoad()
return this;
}
// Optimization functions
ColumnFamilyOptions* ColumnFamilyOptions::OptimizeForPointLookup() {
  prefix_extractor.reset(NewNoopTransform());
  BlockBasedTableOptions block_based_options;
  block_based_options.index_type = BlockBasedTableOptions::kBinarySearch;
  table_factory.reset(new BlockBasedTableFactory(block_based_options));
  memtable_factory.reset(NewHashLinkListRepFactory());
  return this;
}

ColumnFamilyOptions* ColumnFamilyOptions::OptimizeLevelStyleCompaction(
    uint64_t memtable_memory_budget) {
  write_buffer_size = memtable_memory_budget / 4;
  // merge two memtables when flushing to L0
  min_write_buffer_number_to_merge = 2;
  // this means we'll use 50% extra memory in the worst case, but will reduce
  // write stalls.
  max_write_buffer_number = 6;
  // start flushing L0->L1 as soon as possible. each file on level0 is
  // (memtable_memory_budget / 2). This will flush level 0 when it's bigger
  // than memtable_memory_budget.
  level0_file_num_compaction_trigger = 2;
  // doesn't really matter much, but we don't want to create too many files
  target_file_size_base = memtable_memory_budget / 8;
  // make Level1 size equal to Level0 size, so that L0->L1 compactions are fast
  max_bytes_for_level_base = memtable_memory_budget;
  // level style compaction
  compaction_style = kCompactionStyleLevel;
  // only compress levels >= 2
  compression_per_level.resize(num_levels);
  for (int i = 0; i < num_levels; ++i) {
    if (i < 2) {
      compression_per_level[i] = kNoCompression;
    } else {
      compression_per_level[i] = kSnappyCompression;
    }
  }
  return this;
}

ColumnFamilyOptions* ColumnFamilyOptions::OptimizeUniversalStyleCompaction(
    uint64_t memtable_memory_budget) {
  write_buffer_size = memtable_memory_budget / 4;
  // merge two memtables when flushing to L0
  min_write_buffer_number_to_merge = 2;
  // this means we'll use 50% extra memory in the worst case, but will reduce
  // write stalls.
  max_write_buffer_number = 6;
  // universal style compaction
  compaction_style = kCompactionStyleUniversal;
  compaction_options_universal.compression_size_percent = 80;
  return this;
}

DBOptions* DBOptions::IncreaseParallelism(int total_threads) {
  // one thread for the high-priority (flush) pool, the rest for compactions
  max_background_compactions = total_threads - 1;
  max_background_flushes = 1;
  env->SetBackgroundThreads(total_threads, Env::LOW);
  env->SetBackgroundThreads(1, Env::HIGH);
  return this;
}
} // namespace rocksdb