Make it easier to start using RocksDB

Summary:
This diff addresses multiple things with a single goal -- to make RocksDB easier to use:
* Add some functions to Options that make RocksDB easier to tune.
* Add example code for both simple RocksDB and RocksDB with Column Families.
* Rewrite our README.md

Regarding Options, I took a stab at something we talked about for a long time:
* https://www.facebook.com/groups/rocksdb.dev/permalink/563169950448190/

I added functions:
* IncreaseParallelism() -- easy, increases the thread pool and max_background_compactions
* OptimizeLevelStyleCompaction(memtable_memory_budget) -- the easiest way to tune RocksDB for fewer stalls with level style compaction. This is very likely not an ideal configuration; feel free to suggest improvements. I used some of Mark's suggestions from here: https://github.com/facebook/rocksdb/issues/54
* OptimizeUniversalStyleCompaction(memtable_memory_budget) -- optimize for universal compaction.

Test Plan: Compiled RocksDB and ran the examples.

Reviewers: dhruba, MarkCallaghan, haobo, sdong, yhchiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18621
This commit is contained in:
Igor Canadi 2014-05-10 10:49:33 -07:00
parent acd17fd002
commit 038a477b53
9 changed files with 246 additions and 82 deletions

README

@ -1,82 +0,0 @@
rocksdb: A persistent key-value store for flash storage
Authors: * The Facebook Database Engineering Team
* Build on earlier work on leveldb by Sanjay Ghemawat
(sanjay@google.com) and Jeff Dean (jeff@google.com)
This code is a library that forms the core building block for a fast
key value server, especially suited for storing data on flash drives.
It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs
between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF)
and Space-Amplification-Factor (SAF). It has multi-threaded compactions,
making it especially suitable for storing multiple terabytes of data in a
single database.
The core of this code has been derived from open-source leveldb.
The code under this directory implements a system for maintaining a
persistent key/value store.
See doc/index.html and github wiki (https://github.com/facebook/rocksdb/wiki)
for more explanation.
The public interface is in include/*. Callers should not include or
rely on the details of any other header files in this package. Those
internal APIs may be changed without warning.
Guide to header files:
include/rocksdb/db.h
Main interface to the DB: Start here
include/rocksdb/options.h
Control over the behavior of an entire database, and also
control over the behavior of individual reads and writes.
include/rocksdb/comparator.h
Abstraction for user-specified comparison function. If you want
just bytewise comparison of keys, you can use the default comparator,
but clients can write their own comparator implementations if they
want custom ordering (e.g. to handle different character
encodings, etc.)
include/rocksdb/iterator.h
Interface for iterating over data. You can get an iterator
from a DB object.
include/rocksdb/write_batch.h
Interface for atomically applying multiple updates to a database.
include/rocksdb/slice.h
A simple module for maintaining a pointer and a length into some
other byte array.
include/rocksdb/status.h
Status is returned from many of the public interfaces and is used
to report success and various kinds of errors.
include/rocksdb/env.h
Abstraction of the OS environment. A posix implementation of
this interface is in util/env_posix.cc
include/rocksdb/table_builder.h
Lower-level modules that most clients probably won't use directly
include/rocksdb/cache.h
An API for the block cache.
include/rocksdb/compaction_filter.h
An API for an application filter invoked on every compaction.
include/rocksdb/filter_policy.h
An API for configuring a bloom filter.
include/rocksdb/memtablerep.h
An API for implementing a memtable.
include/rocksdb/statistics.h
An API to retrieve various database statistics.
include/rocksdb/transaction_log.h
An API to retrieve transaction logs from a database.
Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/

README.md (new file)

@ -0,0 +1,24 @@
## RocksDB: A Persistent Key-Value Store for Flash and RAM Storage
RocksDB is developed and maintained by the Facebook Database Engineering Team.
It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com)
and Jeff Dean (jeff@google.com).
This code is a library that forms the core building block for a fast
key-value server, especially suited for storing data on flash drives.
It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs
between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF)
and Space-Amplification-Factor (SAF). It has multi-threaded compactions,
making it especially suitable for storing multiple terabytes of data in a
single database.
Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples
See [doc/index.html](https://github.com/facebook/rocksdb/blob/master/doc/index.html) and
[github wiki](https://github.com/facebook/rocksdb/wiki) for more explanation.
The public interface is in `include/`. Callers should not include or
rely on the details of any other header files in this package. Those
internal APIs may be changed without warning.
Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/

examples/.gitignore (new file)

@ -0,0 +1,2 @@
column_families_example
simple_example

examples/Makefile (new file)

@ -0,0 +1,9 @@
include ../build_config.mk

all: simple_example column_families_example

simple_example: simple_example.cc
	$(CXX) $(CXXFLAGS) $@.cc -o$@ ../librocksdb.a -I../include -O2 -std=c++11 $(PLATFORM_LDFLAGS) $(PLATFORM_CXXFLAGS) $(EXEC_LDFLAGS)

column_families_example: column_families_example.cc
	$(CXX) $(CXXFLAGS) $@.cc -o$@ ../librocksdb.a -I../include -O2 -std=c++11 $(PLATFORM_LDFLAGS) $(PLATFORM_CXXFLAGS) $(EXEC_LDFLAGS)

examples/README.md (new file)

@ -0,0 +1 @@
Compile RocksDB first by executing `make static_lib` in the parent directory.

examples/column_families_example.cc (new file)

@ -0,0 +1,72 @@
// Copyright (c) 2013, Facebook, Inc. All rights reserved.
// This source code is licensed under the BSD-style license found in the
// LICENSE file in the root directory of this source tree. An additional grant
// of patent rights can be found in the PATENTS file in the same directory.

#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/slice.h"
#include "rocksdb/options.h"

using namespace rocksdb;

std::string kDBPath = "/tmp/rocksdb_column_families_example";

int main() {
  // open DB
  Options options;
  options.create_if_missing = true;
  DB* db;
  Status s = DB::Open(options, kDBPath, &db);
  assert(s.ok());

  // create column family
  ColumnFamilyHandle* cf;
  s = db->CreateColumnFamily(ColumnFamilyOptions(), "new_cf", &cf);
  assert(s.ok());

  // close DB
  delete cf;
  delete db;

  // open DB with two column families
  std::vector<ColumnFamilyDescriptor> column_families;
  // have to open the default column family
  column_families.push_back(ColumnFamilyDescriptor(
      kDefaultColumnFamilyName, ColumnFamilyOptions()));
  // open the new one, too
  column_families.push_back(ColumnFamilyDescriptor(
      "new_cf", ColumnFamilyOptions()));
  std::vector<ColumnFamilyHandle*> handles;
  s = DB::Open(DBOptions(), kDBPath, column_families, &handles, &db);
  assert(s.ok());

  // put and get from non-default column family
  s = db->Put(WriteOptions(), handles[1], Slice("key"), Slice("value"));
  assert(s.ok());
  std::string value;
  s = db->Get(ReadOptions(), handles[1], Slice("key"), &value);
  assert(s.ok());

  // atomic write across column families
  WriteBatch batch;
  batch.Put(handles[0], Slice("key2"), Slice("value2"));
  batch.Put(handles[1], Slice("key3"), Slice("value3"));
  batch.Delete(handles[0], Slice("key"));
  s = db->Write(WriteOptions(), &batch);
  assert(s.ok());

  // drop column family
  s = db->DropColumnFamily(handles[1]);
  assert(s.ok());

  // close db
  for (auto handle : handles) {
    delete handle;
  }
  delete db;

  return 0;
}

examples/simple_example.cc (new file)

@ -0,0 +1,41 @@
// Copyright (c) 2013, Facebook, Inc. All rights reserved.
// This source code is licensed under the BSD-style license found in the
// LICENSE file in the root directory of this source tree. An additional grant
// of patent rights can be found in the PATENTS file in the same directory.

#include <cassert>
#include <cstdio>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/slice.h"
#include "rocksdb/options.h"

using namespace rocksdb;

std::string kDBPath = "/tmp/rocksdb_simple_example";

int main() {
  DB* db;
  Options options;
  // Optimize RocksDB. This is the easiest way to get RocksDB to perform well
  options.IncreaseParallelism();
  options.OptimizeLevelStyleCompaction();
  // create the DB if it's not already present
  options.create_if_missing = true;

  // open DB
  Status s = DB::Open(options, kDBPath, &db);
  assert(s.ok());

  // Put key-value
  s = db->Put(WriteOptions(), "key", "value");
  assert(s.ok());

  // get value
  std::string value;
  s = db->Get(ReadOptions(), "key", &value);
  assert(s.ok());
  assert(value == "value");

  delete db;
  return 0;
}

include/rocksdb/options.h

@ -76,6 +76,29 @@ enum UpdateStatus { // Return status For inplace update callback
struct Options;
struct ColumnFamilyOptions {
  // Some functions that make it easier to optimize RocksDB

  // Use this if you don't need to keep the data sorted, i.e. you'll never use
  // an iterator, only Put() and Get() API calls
  ColumnFamilyOptions* OptimizeForPointLookup();

  // Default values for some parameters in ColumnFamilyOptions are not
  // optimized for heavy workloads and big datasets, which means you might
  // observe write stalls under some conditions. As a starting point for tuning
  // RocksDB options, use the following two functions:
  // * OptimizeLevelStyleCompaction -- optimizes level style compaction
  // * OptimizeUniversalStyleCompaction -- optimizes universal style compaction
  // Universal style compaction is focused on reducing Write Amplification
  // Factor for big data sets, but increases Space Amplification. You can learn
  // more about the different styles here:
  // https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide
  // Note: we might use more memory than memtable_memory_budget during a
  // period of high write rate
  ColumnFamilyOptions* OptimizeLevelStyleCompaction(
      uint64_t memtable_memory_budget = 512 * 1024 * 1024);
  ColumnFamilyOptions* OptimizeUniversalStyleCompaction(
      uint64_t memtable_memory_budget = 512 * 1024 * 1024);

  // -------------------
  // Parameters that affect behavior
@ -336,6 +359,7 @@ struct ColumnFamilyOptions {
  // With bloomfilter and fast storage, a miss on one level
  // is very cheap if the file handle is cached in table cache
  // (which is true if max_open_files is large).
  // Default: true
  bool disable_seek_compaction;

  // Puts are delayed 0-1 ms when any level has a compaction score that exceeds
@ -546,6 +570,15 @@ struct ColumnFamilyOptions {
};

struct DBOptions {
  // Some functions that make it easier to optimize RocksDB

  // By default, RocksDB uses only one background thread for flush and
  // compaction. Calling this function will set it up such that a total of
  // `total_threads` threads is used. A good value for `total_threads` is the
  // number of cores. You almost definitely want to call this function if your
  // system is bottlenecked by RocksDB.
  DBOptions* IncreaseParallelism(int total_threads = 16);

  // If true, the database will be created if it is missing.
  // Default: false
  bool create_if_missing;

util/options.cc

@ -480,4 +480,68 @@ Options::PrepareForBulkLoad()
return this;
}
// Optimization functions
ColumnFamilyOptions* ColumnFamilyOptions::OptimizeForPointLookup() {
  prefix_extractor.reset(NewNoopTransform());
  BlockBasedTableOptions block_based_options;
  block_based_options.index_type = BlockBasedTableOptions::kBinarySearch;
  table_factory.reset(new BlockBasedTableFactory(block_based_options));
  memtable_factory.reset(NewHashLinkListRepFactory());
  return this;
}

ColumnFamilyOptions* ColumnFamilyOptions::OptimizeLevelStyleCompaction(
    uint64_t memtable_memory_budget) {
  write_buffer_size = memtable_memory_budget / 4;
  // merge two memtables when flushing to L0
  min_write_buffer_number_to_merge = 2;
  // this means we'll use 50% extra memory in the worst case, but will reduce
  // write stalls.
  max_write_buffer_number = 6;
  // start flushing L0->L1 as soon as possible. each file on level0 is
  // (memtable_memory_budget / 2). This will flush level 0 when it's bigger
  // than memtable_memory_budget.
  level0_file_num_compaction_trigger = 2;
  // doesn't really matter much, but we don't want to create too many files
  target_file_size_base = memtable_memory_budget / 8;
  // make Level1 size equal to Level0 size, so that L0->L1 compactions are fast
  max_bytes_for_level_base = memtable_memory_budget;
  // level style compaction
  compaction_style = kCompactionStyleLevel;
  // only compress levels >= 2
  compression_per_level.resize(num_levels);
  for (int i = 0; i < num_levels; ++i) {
    if (i < 2) {
      compression_per_level[i] = kNoCompression;
    } else {
      compression_per_level[i] = kSnappyCompression;
    }
  }
  return this;
}

ColumnFamilyOptions* ColumnFamilyOptions::OptimizeUniversalStyleCompaction(
    uint64_t memtable_memory_budget) {
  write_buffer_size = memtable_memory_budget / 4;
  // merge two memtables when flushing to L0
  min_write_buffer_number_to_merge = 2;
  // this means we'll use 50% extra memory in the worst case, but will reduce
  // write stalls.
  max_write_buffer_number = 6;
  // universal style compaction
  compaction_style = kCompactionStyleUniversal;
  compaction_options_universal.compression_size_percent = 80;
  return this;
}

DBOptions* DBOptions::IncreaseParallelism(int total_threads) {
  // one thread for the high-priority (flush) pool, the rest for compactions
  max_background_compactions = total_threads - 1;
  max_background_flushes = 1;
  env->SetBackgroundThreads(total_threads, Env::LOW);
  env->SetBackgroundThreads(1, Env::HIGH);
  return this;
}
} // namespace rocksdb