2016-02-09 15:12:00 -08:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2014-01-22 11:44:53 -08:00
|
|
|
// This source code is licensed under the BSD-style license found in the
|
|
|
|
// LICENSE file in the root directory of this source tree. An additional grant
|
|
|
|
// of patent rights can be found in the PATENTS file in the same directory.
|
|
|
|
//
|
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
|
|
|
|
#pragma once
|
|
|
|
|
2014-01-24 18:40:05 -08:00
|
|
|
#include <unordered_map>
|
2014-01-22 11:44:53 -08:00
|
|
|
#include <string>
|
|
|
|
#include <vector>
|
2014-02-06 11:44:50 -08:00
|
|
|
#include <atomic>
|
2014-01-22 11:44:53 -08:00
|
|
|
|
2014-02-06 15:42:16 -08:00
|
|
|
#include "db/memtable_list.h"
|
2014-01-28 11:05:04 -08:00
|
|
|
#include "db/write_batch_internal.h"
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define a DB epoch as a period where there are no flushes and compactions (i.e. a new epoch is started when a flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With the new push model, overhead on the Write code path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
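The push model described in this commit message can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name `WriteControllerSketch`, the `OnEpochChange` method, and the level-0 file thresholds are hypothetical and do not reflect the actual interface declared in `db/write_controller.h`. The point is only that the mode is recomputed when an epoch ends, while each write performs a single load.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch of the push model; names and thresholds are
// illustrative and are not the actual WriteController interface.
enum class WriteMode { kProceed, kDelay, kStop };

class WriteControllerSketch {
 public:
  // Called only at an epoch change, i.e. when a flush or compaction finishes.
  void OnEpochChange(int level0_files, int slowdown_trigger,
                     int stop_trigger) {
    WriteMode mode = WriteMode::kProceed;
    if (level0_files >= stop_trigger) {
      mode = WriteMode::kStop;
    } else if (level0_files >= slowdown_trigger) {
      mode = WriteMode::kDelay;
    }
    mode_.store(mode, std::memory_order_release);
  }

  // Called on every write: one atomic load, no loop over column families.
  WriteMode CurrentMode() const {
    return mode_.load(std::memory_order_acquire);
  }

 private:
  std::atomic<WriteMode> mode_{WriteMode::kProceed};
};
```

In this sketch the cost of deciding the mode is paid once per flush or compaction, so it no longer grows with the number of writes.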
2014-09-08 11:20:25 -07:00
|
|
|
#include "db/write_controller.h"
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be a global LRU, so if some column families are used less often than others, we want them to be evicted from cache. However, the current TableCache object also constructs tables on its own. If a table is not found in the cache, TableCache automatically creates a new table. We want each column family to be able to specify a different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underlying cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. In the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 09:07:55 -08:00
|
|
|
#include "db/table_cache.h"
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users cannot tell from the TablePropertiesCollector callback whether a key is an add, delete or merge. Add a new function to expose it.
Also refactor the code so that
(1) the table property collector and the internal table property collector are two separate data structures, with the latter one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 10:04:30 -07:00
|
|
|
#include "db/table_properties_collector.h"
|
2015-06-02 17:07:16 -07:00
|
|
|
#include "rocksdb/compaction_job_stats.h"
|
|
|
|
#include "rocksdb/db.h"
|
|
|
|
#include "rocksdb/env.h"
|
|
|
|
#include "rocksdb/options.h"
|
2014-09-17 12:49:13 -07:00
|
|
|
#include "util/mutable_cf_options.h"
|
2015-02-04 21:39:45 -08:00
|
|
|
#include "util/thread_local.h"
|
2014-01-24 14:30:28 -08:00
|
|
|
|
2014-01-22 11:44:53 -08:00
|
|
|
namespace rocksdb {
|
|
|
|
|
|
|
|
class Version;
|
|
|
|
class VersionSet;
|
2014-01-24 14:30:28 -08:00
|
|
|
class MemTable;
|
|
|
|
class MemTableListVersion;
|
2014-01-31 15:30:27 -08:00
|
|
|
class CompactionPicker;
|
|
|
|
class Compaction;
|
|
|
|
class InternalKey;
|
2014-02-04 17:45:19 -08:00
|
|
|
class InternalStats;
|
2014-02-10 17:04:44 -08:00
|
|
|
class ColumnFamilyData;
|
|
|
|
class DBImpl;
|
2014-03-10 17:25:10 -07:00
|
|
|
class LogBuffer;
|
2015-02-04 21:39:45 -08:00
|
|
|
class InstrumentedMutex;
|
|
|
|
class InstrumentedMutexLock;
|
2014-02-10 17:04:44 -08:00
|
|
|
|
When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and make sure without the commit, write stop will happen, but with the commit, it will not happen.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
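The greedy adjustment described in this commit message can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the commit: the struct `DelayedRateSketch`, its fields, and the clamping bounds are hypothetical, and the ratio is passed in rather than taken from `kSlowdownRatio`, whose value is not shown in this header.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: adjust the delayed write rate after each change in
// the estimated pending compaction bytes. All names are illustrative.
struct DelayedRateSketch {
  double ratio;                 // assumed to be greater than 1
  uint64_t min_rate;            // lower clamp, an assumption of this sketch
  uint64_t max_rate;            // upper clamp, an assumption of this sketch
  uint64_t rate;                // current write rate in bytes per second
  uint64_t prev_pending_bytes;  // estimate seen at the previous update

  void Update(uint64_t pending_bytes) {
    if (pending_bytes > prev_pending_bytes) {
      // Compaction debt is growing: slow writers down.
      rate = std::max(min_rate, static_cast<uint64_t>(rate / ratio));
    } else if (pending_bytes < prev_pending_bytes) {
      // Compaction is catching up: let writers speed up again.
      rate = std::min(max_rate, static_cast<uint64_t>(rate * ratio));
    }
    prev_pending_bytes = pending_bytes;
  }
};
```

An unchanged estimate leaves the rate as it is, so the rate only moves when the compaction backlog moves.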
2015-12-17 17:07:44 -08:00
|
|
|
extern const double kSlowdownRatio;
|
|
|
|
|
2014-03-11 14:52:17 -07:00
|
|
|
// ColumnFamilyHandleImpl is the class that clients use to access different
|
|
|
|
// column families. It has non-trivial destructor, which gets called when client
|
|
|
|
// is done using the column family
|
2014-02-10 17:04:44 -08:00
|
|
|
class ColumnFamilyHandleImpl : public ColumnFamilyHandle {
|
|
|
|
public:
|
|
|
|
// create while holding the mutex
|
2015-02-04 21:39:45 -08:00
|
|
|
ColumnFamilyHandleImpl(
|
|
|
|
ColumnFamilyData* cfd, DBImpl* db, InstrumentedMutex* mutex);
|
2014-02-10 17:04:44 -08:00
|
|
|
// destroy without mutex
|
|
|
|
virtual ~ColumnFamilyHandleImpl();
|
|
|
|
virtual ColumnFamilyData* cfd() const { return cfd_; }
|
2014-09-22 11:37:35 -07:00
|
|
|
virtual const Comparator* user_comparator() const;
|
2014-02-10 17:04:44 -08:00
|
|
|
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual uint32_t GetID() const override;
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* These APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happen.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
2014-11-07 14:45:18 -08:00
|
|
|
virtual const std::string& GetName() const override;
|
2016-01-06 18:14:01 -08:00
|
|
|
virtual Status GetDescriptor(ColumnFamilyDescriptor* desc) override;
|
2014-02-25 17:30:54 -08:00
|
|
|
|
2014-02-10 17:04:44 -08:00
|
|
|
private:
|
|
|
|
ColumnFamilyData* cfd_;
|
|
|
|
DBImpl* db_;
|
2015-02-04 21:39:45 -08:00
|
|
|
InstrumentedMutex* mutex_;
|
2014-02-10 17:04:44 -08:00
|
|
|
};
|
|
|
|
|
2014-03-11 14:52:17 -07:00
|
|
|
// Does not ref-count ColumnFamilyData
|
|
|
|
// We use this dummy ColumnFamilyHandleImpl because sometimes MemTableInserter
|
|
|
|
// calls DBImpl methods. When this happens, MemTableInserter needs access to
|
|
|
|
// ColumnFamilyHandle (same as the client would need). In that case, we feed
|
|
|
|
// MemTableInserter dummy ColumnFamilyHandle and enable it to call DBImpl
|
|
|
|
// methods
|
2014-02-10 17:04:44 -08:00
|
|
|
class ColumnFamilyHandleInternal : public ColumnFamilyHandleImpl {
|
|
|
|
public:
|
|
|
|
ColumnFamilyHandleInternal()
|
|
|
|
: ColumnFamilyHandleImpl(nullptr, nullptr, nullptr) {}
|
|
|
|
|
2014-11-06 11:14:28 -08:00
|
|
|
void SetCFD(ColumnFamilyData* _cfd) { internal_cfd_ = _cfd; }
|
2014-02-10 17:04:44 -08:00
|
|
|
virtual ColumnFamilyData* cfd() const override { return internal_cfd_; }
|
|
|
|
|
|
|
|
private:
|
|
|
|
ColumnFamilyData* internal_cfd_;
|
|
|
|
};
|
2014-01-24 14:30:28 -08:00
|
|
|
|
|
|
|
// holds references to memtable, all immutable memtables and version
|
|
|
|
struct SuperVersion {
|
2015-04-08 21:10:35 -07:00
|
|
|
// Accessing members of this class is not thread-safe and requires external
|
|
|
|
// synchronization (i.e. DB mutex held or on write thread).
|
2014-01-24 14:30:28 -08:00
|
|
|
MemTable* mem;
|
|
|
|
MemTableListVersion* imm;
|
|
|
|
Version* current;
|
2014-09-17 12:49:13 -07:00
|
|
|
MutableCFOptions mutable_cf_options;
|
2014-03-03 17:54:04 -08:00
|
|
|
// Version number of the current SuperVersion
|
|
|
|
uint64_t version_number;
|
2015-04-08 21:10:35 -07:00
|
|
|
|
2015-02-04 21:39:45 -08:00
|
|
|
InstrumentedMutex* db_mutex;
|
2014-01-24 14:30:28 -08:00
|
|
|
|
|
|
|
// should be called outside the mutex
|
2014-02-12 14:01:30 -08:00
|
|
|
SuperVersion() = default;
|
2014-01-24 14:30:28 -08:00
|
|
|
~SuperVersion();
|
|
|
|
SuperVersion* Ref();
|
2015-04-08 21:10:35 -07:00
|
|
|
// If Unref() returns true, Cleanup() should be called with mutex held
|
|
|
|
// before deleting this SuperVersion.
|
2014-01-24 14:30:28 -08:00
|
|
|
bool Unref();
|
|
|
|
|
|
|
|
// call these two methods with db mutex held
|
|
|
|
// Cleanup unrefs mem, imm and current. Also, it stores all memtables
|
|
|
|
// that need to be deleted in to_delete vector. Unrefing those
|
|
|
|
// objects needs to be done in the mutex
|
|
|
|
void Cleanup();
|
|
|
|
void Init(MemTable* new_mem, MemTableListVersion* new_imm,
|
|
|
|
Version* new_current);
|
2014-03-07 16:59:47 -08:00
|
|
|
|
|
|
|
// The value of dummy is not actually used. kSVInUse takes its address as a
|
|
|
|
// mark in the thread local storage to indicate the SuperVersion is in use
|
|
|
|
// by thread. This way, the value of kSVInUse is guaranteed to have no
|
|
|
|
// conflict with SuperVersion object address and portable on different
|
|
|
|
// platforms.
|
|
|
|
static int dummy;
|
|
|
|
static void* const kSVInUse;
|
|
|
|
static void* const kSVObsolete;
|
2015-04-08 21:10:35 -07:00
|
|
|
|
|
|
|
private:
|
|
|
|
std::atomic<uint32_t> refs;
|
|
|
|
// We need to_delete because during Cleanup(), imm->Unref() returns
|
|
|
|
// all memtables that we need to free through this vector. We then
|
|
|
|
// delete all those memtables outside of mutex, during destruction
|
|
|
|
autovector<MemTable*> to_delete;
|
2014-01-24 14:30:28 -08:00
|
|
|
};
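The comments in SuperVersion describe a three-step release protocol: drop a reference, run Cleanup() with the DB mutex held if that was the last reference, then delete the object outside the mutex. The following is a minimal sketch of that protocol only. `RefCountedSketch` and `Release` are hypothetical names, `std::mutex` stands in for `InstrumentedMutex`, and the memory ordering is a choice made for this sketch rather than a statement about the real implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a SuperVersion-like object.
struct RefCountedSketch {
  std::atomic<uint32_t> refs{0};
  bool cleaned_up = false;

  RefCountedSketch* Ref() {
    refs.fetch_add(1, std::memory_order_relaxed);
    return this;
  }

  // Returns true when the caller dropped the last reference.
  bool Unref() {
    uint32_t previous = refs.fetch_sub(1, std::memory_order_acq_rel);
    assert(previous > 0);
    return previous == 1;
  }

  // Must be called with the mutex held; here it only records the call.
  void Cleanup() { cleaned_up = true; }
};

// Drops one reference. Returns true if the object was destroyed.
inline bool Release(RefCountedSketch* sv, std::mutex* db_mutex) {
  if (!sv->Unref()) {
    return false;
  }
  {
    std::lock_guard<std::mutex> lock(*db_mutex);
    sv->Cleanup();  // unref of owned members happens under the mutex
  }
  delete sv;  // destruction itself happens outside the mutex
  return true;
}
```

Keeping the `delete` outside the lock mirrors the stated reason for the `to_delete` vector: memory is freed without holding the DB mutex.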
|
2014-01-22 11:44:53 -08:00
|
|
|
|
2015-06-18 14:55:05 -07:00
|
|
|
extern Status CheckCompressionSupported(const ColumnFamilyOptions& cf_options);
|
|
|
|
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 16:59:07 -07:00
|
|
|
extern Status CheckConcurrentWritesSupported(
|
|
|
|
const ColumnFamilyOptions& cf_options);
|
|
|
|
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
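One way to read the range given in this commit message is sketched below: start from the size of the largest level and divide by the multiplier until the result no longer exceeds max_bytes_for_level_base. This is a simplified illustration, not the algorithm from the commit. The function name `PickDynamicLevelBase` is hypothetical, sizes are plain doubles, the multiplier is assumed to be greater than 1, and a database whose largest level is already below base/multiplier is simply returned unchanged.

```cpp
#include <cassert>

// Hypothetical sketch: derive a base level size from the largest level so
// that consecutive levels differ by exactly `multiplier`.
// Assumes multiplier > 1 and largest_level_bytes > 0.
inline double PickDynamicLevelBase(double largest_level_bytes,
                                   double max_bytes_for_level_base,
                                   double multiplier) {
  double candidate = largest_level_bytes;
  // Each division steps one level up from the largest level.
  while (candidate > max_bytes_for_level_base) {
    candidate /= multiplier;
  }
  // When the loop ran at least once, the result lies in
  // (max_bytes_for_level_base / multiplier, max_bytes_for_level_base].
  return candidate;
}
```

For example, with a base of 256, a multiplier of 10 and a largest level of 30000 (all in the same unit), the sketch steps through 3000 and 300 and settles on 30, which lies inside (25.6, 256].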
2015-02-05 11:44:17 -08:00
|
|
|
extern ColumnFamilyOptions SanitizeOptions(const DBOptions& db_options,
|
|
|
|
const InternalKeyComparator* icmp,
|
|
|
|
const ColumnFamilyOptions& src);
|
2015-04-06 10:04:30 -07:00
|
|
|
// Wrap user-defined table properties collector factories from `cf_options`
|
|
|
|
// into internal ones in int_tbl_prop_collector_factories. Add a system internal
|
|
|
|
// one too.
|
|
|
|
extern void GetIntTblPropCollectorFactory(
|
|
|
|
const ColumnFamilyOptions& cf_options,
|
|
|
|
std::vector<std::unique_ptr<IntTblPropCollectorFactory>>*
|
|
|
|
int_tbl_prop_collector_factories);
|
2014-02-04 16:31:18 -08:00
|
|
|
|
2014-02-10 17:04:44 -08:00
|
|
|
class ColumnFamilySet;
|
|
|
|
|
2015-01-06 12:44:21 -08:00
|
|
|
// This class keeps all the data that a column family needs.
|
2014-03-11 14:52:17 -07:00
|
|
|
// Most methods require DB mutex held, unless otherwise noted
|
2014-01-29 13:28:50 -08:00
|
|
|
class ColumnFamilyData {
|
|
|
|
public:
|
2014-02-10 17:04:44 -08:00
|
|
|
~ColumnFamilyData();
|
|
|
|
|
2014-03-11 14:52:17 -07:00
|
|
|
// thread-safe
|
2014-01-29 13:28:50 -08:00
|
|
|
uint32_t GetID() const { return id_; }
|
2014-03-11 14:52:17 -07:00
|
|
|
// thread-safe
|
|
|
|
const std::string& GetName() const { return name_; }
|
2014-01-29 13:28:50 -08:00
|
|
|
|
2015-08-14 16:59:07 -07:00
|
|
|
// Ref() can only be called from a context where the caller can guarantee
|
|
|
|
// that ColumnFamilyData is alive (while holding a non-zero ref already,
|
|
|
|
// holding a DB mutex, or as the leader in a write batch group).
|
2015-01-26 11:48:07 -08:00
|
|
|
void Ref() { refs_.fetch_add(1, std::memory_order_relaxed); }
|
2015-08-14 16:59:07 -07:00
|
|
|
|
|
|
|
// Unref decreases the reference count, but does not handle deletion
|
|
|
|
// when the count goes to 0. If this method returns true then the
|
|
|
|
// caller should delete the instance immediately, or later, by calling
|
|
|
|
// FreeDeadColumnFamilies(). Unref() can only be called while holding
|
|
|
|
// a DB mutex, or during single-threaded recovery.
|
2014-02-10 17:04:44 -08:00
|
|
|
bool Unref() {
|
2015-01-26 11:48:07 -08:00
|
|
|
int old_refs = refs_.fetch_sub(1, std::memory_order_relaxed);
|
|
|
|
assert(old_refs > 0);
|
|
|
|
return old_refs == 1;
|
2014-02-10 17:04:44 -08:00
|
|
|
}
|
|
|
|
|
2015-01-06 12:44:21 -08:00
|
|
|
// SetDropped() can only be called under the following conditions:
|
|
|
|
// 1) Holding a DB mutex,
|
|
|
|
// 2) from single-threaded write thread, AND
|
|
|
|
// 3) from single-threaded VersionSet::LogAndApply()
|
2014-03-11 14:52:17 -07:00
|
|
|
// After dropping column family no other operation on that column family
|
|
|
|
// will be executed. All the files and memory will be, however, kept around
|
|
|
|
// until client drops the column family handle. That way, client can still
|
|
|
|
// access data from dropped column family.
|
|
|
|
// Column family can be dropped and still alive. In that state:
|
|
|
|
// *) Compaction and flush are not executed on the dropped column family.
|
2015-01-06 12:44:21 -08:00
|
|
|
// *) Client can continue reading from column family. Writes will fail unless
|
|
|
|
// WriteOptions::ignore_missing_column_families is true
|
2014-03-11 14:52:17 -07:00
|
|
|
// When the dropped column family is unreferenced, then we:
|
2015-03-19 17:04:29 -07:00
|
|
|
// *) Remove column family from the linked list maintained by ColumnFamilySet
|
2014-03-11 14:52:17 -07:00
|
|
|
// *) delete all memory associated with that column family
|
|
|
|
// *) delete all the files associated with that column family
|
2015-01-06 12:44:21 -08:00
|
|
|
void SetDropped();
|
2014-03-11 14:52:17 -07:00
|
|
|
bool IsDropped() const { return dropped_; }
|
2014-02-10 17:04:44 -08:00
|
|
|
|
2014-03-11 14:52:17 -07:00
|
|
|
// thread-safe
|
2014-10-23 15:34:21 -07:00
|
|
|
int NumberLevels() const { return ioptions_.num_levels; }
|
2014-01-31 15:30:27 -08:00
|
|
|
|
2014-01-29 13:28:50 -08:00
|
|
|
void SetLogNumber(uint64_t log_number) { log_number_ = log_number; }
|
|
|
|
uint64_t GetLogNumber() const { return log_number_; }
|
|
|
|
|
2014-11-18 10:20:10 -08:00
|
|
|
// !!! To be deprecated! Please do not use this function anymore!
|
2014-09-17 12:49:13 -07:00
|
|
|
const Options* options() const { return &options_; }
|
2014-11-18 10:20:10 -08:00
|
|
|
|
|
|
|
// thread-safe
|
2014-04-14 10:48:01 -07:00
|
|
|
const EnvOptions* soptions() const;
|
2014-09-04 16:18:36 -07:00
|
|
|
const ImmutableCFOptions* ioptions() const { return &ioptions_; }
|
2014-09-17 12:49:13 -07:00
|
|
|
// REQUIRES: DB mutex held
|
|
|
|
// This returns the MutableCFOptions used by current SuperVersion
|
|
|
|
// You should use this API to reference MutableCFOptions most of the time.
|
2014-11-06 11:14:28 -08:00
|
|
|
const MutableCFOptions* GetCurrentMutableCFOptions() const {
|
2014-09-17 12:49:13 -07:00
|
|
|
return &(super_version_->mutable_cf_options);
|
|
|
|
}
|
|
|
|
// REQUIRES: DB mutex held
|
|
|
|
// This returns the latest MutableCFOptions, which may not be in effect yet.
|
|
|
|
const MutableCFOptions* GetLatestMutableCFOptions() const {
|
|
|
|
return &mutable_cf_options_;
|
|
|
|
}
|
2014-11-13 16:45:33 -05:00
|
|
|
#ifndef ROCKSDB_LITE
|
2014-09-17 12:49:13 -07:00
|
|
|
// REQUIRES: DB mutex held
|
2014-11-04 16:23:05 -08:00
|
|
|
Status SetOptions(
|
2014-09-17 12:49:13 -07:00
|
|
|
const std::unordered_map<std::string, std::string>& options_map);
|
2014-11-13 16:45:33 -05:00
|
|
|
#endif // ROCKSDB_LITE
|
2014-03-11 14:52:17 -07:00
|
|
|
|
|
|
|
InternalStats* internal_stats() { return internal_stats_.get(); }
|
2014-01-29 13:28:50 -08:00
|
|
|
|
|
|
|
MemTableList* imm() { return &imm_; }
|
|
|
|
MemTable* mem() { return mem_; }
|
|
|
|
Version* current() { return current_; }
|
|
|
|
Version* dummy_versions() { return dummy_versions_; }
|
2015-12-28 09:50:49 -08:00
|
|
|
void SetCurrent(Version* _current);
|
2015-02-11 17:10:43 -08:00
|
|
|
uint64_t GetNumLiveVersions() const; // REQUIRES: DB mutex held
|
2015-08-20 11:47:19 -07:00
|
|
|
uint64_t GetTotalSstFilesSize() const; // REQUIRES: DB mutex held
|
2014-12-02 12:09:20 -08:00
|
|
|
void SetMemtable(MemTable* new_mem) { mem_ = new_mem; }
|
2015-05-29 14:36:35 -07:00
|
|
|
|
|
|
|
// See Memtable constructor for explanation of earliest_seq param.
|
|
|
|
MemTable* ConstructNewMemtable(const MutableCFOptions& mutable_cf_options,
|
|
|
|
SequenceNumber earliest_seq);
|
|
|
|
void CreateNewMemtable(const MutableCFOptions& mutable_cf_options,
|
|
|
|
SequenceNumber earliest_seq);
|
2014-01-29 13:28:50 -08:00
|
|
|
|
2014-05-30 14:31:55 -07:00
|
|
|
TableCache* table_cache() const { return table_cache_.get(); }
|
2014-02-05 09:07:55 -08:00
|
|
|
|
2014-01-31 15:30:27 -08:00
|
|
|
// See documentation in compaction_picker.h
|
2014-10-01 16:19:16 -07:00
|
|
|
// REQUIRES: DB mutex held
|
Rewritten system for scheduling background work
Summary:
When scaling to a higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which looped over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull model, where we do something for every column family, we use a push-based model: when we detect that a column family is ready to be flushed or compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 20:38:12 +01:00
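The push-based model from the "Rewritten system for scheduling background work" message can be sketched as below. This is a minimal illustration under stated assumptions, not the DBImpl code: `ColumnFamily`, `FlushScheduler`, `Schedule` and `PopFirst` are hypothetical names, and the `pending_flush` flag plays the role that `pending_flush_` plays in this header (true while the family is in the queue). Callers are assumed to hold the DB mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a column family; only the flag matters here.
struct ColumnFamily {
  explicit ColumnFamily(int family_id) : id(family_id) {}
  int id;
  bool pending_flush = false;  // true while the family sits in the queue
};

// Hypothetical scheduler: families are pushed when they become ready,
// so the consumer never scans all column families.
class FlushScheduler {
 public:
  // Push side: called when a family is detected to need a flush.
  // Returns false if the family is already queued.
  bool Schedule(ColumnFamily* cf) {
    if (cf->pending_flush) return false;
    cf->pending_flush = true;
    queue_.push_back(cf);
    return true;
  }

  // Consumer side: O(1) regardless of how many column families exist.
  // Returns nullptr when nothing is waiting.
  ColumnFamily* PopFirst() {
    if (queue_.empty()) return nullptr;
    ColumnFamily* cf = queue_.front();
    queue_.pop_front();
    assert(cf->pending_flush);
    cf->pending_flush = false;
    return cf;
  }

  std::size_t size() const { return queue_.size(); }

 private:
  std::deque<ColumnFamily*> queue_;
};
```

The flag keeps a family from being queued twice, which is why the cost of scheduling depends on the number of ready families rather than on the total number of families.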
|
|
|
bool NeedsCompaction() const;
|
|
|
|
// REQUIRES: DB mutex held
|
2014-10-01 16:19:16 -07:00
|
|
|
Compaction* PickCompaction(const MutableCFOptions& mutable_options,
|
|
|
|
LogBuffer* log_buffer);
|
2015-04-22 16:55:22 -07:00
|
|
|
// A flag telling a manual compaction to compact all levels together
// instead of a specific level.
|
|
|
|
static const int kCompactAllLevels;
|
2015-04-14 21:45:20 -07:00
|
|
|
// A flag telling a manual compaction to write its output to the base level.
|
|
|
|
static const int kCompactToBaseLevel;
|
2014-12-19 20:38:12 +01:00
|
|
|
// REQUIRES: DB mutex held
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines ConflictingManualCompaction to allow/disallow new parallel/manual compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
Compaction* CompactRange(const MutableCFOptions& mutable_cf_options,
|
|
|
|
int input_level, int output_level,
|
|
|
|
uint32_t output_path_id, const InternalKey* begin,
|
|
|
|
const InternalKey* end, InternalKey** compaction_end,
|
|
|
|
bool* manual_conflict);
|
2014-01-31 15:30:27 -08:00
|
|
|
|
2014-03-11 14:52:17 -07:00
|
|
|
CompactionPicker* compaction_picker() { return compaction_picker_.get(); }
|
|
|
|
// thread-safe
|
2014-02-04 16:31:18 -08:00
|
|
|
const Comparator* user_comparator() const {
|
|
|
|
return internal_comparator_.user_comparator();
|
|
|
|
}
|
2014-03-11 14:52:17 -07:00
|
|
|
// thread-safe
|
2014-02-04 16:31:18 -08:00
|
|
|
const InternalKeyComparator& internal_comparator() const {
|
|
|
|
return internal_comparator_;
|
|
|
|
}
|
2014-01-31 15:30:27 -08:00
|
|
|
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users cannot tell from the TablePropertiesCollector callback whether an entry is an add, delete or merge. Add a new function to expose it.
Also refactor the code so that
(1) the table property collector and the internal table property collector are two separate data structures, with the latter now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 10:04:30 -07:00
|
|
|
const std::vector<std::unique_ptr<IntTblPropCollectorFactory>>*
|
|
|
|
int_tbl_prop_collector_factories() const {
|
|
|
|
return &int_tbl_prop_collector_factories_;
|
|
|
|
}
|
|
|
|
|
2014-03-11 14:52:17 -07:00
|
|
|
SuperVersion* GetSuperVersion() { return super_version_; }
|
|
|
|
// thread-safe
|
2014-04-14 09:34:59 -07:00
|
|
|
// Return an already-referenced SuperVersion to be used safely.
|
2015-02-04 21:39:45 -08:00
|
|
|
SuperVersion* GetReferencedSuperVersion(InstrumentedMutex* db_mutex);
|
2014-04-14 09:34:59 -07:00
|
|
|
// thread-safe
|
|
|
|
// Get SuperVersion stored in thread local storage. If it does not exist,
|
|
|
|
// get a reference from the current SuperVersion.
|
2015-02-04 21:39:45 -08:00
|
|
|
SuperVersion* GetThreadLocalSuperVersion(InstrumentedMutex* db_mutex);
|
2014-04-14 09:34:59 -07:00
|
|
|
// Try to return SuperVersion back to thread local storage. Return true on
|
|
|
|
// success and false on failure. It fails when the thread local storage
|
|
|
|
// contains anything other than SuperVersion::kSVInUse flag.
|
|
|
|
bool ReturnThreadLocalSuperVersion(SuperVersion* sv);
|
2014-03-11 14:52:17 -07:00
|
|
|
// thread-safe
|
2014-01-29 13:28:50 -08:00
|
|
|
uint64_t GetSuperVersionNumber() const {
|
|
|
|
return super_version_number_.load();
|
|
|
|
}
|
|
|
|
// Returns a pointer to the previous SuperVersion if its reference count
// has dropped to zero and it needs deletion, or nullptr otherwise.
// Takes a pointer to an already-allocated SuperVersion so that clients
// can allocate the SuperVersion outside of the mutex.
|
2014-12-19 20:38:12 +01:00
|
|
|
// IMPORTANT: Only call this from DBImpl::InstallSuperVersion()
|
2014-09-17 12:49:13 -07:00
|
|
|
SuperVersion* InstallSuperVersion(SuperVersion* new_superversion,
|
2015-02-04 21:39:45 -08:00
|
|
|
InstrumentedMutex* db_mutex,
|
2014-09-17 12:49:13 -07:00
|
|
|
const MutableCFOptions& mutable_cf_options);
|
2014-03-03 17:54:04 -08:00
|
|
|
SuperVersion* InstallSuperVersion(SuperVersion* new_superversion,
|
2015-02-04 21:39:45 -08:00
|
|
|
InstrumentedMutex* db_mutex);
|
2014-03-03 17:54:04 -08:00
|
|
|
|
|
|
|
void ResetThreadLocalSuperVersions();
|
2014-01-29 13:28:50 -08:00
|
|
|
|
2014-12-19 20:38:12 +01:00
|
|
|
// Protected by DB mutex
|
|
|
|
void set_pending_flush(bool value) { pending_flush_ = value; }
|
|
|
|
void set_pending_compaction(bool value) { pending_compaction_ = value; }
|
|
|
|
bool pending_flush() { return pending_flush_; }
|
|
|
|
bool pending_compaction() { return pending_compaction_; }
|
|
|
|
|
When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and make sure without the commit, write stop will happen, but with the commit, it will not happen.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
2015-12-17 17:07:44 -08:00
|
|
|
// Recalculate a few conditions that change only during compaction,
// when a new memtable is added, and/or when the compaction score is
// recalculated. These values are used in DBImpl::MakeRoomForWrite to
// decide whether a write stall is needed.
|
|
|
|
void RecalculateWriteStallConditions(
|
|
|
|
const MutableCFOptions& mutable_cf_options);
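The idea behind recalculating stall conditions only when compaction state changes, combined with the rate adjustment described in the "When slowdown is triggered, reduce the write rate" message, can be sketched as follows. This is an illustration under loud assumptions, not the body of RecalculateWriteStallConditions: `StallLimits`, `StallState` and `WriteMode` are hypothetical, and the 4/5 and 5/4 adjustment factors are made up for the example. Only the shape is taken from the source: a previous-bytes value (compare `prev_compaction_needed_bytes_`) decides whether the delayed write rate goes down or up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

enum class WriteMode { kNormal, kDelayed, kStopped };

// Hypothetical limits; the names only loosely mirror the db_bench flags.
struct StallLimits {
  uint64_t soft_pending_bytes;  // at or above this, writes are delayed
  uint64_t hard_pending_bytes;  // at or above this, writes are stopped
  uint64_t max_write_rate;      // bytes per second while delayed
};

class StallState {
 public:
  explicit StallState(const StallLimits& limits)
      : limits_(limits), rate_(limits.max_write_rate) {
    assert(limits.soft_pending_bytes <= limits.hard_pending_bytes);
  }

  // Meant to be called when a flush or compaction finishes, not per write.
  WriteMode Recalculate(uint64_t pending_bytes) {
    if (pending_bytes >= limits_.hard_pending_bytes) {
      mode_ = WriteMode::kStopped;
    } else if (pending_bytes >= limits_.soft_pending_bytes) {
      if (mode_ != WriteMode::kDelayed) {
        rate_ = limits_.max_write_rate;  // entering the delayed state
      } else if (pending_bytes > prev_pending_bytes_) {
        // Backlog grew: slow writers down (factor is an assumption).
        rate_ = std::max<uint64_t>(1, rate_ * 4 / 5);
      } else if (pending_bytes < prev_pending_bytes_) {
        // Backlog shrank: speed writers up again, capped at the maximum.
        rate_ = std::min(limits_.max_write_rate, rate_ * 5 / 4);
      }
      mode_ = WriteMode::kDelayed;
    } else {
      mode_ = WriteMode::kNormal;
      rate_ = limits_.max_write_rate;
    }
    prev_pending_bytes_ = pending_bytes;
    return mode_;
  }

  // The write path only reads these cached values.
  WriteMode mode() const { return mode_; }
  uint64_t write_rate() const { return rate_; }

 private:
  StallLimits limits_;
  WriteMode mode_ = WriteMode::kNormal;
  uint64_t rate_;
  uint64_t prev_pending_bytes_ = 0;
};
```

Writers consult `mode()` and `write_rate()` without recomputing anything, so the per-write cost stays constant.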
|
|
|
|
|
2014-01-29 13:28:50 -08:00
|
|
|
private:
|
2014-01-30 16:49:46 -08:00
|
|
|
friend class ColumnFamilySet;
|
2014-07-30 13:53:08 -07:00
|
|
|
ColumnFamilyData(uint32_t id, const std::string& name,
|
|
|
|
Version* dummy_versions, Cache* table_cache,
|
2014-12-02 12:09:20 -08:00
|
|
|
WriteBuffer* write_buffer,
|
2014-07-30 13:53:08 -07:00
|
|
|
const ColumnFamilyOptions& options,
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define a DB epoch as a period where there are no flushes and compactions (i.e. a new epoch is started when a flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 11:20:25 -07:00
|
|
|
const DBOptions* db_options, const EnvOptions& env_options,
|
2014-02-10 17:04:44 -08:00
|
|
|
ColumnFamilySet* column_family_set);
|
2014-01-30 16:49:46 -08:00
|
|
|
|
2014-01-29 13:28:50 -08:00
|
|
|
uint32_t id_;
|
|
|
|
const std::string name_;
|
|
|
|
Version* dummy_versions_; // Head of circular doubly-linked list of versions.
|
|
|
|
Version* current_; // == dummy_versions->prev_
|
|
|
|
|
2015-01-26 11:48:07 -08:00
|
|
|
std::atomic<int> refs_; // outstanding references to ColumnFamilyData
|
2014-03-11 14:52:17 -07:00
|
|
|
bool dropped_; // true if client dropped it
|
2014-02-10 17:04:44 -08:00
|
|
|
|
2014-02-04 16:31:18 -08:00
|
|
|
const InternalKeyComparator internal_comparator_;
|
2015-04-06 10:04:30 -07:00
|
|
|
std::vector<std::unique_ptr<IntTblPropCollectorFactory>>
|
|
|
|
int_tbl_prop_collector_factories_;
|
2014-02-04 16:31:18 -08:00
|
|
|
|
2014-09-04 16:18:36 -07:00
|
|
|
const Options options_;
|
|
|
|
const ImmutableCFOptions ioptions_;
|
2014-09-17 12:49:13 -07:00
|
|
|
MutableCFOptions mutable_cf_options_;
|
2014-01-31 15:30:27 -08:00
|
|
|
|
2014-02-05 09:07:55 -08:00
|
|
|
std::unique_ptr<TableCache> table_cache_;
|
|
|
|
|
2014-02-04 17:45:19 -08:00
|
|
|
std::unique_ptr<InternalStats> internal_stats_;
|
|
|
|
|
2014-12-02 12:09:20 -08:00
|
|
|
WriteBuffer* write_buffer_;
|
|
|
|
|
2014-01-29 13:28:50 -08:00
|
|
|
MemTable* mem_;
|
|
|
|
MemTableList imm_;
|
|
|
|
SuperVersion* super_version_;
|
|
|
|
|
|
|
|
// An ordinal representing the current SuperVersion. Updated by
|
|
|
|
// InstallSuperVersion(), i.e. incremented every time super_version_
|
|
|
|
// changes.
|
|
|
|
std::atomic<uint64_t> super_version_number_;
|
|
|
|
|
2014-03-03 17:54:04 -08:00
|
|
|
// Thread's local copy of SuperVersion pointer
|
|
|
|
// This needs to be destructed before mutex_
|
2014-03-04 09:03:56 -08:00
|
|
|
std::unique_ptr<ThreadLocalPtr> local_sv_;
|
2014-03-03 17:54:04 -08:00
|
|
|
|
2015-03-19 17:04:29 -07:00
|
|
|
// pointers for a circular linked list. we use it to support iterations over
|
|
|
|
// all column families that are alive (note: dropped column families can also
|
|
|
|
// be alive as long as client holds a reference)
|
2014-02-10 17:04:44 -08:00
|
|
|
ColumnFamilyData* next_;
|
|
|
|
ColumnFamilyData* prev_;
|
2014-01-30 16:49:46 -08:00
|
|
|
|
2014-01-29 13:28:50 -08:00
|
|
|
// This is the earliest log file number that contains data from this
|
|
|
|
// Column Family. All earlier log files must be ignored and not
|
|
|
|
// recovered from
|
|
|
|
uint64_t log_number_;
|
2014-01-30 15:23:13 -08:00
|
|
|
|
2014-01-31 15:30:27 -08:00
|
|
|
// An object that keeps all the compaction stats
|
|
|
|
// and picks the next compaction
|
|
|
|
std::unique_ptr<CompactionPicker> compaction_picker_;
|
2014-02-10 17:04:44 -08:00
|
|
|
|
|
|
|
ColumnFamilySet* column_family_set_;
|
2014-09-08 11:20:25 -07:00
|
|
|
|
|
|
|
std::unique_ptr<WriteControllerToken> write_controller_token_;
|
2014-12-19 20:38:12 +01:00
|
|
|
|
|
|
|
// If true --> this ColumnFamily is currently present in DBImpl::flush_queue_
|
|
|
|
bool pending_flush_;
|
|
|
|
|
|
|
|
// If true --> this ColumnFamily is currently present in
|
|
|
|
// DBImpl::compaction_queue_
|
|
|
|
bool pending_compaction_;
|
2015-12-17 17:07:44 -08:00
|
|
|
|
|
|
|
uint64_t prev_compaction_needed_bytes_;
|
2014-01-22 11:44:53 -08:00
|
|
|
};
|
|
|
|
|
2014-03-11 14:52:17 -07:00
|
|
|
// ColumnFamilySet has interesting thread-safety requirements
|
2015-01-06 12:44:21 -08:00
|
|
|
// * CreateColumnFamily() or RemoveColumnFamily() -- need to be protected by DB
|
|
|
|
// mutex AND executed in the write thread.
|
|
|
|
// CreateColumnFamily() should ONLY be called from VersionSet::LogAndApply() AND
|
|
|
|
// single-threaded write thread. It is also called during Recovery and in
|
|
|
|
// DumpManifest().
|
|
|
|
// RemoveColumnFamily() is only called from SetDropped(). DB mutex needs to be
|
|
|
|
// held and it needs to be executed from the write thread. SetDropped() also
|
|
|
|
// guarantees that it will be called only from single-threaded LogAndApply(),
|
|
|
|
// but this condition is not that important.
|
2014-03-11 14:52:17 -07:00
|
|
|
// * Iteration -- hold DB mutex, but you can release it in the body of
|
|
|
|
// iteration. If you release DB mutex in body, reference the column
|
|
|
|
// family before the mutex and unreference after you unlock, since the column
|
|
|
|
// family might get dropped when the DB mutex is released
|
|
|
|
// * GetDefault() -- thread safe
|
2015-01-06 12:44:21 -08:00
|
|
|
// * GetColumnFamily() -- either inside of DB mutex or from a write thread
|
2014-06-02 15:33:54 -07:00
|
|
|
// * GetNextColumnFamilyID(), GetMaxColumnFamily(), UpdateMaxColumnFamily(),
|
|
|
|
// NumberOfColumnFamilies -- inside of DB mutex
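The iteration rule in the list above (hold the DB mutex, but pin the column family with a reference if the mutex is released in the loop body) can be sketched as follows. This is one reading of the rule, not code from RocksDB: the sketch takes the reference before releasing the mutex and drops it after reacquiring it. `Family` and `ForEachUnlocked` are hypothetical names, and a plain `std::vector` stands in for the circular list of column families.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a reference-counted column family.
struct Family {
  std::atomic<int> refs{0};

  void Ref() { refs.fetch_add(1, std::memory_order_relaxed); }

  // Returns true when the last reference was dropped.
  bool Unref() {
    int old = refs.fetch_sub(1, std::memory_order_relaxed);
    assert(old > 0);
    return old == 1;
  }
};

// Visits every family. The mutex is held while stepping through the
// container and released only while slow_work runs on a pinned family.
template <typename Fn>
void ForEachUnlocked(std::mutex& db_mutex, const std::vector<Family*>& families,
                     Fn slow_work) {
  std::unique_lock<std::mutex> lock(db_mutex);
  for (Family* family : families) {
    family->Ref();    // pin while the mutex is still held
    lock.unlock();    // the family may be dropped by someone else now
    slow_work(family);
    lock.lock();      // reacquire before touching shared state again
    family->Unref();  // a real implementation would delete on the last Unref
  }
}
```

The reference keeps the object alive across the unlocked window even if the column family is dropped in the meantime, which is the hazard the comment warns about.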
|
2014-01-22 11:44:53 -08:00
|
|
|
class ColumnFamilySet {
|
|
|
|
public:
|
2014-03-11 14:52:17 -07:00
|
|
|
// ColumnFamilySet supports iteration
|
2014-01-24 14:30:28 -08:00
|
|
|
class iterator {
|
|
|
|
public:
|
2014-01-30 16:49:46 -08:00
|
|
|
explicit iterator(ColumnFamilyData* cfd)
|
|
|
|
: current_(cfd) {}
|
2014-01-24 14:30:28 -08:00
|
|
|
iterator& operator++() {
|
2015-03-19 17:04:29 -07:00
|
|
|
// dropped column families might still be included in this iteration
|
|
|
|
// (we're only removing them when client drops the last reference to the
|
|
|
|
// column family).
|
|
|
|
// dummy is never dead, so this will never be infinite
|
2014-02-10 17:04:44 -08:00
|
|
|
do {
|
2014-03-11 14:52:17 -07:00
|
|
|
current_ = current_->next_;
|
2015-03-19 17:04:29 -07:00
|
|
|
} while (current_->refs_.load(std::memory_order_relaxed) == 0);
|
2014-01-24 14:30:28 -08:00
|
|
|
return *this;
|
|
|
|
}
|
2014-01-30 16:49:46 -08:00
|
|
|
bool operator!=(const iterator& other) {
|
|
|
|
return this->current_ != other.current_;
|
|
|
|
}
|
|
|
|
ColumnFamilyData* operator*() { return current_; }
|
2014-01-24 14:30:28 -08:00
|
|
|
|
|
|
|
private:
|
2014-01-30 16:49:46 -08:00
|
|
|
ColumnFamilyData* current_;
|
2014-01-24 14:30:28 -08:00
|
|
|
};
|
2014-01-22 11:44:53 -08:00
|
|
|
|
2014-02-05 13:12:23 -08:00
|
|
|
ColumnFamilySet(const std::string& dbname, const DBOptions* db_options,
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define a DB epoch as a period where there are no flushes and compactions (i.e. a new epoch is started when a flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 11:20:25 -07:00
|
|
|
const EnvOptions& env_options, Cache* table_cache,
|
2014-12-02 12:09:20 -08:00
|
|
|
WriteBuffer* write_buffer, WriteController* write_controller);
|
2014-01-22 11:44:53 -08:00
|
|
|
~ColumnFamilySet();
|
|
|
|
|
|
|
|
ColumnFamilyData* GetDefault() const;
|
|
|
|
// GetColumnFamily() calls return nullptr if column family is not found
|
|
|
|
ColumnFamilyData* GetColumnFamily(uint32_t id) const;
|
2014-02-28 14:05:11 -08:00
|
|
|
ColumnFamilyData* GetColumnFamily(const std::string& name) const;
|
2014-01-22 11:44:53 -08:00
|
|
|
// This call returns the next available column family ID. It guarantees
|
|
|
|
// that there is no column family with id greater than or equal to the
|
2014-03-05 12:13:44 -08:00
|
|
|
// returned value in the current running instance or anytime in RocksDB
|
|
|
|
// instance history.
|
2014-01-22 11:44:53 -08:00
|
|
|
uint32_t GetNextColumnFamilyID();
|
2014-03-05 12:13:44 -08:00
|
|
|
uint32_t GetMaxColumnFamily();
|
|
|
|
void UpdateMaxColumnFamily(uint32_t new_max_column_family);
|
2014-06-02 15:33:54 -07:00
|
|
|
size_t NumberOfColumnFamilies() const;
|
2014-01-22 11:44:53 -08:00
|
|
|
|
|
|
|
ColumnFamilyData* CreateColumnFamily(const std::string& name, uint32_t id,
|
|
|
|
Version* dummy_version,
|
|
|
|
const ColumnFamilyOptions& options);
|
|
|
|
|
2014-03-11 14:52:17 -07:00
|
|
|
iterator begin() { return iterator(dummy_cfd_->next_); }
|
2014-01-30 16:49:46 -08:00
|
|
|
iterator end() { return iterator(dummy_cfd_); }
|
2014-01-22 11:44:53 -08:00
|
|
|
|
2014-04-07 14:21:25 -07:00
|
|
|
// REQUIRES: DB mutex held
|
|
|
|
// Don't call while iterating over ColumnFamilySet
|
|
|
|
void FreeDeadColumnFamilies();
|
|
|
|
|
2014-01-22 11:44:53 -08:00
|
|
|
private:
|
2014-03-11 14:52:17 -07:00
|
|
|
friend class ColumnFamilyData;
|
|
|
|
// helper function that gets called from cfd destructor
|
|
|
|
// REQUIRES: DB mutex held
|
|
|
|
void RemoveColumnFamily(ColumnFamilyData* cfd);
|
|
|
|
|
|
|
|
// column_families_ and column_family_data_ need to be protected:
|
2015-01-06 12:44:21 -08:00
|
|
|
// * when mutating, both conditions have to be satisfied:
|
|
|
|
// 1. DB mutex locked
|
|
|
|
// 2. thread currently in single-threaded write thread
|
|
|
|
// * when reading, at least one condition needs to be satisfied:
|
|
|
|
// 1. DB mutex locked
|
|
|
|
// 2. accessed from a single-threaded write thread
|
2014-01-22 11:44:53 -08:00
|
|
|
std::unordered_map<std::string, uint32_t> column_families_;
|
|
|
|
std::unordered_map<uint32_t, ColumnFamilyData*> column_family_data_;
|
2014-03-11 14:52:17 -07:00
|
|
|
|
2014-01-22 11:44:53 -08:00
|
|
|
uint32_t max_column_family_;
|
2014-01-30 16:49:46 -08:00
|
|
|
ColumnFamilyData* dummy_cfd_;
|
2014-03-11 14:52:17 -07:00
|
|
|
// We don't hold the refcount here, since the default column family always exists.
|
|
|
|
// We are also not responsible for cleaning up default_cfd_cache_. This is
|
|
|
|
// just a cache that makes common case (accessing default column family)
|
|
|
|
// faster
|
|
|
|
ColumnFamilyData* default_cfd_cache_;
|
[CF] Rethink table cache
Summary:
Adapting the table cache to column families is interesting. We want the table cache to be a global LRU, so if some column families are not used as often as others, we want them to be evicted from the cache. However, the current TableCache object also constructs tables on its own. If a table is not found in the cache, TableCache automatically creates a new table. We want each column family to be able to specify a different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underlying cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. In the future, it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 09:07:55 -08:00
|
|
|
|
|
|
|
const std::string db_name_;
|
2014-02-05 13:12:23 -08:00
|
|
|
const DBOptions* const db_options_;
|
2014-09-04 16:18:36 -07:00
|
|
|
const EnvOptions env_options_;
|
2014-02-05 09:07:55 -08:00
|
|
|
Cache* table_cache_;
|
2014-12-02 12:09:20 -08:00
|
|
|
WriteBuffer* write_buffer_;
|
2014-09-08 11:20:25 -07:00
|
|
|
WriteController* write_controller_;
|
2014-01-22 11:44:53 -08:00
|
|
|
};
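The iterator above walks a circular list anchored at a dummy node and skips entries whose reference count has reached zero. A minimal, self-contained sketch of that idea follows. `Node`, `NodeSet`, `PushBack` and `LiveIds` are hypothetical names invented for illustration and are not part of RocksDB; only the loop shape in `operator++` is taken from the class above.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-in for ColumnFamilyData: an id, a refcount and ring links.
struct Node {
  explicit Node(int id_in, int refs_in) : id(id_in), refs(refs_in) {}
  int id;
  std::atomic<int> refs;
  Node* next = this;
  Node* prev = this;
};

// Hypothetical stand-in for ColumnFamilySet. It links nodes but owns none.
class NodeSet {
 public:
  class iterator {
   public:
    explicit iterator(Node* n) : current_(n) {}
    iterator& operator++() {
      // Skip nodes whose refcount is zero. The dummy keeps refs == 1,
      // so the loop stops at end() at the latest.
      do {
        current_ = current_->next;
      } while (current_->refs.load(std::memory_order_relaxed) == 0);
      return *this;
    }
    bool operator!=(const iterator& other) const {
      return current_ != other.current_;
    }
    Node* operator*() const { return current_; }

   private:
    Node* current_;
  };

  NodeSet() : dummy_(-1, 1) {}

  // Links n just before the dummy, i.e. at the tail of the ring.
  void PushBack(Node* n) {
    n->next = &dummy_;
    n->prev = dummy_.prev;
    dummy_.prev->next = n;
    dummy_.prev = n;
  }

  // As in the class above, begin() does not itself skip a dead first node.
  iterator begin() { return iterator(dummy_.next); }
  iterator end() { return iterator(&dummy_); }

 private:
  Node dummy_;
};

// Collects the ids that a range-for over the set actually visits.
std::vector<int> LiveIds(NodeSet& set) {
  std::vector<int> ids;
  for (Node* n : set) ids.push_back(n->id);
  return ids;
}
```

Using the dummy as both list anchor and `end()` value means an empty set needs no special case: `begin()` and `end()` then compare equal.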
|
|
|
|
|
2014-03-11 14:52:17 -07:00
|
|
|
// We use ColumnFamilyMemTablesImpl to provide WriteBatch a way to access
|
|
|
|
// memtables of different column families (specified by ID in the write batch)
|
2014-01-28 11:05:04 -08:00
|
|
|
class ColumnFamilyMemTablesImpl : public ColumnFamilyMemTables {
|
|
|
|
public:
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2GHz with lots of memory and an SSD. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 16:59:07 -07:00
|
|
|
explicit ColumnFamilyMemTablesImpl(ColumnFamilySet* column_family_set)
|
|
|
|
: column_family_set_(column_family_set), current_(nullptr) {}
|
|
|
|
|
|
|
|
// Constructs a ColumnFamilyMemTablesImpl equivalent to one constructed
|
|
|
|
// with the arguments used to construct *orig.
|
|
|
|
explicit ColumnFamilyMemTablesImpl(ColumnFamilyMemTablesImpl* orig)
|
|
|
|
: column_family_set_(orig->column_family_set_), current_(nullptr) {}
|
2014-01-28 11:05:04 -08:00
|
|
|
|
2014-02-05 16:02:48 -08:00
|
|
|
// sets current_ to ColumnFamilyData with column_family_id
|
|
|
|
// returns false if column family doesn't exist
|
2015-08-14 16:59:07 -07:00
|
|
|
// REQUIRES: calls to this function through DBImpl::column_family_memtables_
|
|
|
|
// must be made under the DB mutex OR from a write thread
|
2014-02-05 16:02:48 -08:00
|
|
|
bool Seek(uint32_t column_family_id) override;
|
|
|
|
|
|
|
|
// Returns log number of the selected column family
|
2015-01-06 12:44:21 -08:00
|
|
|
// REQUIRES: under a DB mutex OR from a write thread
|
2014-02-05 16:02:48 -08:00
|
|
|
uint64_t GetLogNumber() const override;
|
|
|
|
|
|
|
|
// REQUIRES: Seek() called first
|
2015-08-14 16:59:07 -07:00
|
|
|
// REQUIRES: calls to this function through DBImpl::column_family_memtables_
|
|
|
|
// must be made under the DB mutex OR from a write thread
|
2014-02-05 16:02:48 -08:00
|
|
|
MemTable* GetMemTable() const override;
|
2014-01-28 11:05:04 -08:00
|
|
|
|
2014-02-05 16:02:48 -08:00
|
|
|
// Returns column family handle for the selected column family
|
2015-08-14 16:59:07 -07:00
|
|
|
// REQUIRES: calls to this function through DBImpl::column_family_memtables_
|
|
|
|
// must be made under the DB mutex OR from a write thread
|
2014-02-10 17:04:44 -08:00
|
|
|
ColumnFamilyHandle* GetColumnFamilyHandle() override;
|
2014-01-28 11:05:04 -08:00
|
|
|
|
2015-08-14 16:59:07 -07:00
|
|
|
// Cannot be called while another thread is calling Seek().
|
|
|
|
// REQUIRES: calls to this function through DBImpl::column_family_memtables_
|
|
|
|
// must be made under the DB mutex OR from a write thread
|
2015-12-28 09:50:49 -08:00
|
|
|
ColumnFamilyData* current() override { return current_; }
|
2014-09-10 18:46:09 -07:00
|
|
|
|
2014-01-28 11:05:04 -08:00
|
|
|
private:
|
|
|
|
ColumnFamilySet* column_family_set_;
|
2014-02-05 16:02:48 -08:00
|
|
|
ColumnFamilyData* current_;
|
2014-02-10 17:04:44 -08:00
|
|
|
ColumnFamilyHandleInternal handle_;
|
2014-01-28 11:05:04 -08:00
|
|
|
};
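The class above follows a "Seek first, then read" contract: Seek() selects a column family by ID and reports whether it exists, and the accessors are only meaningful afterwards. The sketch below illustrates that contract with a plain map lookup. `FamilyData` and `FamilyCursor` are hypothetical names, and the lookup logic is an assumption made for illustration rather than the actual implementation of ColumnFamilyMemTablesImpl.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for ColumnFamilyData.
struct FamilyData {
  uint64_t log_number;
};

// Hypothetical cursor that mirrors the Seek()-then-read contract.
class FamilyCursor {
 public:
  explicit FamilyCursor(
      const std::unordered_map<uint32_t, FamilyData*>* families)
      : families_(families), current_(nullptr) {}

  // Points current_ at the family with the given id.
  // Returns false, and resets current_ to nullptr, if there is no such family.
  bool Seek(uint32_t id) {
    auto it = families_->find(id);
    current_ = (it == families_->end()) ? nullptr : it->second;
    return current_ != nullptr;
  }

  // REQUIRES: the most recent Seek() returned true.
  uint64_t GetLogNumber() const {
    assert(current_ != nullptr);
    return current_->log_number;
  }

  FamilyData* current() const { return current_; }

 private:
  const std::unordered_map<uint32_t, FamilyData*>* families_;
  FamilyData* current_;
};
```

Because the cursor carries mutable state (`current_`), two threads cannot share one instance without external synchronization, which is consistent with the REQUIRES notes on the real class.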
|
|
|
|
|
2014-08-18 15:19:17 -07:00
|
|
|
extern uint32_t GetColumnFamilyID(ColumnFamilyHandle* column_family);
|
|
|
|
|
2014-09-22 11:37:35 -07:00
|
|
|
extern const Comparator* GetColumnFamilyUserComparator(
|
|
|
|
ColumnFamilyHandle* column_family);
|
|
|
|
|
2014-01-22 11:44:53 -08:00
|
|
|
} // namespace rocksdb
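The thread-safety notes for ColumnFamilySet say that the DB mutex may be released inside the body of an iteration, provided the column family is referenced first so that it cannot disappear in the meantime. The sketch below shows one way that pattern can look. `Family` and `VisitAll` are hypothetical names, the list of families is assumed not to change during the walk, and re-acquiring the mutex before dropping the reference is an assumption of this sketch.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for ColumnFamilyData. refs is guarded by the mutex.
struct Family {
  int id;
  int refs;
};

// Visits every family and runs slow_work without holding the mutex.
// Returns the number of families visited.
template <typename Fn>
int VisitAll(std::mutex& db_mutex, std::vector<Family*>& families,
             Fn slow_work) {
  int visited = 0;
  std::unique_lock<std::mutex> lock(db_mutex);
  for (Family* f : families) {
    ++f->refs;       // pin the family while the mutex is still held
    lock.unlock();   // other threads may now drop column families
    slow_work(*f);   // the pin keeps *f valid during this call
    lock.lock();     // assumption: unpin only with the mutex held again
    --f->refs;
    ++visited;
  }
  return visited;
}
```

The extra reference is what makes the unlocked section safe: a concurrent drop can mark the family as dropped, but it cannot free the object while the count is above zero.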
|