2016-02-10 00:12:00 +01:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2013-10-16 23:59:46 +02:00
|
|
|
// This source code is licensed under the BSD-style license found in the
|
|
|
|
// LICENSE file in the root directory of this source tree. An additional grant
|
|
|
|
// of patent rights can be found in the PATENTS file in the same directory.
|
2017-04-28 02:50:56 +02:00
|
|
|
// This source code is also licensed under the GPLv2 license found in the
|
|
|
|
// COPYING file in the root directory of this source tree.
|
2013-10-16 23:59:46 +02:00
|
|
|
//
|
2017-04-06 22:59:31 +02:00
|
|
|
#include "memtable/inlineskiplist.h"
|
2013-07-23 23:42:27 +02:00
|
|
|
#include "db/memtable.h"
|
InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node. This allows us to remove the pointer from the node
to the key. As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.
For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList. This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index). Taking advantage of inline allocation for
those cases is left to future work.
The perf win is biggest for small values. For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51123
2015-11-19 23:24:29 +01:00
|
|
|
#include "rocksdb/memtablerep.h"
|
In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
2014-06-03 01:38:00 +02:00
|
|
|
#include "util/arena.h"
|
2013-07-23 23:42:27 +02:00
|
|
|
|
2013-10-04 06:49:15 +02:00
|
|
|
namespace rocksdb {
|
2013-08-23 08:10:02 +02:00
|
|
|
namespace {
|
2013-07-23 23:42:27 +02:00
|
|
|
class SkipListRep : public MemTableRep {
|
InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node. This allows us to remove the pointer from the node
to the key. As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.
For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList. This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index). Taking advantage of inline allocation for
those cases is left to future work.
The perf win is biggest for small values. For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51123
2015-11-19 23:24:29 +01:00
|
|
|
InlineSkipList<const MemTableRep::KeyComparator&> skip_list_;
|
SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
2014-09-24 00:52:28 +02:00
|
|
|
const MemTableRep::KeyComparator& cmp_;
|
|
|
|
const SliceTransform* transform_;
|
|
|
|
const size_t lookahead_;
|
|
|
|
|
|
|
|
friend class LookaheadIterator;
|
2013-07-23 23:42:27 +02:00
|
|
|
public:
|
2014-12-02 21:09:20 +01:00
|
|
|
explicit SkipListRep(const MemTableRep::KeyComparator& compare,
|
|
|
|
MemTableAllocator* allocator,
|
SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
2014-09-24 00:52:28 +02:00
|
|
|
const SliceTransform* transform, const size_t lookahead)
|
2014-12-02 21:09:20 +01:00
|
|
|
: MemTableRep(allocator), skip_list_(compare, allocator), cmp_(compare),
|
SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
2014-09-24 00:52:28 +02:00
|
|
|
transform_(transform), lookahead_(lookahead) {
|
2014-03-10 20:56:46 +01:00
|
|
|
}
|
2013-07-23 23:42:27 +02:00
|
|
|
|
InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node. This allows us to remove the pointer from the node
to the key. As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.
For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList. This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index). Taking advantage of inline allocation for
those cases is left to future work.
The perf win is biggest for small values. For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51123
2015-11-19 23:24:29 +01:00
|
|
|
virtual KeyHandle Allocate(const size_t len, char** buf) override {
|
|
|
|
*buf = skip_list_.AllocateKey(len);
|
|
|
|
return static_cast<KeyHandle>(*buf);
|
|
|
|
}
|
|
|
|
|
2013-07-23 23:42:27 +02:00
|
|
|
// Insert key into the list.
|
|
|
|
// REQUIRES: nothing that compares equal to key is currently in the list.
|
2014-04-05 00:37:28 +02:00
|
|
|
virtual void Insert(KeyHandle handle) override {
|
|
|
|
skip_list_.Insert(static_cast<char*>(handle));
|
2013-07-23 23:42:27 +02:00
|
|
|
}
|
|
|
|
|
2016-11-14 03:58:17 +01:00
|
|
|
virtual void InsertWithHint(KeyHandle handle, void** hint) override {
|
2016-11-22 23:06:54 +01:00
|
|
|
skip_list_.InsertWithHint(static_cast<char*>(handle), hint);
|
2016-11-14 03:58:17 +01:00
|
|
|
}
|
|
|
|
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-15 01:59:07 +02:00
|
|
|
virtual void InsertConcurrently(KeyHandle handle) override {
|
|
|
|
skip_list_.InsertConcurrently(static_cast<char*>(handle));
|
|
|
|
}
|
|
|
|
|
2013-07-23 23:42:27 +02:00
|
|
|
// Returns true iff an entry that compares equal to key is in the list.
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual bool Contains(const char* key) const override {
|
2013-07-23 23:42:27 +02:00
|
|
|
return skip_list_.Contains(key);
|
|
|
|
}
|
|
|
|
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual size_t ApproximateMemoryUsage() override {
|
2014-12-02 21:09:20 +01:00
|
|
|
// All memory is allocated through allocator; nothing to report here
|
2013-08-23 08:10:02 +02:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-02-11 18:46:30 +01:00
|
|
|
virtual void Get(const LookupKey& k, void* callback_args,
|
|
|
|
bool (*callback_func)(void* arg,
|
|
|
|
const char* entry)) override {
|
|
|
|
SkipListRep::Iterator iter(&skip_list_);
|
|
|
|
Slice dummy_slice;
|
|
|
|
for (iter.Seek(dummy_slice, k.memtable_key().data());
|
|
|
|
iter.Valid() && callback_func(callback_args, iter.key());
|
|
|
|
iter.Next()) {
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-06-13 03:04:30 +02:00
|
|
|
uint64_t ApproximateNumEntries(const Slice& start_ikey,
|
|
|
|
const Slice& end_ikey) override {
|
|
|
|
std::string tmp;
|
|
|
|
uint64_t start_count =
|
|
|
|
skip_list_.EstimateCount(EncodeKey(&tmp, start_ikey));
|
|
|
|
uint64_t end_count = skip_list_.EstimateCount(EncodeKey(&tmp, end_ikey));
|
|
|
|
return (end_count >= start_count) ? (end_count - start_count) : 0;
|
|
|
|
}
|
|
|
|
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual ~SkipListRep() override { }
|
2013-07-23 23:42:27 +02:00
|
|
|
|
|
|
|
// Iteration over the contents of a skip list
|
|
|
|
class Iterator : public MemTableRep::Iterator {
|
InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node. This allows us to remove the pointer from the node
to the key. As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.
For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList. This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index). Taking advantage of inline allocation for
those cases is left to future work.
The perf win is biggest for small values. For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51123
2015-11-19 23:24:29 +01:00
|
|
|
InlineSkipList<const MemTableRep::KeyComparator&>::Iterator iter_;
|
|
|
|
|
2013-07-23 23:42:27 +02:00
|
|
|
public:
|
|
|
|
// Initialize an iterator over the specified list.
|
|
|
|
// The returned iterator is not valid.
|
|
|
|
explicit Iterator(
|
InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node. This allows us to remove the pointer from the node
to the key. As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.
For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList. This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index). Taking advantage of inline allocation for
those cases is left to future work.
The perf win is biggest for small values. For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51123
2015-11-19 23:24:29 +01:00
|
|
|
const InlineSkipList<const MemTableRep::KeyComparator&>* list)
|
|
|
|
: iter_(list) {}
|
2013-07-23 23:42:27 +02:00
|
|
|
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual ~Iterator() override { }
|
2013-07-23 23:42:27 +02:00
|
|
|
|
|
|
|
// Returns true iff the iterator is positioned at a valid node.
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual bool Valid() const override {
|
2013-07-23 23:42:27 +02:00
|
|
|
return iter_.Valid();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Returns the key at the current position.
|
|
|
|
// REQUIRES: Valid()
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual const char* key() const override {
|
2013-07-23 23:42:27 +02:00
|
|
|
return iter_.key();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Advances to the next position.
|
|
|
|
// REQUIRES: Valid()
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual void Next() override {
|
2013-07-23 23:42:27 +02:00
|
|
|
iter_.Next();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Advances to the previous position.
|
|
|
|
// REQUIRES: Valid()
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual void Prev() override {
|
2013-07-23 23:42:27 +02:00
|
|
|
iter_.Prev();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Advance to the first entry with a key >= target
|
2013-11-21 04:49:27 +01:00
|
|
|
virtual void Seek(const Slice& user_key, const char* memtable_key)
|
|
|
|
override {
|
|
|
|
if (memtable_key != nullptr) {
|
|
|
|
iter_.Seek(memtable_key);
|
|
|
|
} else {
|
|
|
|
iter_.Seek(EncodeKey(&tmp_, user_key));
|
|
|
|
}
|
2013-07-23 23:42:27 +02:00
|
|
|
}
|
|
|
|
|
2016-09-28 03:20:57 +02:00
|
|
|
// Retreat to the last entry with a key <= target
|
|
|
|
virtual void SeekForPrev(const Slice& user_key,
|
|
|
|
const char* memtable_key) override {
|
|
|
|
if (memtable_key != nullptr) {
|
|
|
|
iter_.SeekForPrev(memtable_key);
|
|
|
|
} else {
|
|
|
|
iter_.SeekForPrev(EncodeKey(&tmp_, user_key));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-07-23 23:42:27 +02:00
|
|
|
// Position at the first entry in list.
|
|
|
|
// Final state of iterator is Valid() iff list is not empty.
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual void SeekToFirst() override {
|
2013-07-23 23:42:27 +02:00
|
|
|
iter_.SeekToFirst();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Position at the last entry in list.
|
|
|
|
// Final state of iterator is Valid() iff list is not empty.
|
2013-08-23 08:10:02 +02:00
|
|
|
virtual void SeekToLast() override {
|
2013-07-23 23:42:27 +02:00
|
|
|
iter_.SeekToLast();
|
|
|
|
}
|
2013-11-21 04:49:27 +01:00
|
|
|
protected:
|
|
|
|
std::string tmp_; // For passing to EncodeKey
|
2013-07-23 23:42:27 +02:00
|
|
|
};
|
|
|
|
|
SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
2014-09-24 00:52:28 +02:00
|
|
|
// Iterator over the contents of a skip list which also keeps track of the
|
|
|
|
// previously visited node. In Seek(), it examines a few nodes after it
|
|
|
|
// first, falling back to O(log n) search from the head of the list only if
|
|
|
|
// the target key hasn't been found.
|
|
|
|
class LookaheadIterator : public MemTableRep::Iterator {
|
|
|
|
public:
|
|
|
|
explicit LookaheadIterator(const SkipListRep& rep) :
|
|
|
|
rep_(rep), iter_(&rep_.skip_list_), prev_(iter_) {}
|
|
|
|
|
|
|
|
virtual ~LookaheadIterator() override {}
|
|
|
|
|
|
|
|
virtual bool Valid() const override {
|
|
|
|
return iter_.Valid();
|
|
|
|
}
|
|
|
|
|
|
|
|
virtual const char *key() const override {
|
|
|
|
assert(Valid());
|
|
|
|
return iter_.key();
|
|
|
|
}
|
|
|
|
|
|
|
|
virtual void Next() override {
|
|
|
|
assert(Valid());
|
|
|
|
|
|
|
|
bool advance_prev = true;
|
|
|
|
if (prev_.Valid()) {
|
|
|
|
auto k1 = rep_.UserKey(prev_.key());
|
|
|
|
auto k2 = rep_.UserKey(iter_.key());
|
|
|
|
|
|
|
|
if (k1.compare(k2) == 0) {
|
|
|
|
// same user key, don't move prev_
|
|
|
|
advance_prev = false;
|
|
|
|
} else if (rep_.transform_) {
|
|
|
|
// only advance prev_ if it has the same prefix as iter_
|
|
|
|
auto t1 = rep_.transform_->Transform(k1);
|
|
|
|
auto t2 = rep_.transform_->Transform(k2);
|
|
|
|
advance_prev = t1.compare(t2) == 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (advance_prev) {
|
|
|
|
prev_ = iter_;
|
|
|
|
}
|
|
|
|
iter_.Next();
|
|
|
|
}
|
|
|
|
|
|
|
|
virtual void Prev() override {
|
|
|
|
assert(Valid());
|
|
|
|
iter_.Prev();
|
|
|
|
prev_ = iter_;
|
|
|
|
}
|
|
|
|
|
|
|
|
virtual void Seek(const Slice& internal_key, const char *memtable_key)
|
|
|
|
override {
|
|
|
|
const char *encoded_key =
|
|
|
|
(memtable_key != nullptr) ?
|
|
|
|
memtable_key : EncodeKey(&tmp_, internal_key);
|
|
|
|
|
|
|
|
if (prev_.Valid() && rep_.cmp_(encoded_key, prev_.key()) >= 0) {
|
|
|
|
// prev_.key() is smaller or equal to our target key; do a quick
|
|
|
|
// linear search (at most lookahead_ steps) starting from prev_
|
|
|
|
iter_ = prev_;
|
|
|
|
|
|
|
|
size_t cur = 0;
|
|
|
|
while (cur++ <= rep_.lookahead_ && iter_.Valid()) {
|
|
|
|
if (rep_.cmp_(encoded_key, iter_.key()) <= 0) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
Next();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
iter_.Seek(encoded_key);
|
|
|
|
prev_ = iter_;
|
|
|
|
}
|
|
|
|
|
2016-09-28 03:20:57 +02:00
|
|
|
virtual void SeekForPrev(const Slice& internal_key,
|
|
|
|
const char* memtable_key) override {
|
|
|
|
const char* encoded_key = (memtable_key != nullptr)
|
|
|
|
? memtable_key
|
|
|
|
: EncodeKey(&tmp_, internal_key);
|
|
|
|
iter_.SeekForPrev(encoded_key);
|
|
|
|
prev_ = iter_;
|
|
|
|
}
|
|
|
|
|
SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
2014-09-24 00:52:28 +02:00
|
|
|
virtual void SeekToFirst() override {
|
|
|
|
iter_.SeekToFirst();
|
|
|
|
prev_ = iter_;
|
|
|
|
}
|
|
|
|
|
|
|
|
virtual void SeekToLast() override {
|
|
|
|
iter_.SeekToLast();
|
|
|
|
prev_ = iter_;
|
|
|
|
}
|
|
|
|
|
|
|
|
protected:
|
|
|
|
std::string tmp_; // For passing to EncodeKey
|
|
|
|
|
|
|
|
private:
|
|
|
|
const SkipListRep& rep_;
|
InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node. This allows us to remove the pointer from the node
to the key. As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.
For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList. This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index). Taking advantage of inline allocation for
those cases is left to future work.
The perf win is biggest for small values. For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51123
2015-11-19 23:24:29 +01:00
|
|
|
InlineSkipList<const MemTableRep::KeyComparator&>::Iterator iter_;
|
|
|
|
InlineSkipList<const MemTableRep::KeyComparator&>::Iterator prev_;
|
SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
2014-09-24 00:52:28 +02:00
|
|
|
};
|
|
|
|
|
In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
2014-06-03 01:38:00 +02:00
|
|
|
virtual MemTableRep::Iterator* GetIterator(Arena* arena = nullptr) override {
|
SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
2014-09-24 00:52:28 +02:00
|
|
|
if (lookahead_ > 0) {
|
|
|
|
void *mem =
|
|
|
|
arena ? arena->AllocateAligned(sizeof(SkipListRep::LookaheadIterator))
|
|
|
|
: operator new(sizeof(SkipListRep::LookaheadIterator));
|
|
|
|
return new (mem) SkipListRep::LookaheadIterator(*this);
|
In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
2014-06-03 01:38:00 +02:00
|
|
|
} else {
|
SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
2014-09-24 00:52:28 +02:00
|
|
|
void *mem =
|
|
|
|
arena ? arena->AllocateAligned(sizeof(SkipListRep::Iterator))
|
|
|
|
: operator new(sizeof(SkipListRep::Iterator));
|
In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
2014-06-03 01:38:00 +02:00
|
|
|
return new (mem) SkipListRep::Iterator(&skip_list_);
|
|
|
|
}
|
2013-07-23 23:42:27 +02:00
|
|
|
}
|
|
|
|
};
|
2013-08-23 08:10:02 +02:00
|
|
|
}
|
2013-07-23 23:42:27 +02:00
|
|
|
|
2014-01-16 03:17:58 +01:00
|
|
|
MemTableRep* SkipListFactory::CreateMemTableRep(
|
2014-12-02 21:09:20 +01:00
|
|
|
const MemTableRep::KeyComparator& compare, MemTableAllocator* allocator,
|
SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
2014-09-24 00:52:28 +02:00
|
|
|
const SliceTransform* transform, Logger* logger) {
|
2014-12-02 21:09:20 +01:00
|
|
|
return new SkipListRep(compare, allocator, transform, lookahead_);
|
2013-07-23 23:42:27 +02:00
|
|
|
}
|
|
|
|
|
2013-10-04 06:49:15 +02:00
|
|
|
} // namespace rocksdb
|