rocksdb/table/multiget_context.h
Peter Dillinger 9d0cae7104 Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899)
Summary:
When MultiGet() determines that multiple query keys can be
served by examining the same data block in block cache (one Lookup()),
each PinnableSlice referring to data in that data block needs to hold
on to the block in cache so that the slices can be released at
arbitrary times by the API user. Historically this was accomplished
with extra calls to Ref() on the Handle from Lookup(), with each
PinnableSlice cleanup calling Release() on the Handle, but this
creates extra contention on the block cache for the extra Ref()s and
Release()s, especially because they hit the same cache shard
repeatedly.

In the case of merge operands (possibly more cases?), the problem was
compounded by doing an extra Ref()+eventual Release() for each merge
operand for a key reusing a block (which could be the same key!), rather
than one Ref() per key. (Note: the non-shared case with `biter` was
already one per key.)

This change optimizes MultiGet to avoid these extra, contentious
Ref()+Release() calls: in the shared block case, the cache Release()
cleanup is wrapped in a refcounted object referenced by the
PinnableSlices, and the cache entry is Release()d only after the last
wrapped reference is released. Relaxed atomic refcounts should be
much faster than mutex-guarded Ref() and Release(), and much less prone
to a performance cliff when MultiGet() does a lot of block sharing.
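
For illustration, a minimal sketch of that pattern (not the actual
implementation; `SharedCleanup` and the names in it are made up):

```cpp
#include <atomic>
#include <functional>

// Sketch only: wraps a single cleanup (e.g. one block cache Release())
// behind a refcount shared by many PinnableSlices.
class SharedCleanup {
 public:
  explicit SharedCleanup(std::function<void()> cleanup)
      : cleanup_(std::move(cleanup)) {}

  // Taking an extra reference needs no ordering: relaxed is enough.
  void Ref() { refs_.fetch_add(1, std::memory_order_relaxed); }

  // Whichever thread drops the last reference runs the cleanup exactly
  // once and frees the wrapper.
  void Unref() {
    if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      cleanup_();
      delete this;
    }
  }

 private:
  ~SharedCleanup() = default;  // heap-only; destroyed via Unref()
  std::function<void()> cleanup_;
  std::atomic<uint32_t> refs_{1};  // creator starts with one reference
};
```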

Note that I did not use std::shared_ptr, because that would require an
extra indirection object (the shared_ptr control block, with its own
new/delete) in order to associate a ref increment/decrement with a
Cleanable cleanup entry. (If I assumed shared_ptr was the size of two
pointers, I could do some hackery to make it work without the extra
indirection, but that's too fragile.)
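
To illustrate: a Cleanable cleanup entry is just a plain function
pointer plus two void* arguments, so parking a shared_ptr reference
there would force an extra heap-allocated holder, roughly like this
(sketch; `Holder` and `DropRef` are made-up names):

```cpp
#include <memory>

// The extra indirection std::shared_ptr would force: a heap-allocated
// holder just to own one shared_ptr copy, because a Cleanable cleanup
// entry can only store (function pointer, void*, void*).
struct Holder {
  std::shared_ptr<void> ref;  // keeps the shared object alive
};

static void DropRef(void* arg1, void* /*arg2*/) {
  delete static_cast<Holder*>(arg1);  // drops one reference
}

// Registration would then look like:
//   cleanable->RegisterCleanup(&DropRef, new Holder{shared}, nullptr);
```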

Some details:
* Fixed (removed) extra block cache tracing entries in cases of cache
entry reuse in MultiGet, but it's likely that in some other cases traces
are missing (XXX comment inserted)
* Moved existing implementations for cleanable.h from iterator.cc to
new cleanable.cc
* Improved API comments on Cleanable
* Added a public SharedCleanablePtr class to cleanable.h in case others
could benefit from the same pattern (potentially many Cleanables and/or
smart pointers referencing a shared Cleanable); see the sketch after
this list
* Add a typedef for MultiGetContext::Mask
* Some variable renaming for clarity
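
As an illustration, sharing one cache release among several slices with
SharedCleanablePtr might look like the following. The method names
(Allocate, RegisterCopyWith) and the helpers (ReleaseEntry, the
cache/handle arguments) are assumptions for the sketch, not a statement
of the final API in cleanable.h:

```cpp
#include "rocksdb/cleanable.h"

using ROCKSDB_NAMESPACE::Cleanable;
using ROCKSDB_NAMESPACE::SharedCleanablePtr;

// Hypothetical stand-in for releasing a block cache entry.
void ReleaseEntry(void* cache, void* handle);

void PinSharedBlock(Cleanable* slice1, Cleanable* slice2, void* cache,
                    void* handle) {
  SharedCleanablePtr shared;
  shared.Allocate();
  // Registered once: the wrapped release runs only after the last
  // reference to the shared Cleanable is gone.
  shared->RegisterCleanup(&ReleaseEntry, cache, handle);
  // Each slice's cleanup now just drops one shared reference instead
  // of doing its own cache Release().
  shared.RegisterCopyWith(slice1);
  shared.RegisterCopyWith(slice2);
  // `shared` going out of scope drops the local reference.
}
```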

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9899

Test Plan:
Added unit tests for SharedCleanablePtr.

Greatly enhanced ability of existing tests to detect cache use-after-free.
* Release PinnableSlices from MultiGet as they are read rather than in
bulk (in db_test_util wrapper).
* In ASAN build, default to using a trivially small LRUCache for
block_cache so that entries are immediately erased when unreferenced
(sketch below). (Updated two tests that depend on caching.) New ASAN
testsuite running time seems OK to me.
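
For reference, forcing immediate eviction of unreferenced entries can
look roughly like this (a sketch, not the exact test code; the 1-byte
capacity shown is an assumption):

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/table.h"

// Sketch: a block cache so small that any unreferenced entry is
// evicted immediately, turning use-after-release into an error that
// ASAN can see.
ROCKSDB_NAMESPACE::BlockBasedTableOptions TinyCacheTableOptions() {
  ROCKSDB_NAMESPACE::BlockBasedTableOptions bbto;
  bbto.block_cache = ROCKSDB_NAMESPACE::NewLRUCache(/*capacity=*/1);
  return bbto;
}
```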

If I introduce a bug into my implementation where we skip the shared
cleanups on block reuse, ASAN detects the bug in
`db_basic_test *MultiGet*`. If I remove either of the above testing
enhancements, the bug is not detected.

Consider for follow-up work: manipulate or randomize ordering of
PinnableSlice use and release from MultiGet db_test_util wrapper. But in
typical cases, natural ordering gives pretty good functional coverage.

Performance test:
In the extreme (but possible) case of MultiGetting the same or adjacent keys
in a batch, throughput can improve by an order of magnitude.
`./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb -readonly -num=5 -duration=10 -threads=20 -multiread_batched -batch_size=200`
Before ops/sec, num=5: 1,384,394
After ops/sec, num=5: 16,027,257
Before ops/sec, num=500: 6,423,720
After ops/sec, num=500: 10,658,794

Also note that previously, with high parallelism, having query keys
concentrated in a single block was worse than spreading them out a bit. Now
concentrated in a single block is faster than spread out, which is hopefully
consistent with natural expectation.

Random query performance: with num=1000000, over 999 x 10s runs, before & after running simultaneously (each -threads=12):
Before: multireadrandom [AVG    999 runs] : 1088699 (± 7344) ops/sec;  120.4 (± 0.8 ) MB/sec
After: multireadrandom [AVG    999 runs] : 1090402 (± 7230) ops/sec;  120.6 (± 0.8 ) MB/sec
Possibly better, possibly in the noise.

Reviewed By: anand1976

Differential Revision: D35907003

Pulled By: pdillinger

fbshipit-source-id: bbd244d703649a8ca12d476f2d03853ed9d1a17e
2022-04-26 21:59:24 -07:00


// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
#pragma once
#include <algorithm>
#include <array>
#include <string>
#include "db/dbformat.h"
#include "db/lookup_key.h"
#include "db/merge_context.h"
#include "rocksdb/env.h"
#include "rocksdb/statistics.h"
#include "rocksdb/types.h"
#include "util/autovector.h"
#include "util/math.h"
namespace ROCKSDB_NAMESPACE {
class GetContext;
struct KeyContext {
  const Slice* key;
  LookupKey* lkey;
  Slice ukey_with_ts;
  Slice ukey_without_ts;
  Slice ikey;
  ColumnFamilyHandle* column_family;
  Status* s;
  MergeContext merge_context;
  SequenceNumber max_covering_tombstone_seq;
  bool key_exists;
  bool is_blob_index;
  void* cb_arg;
  PinnableSlice* value;
  std::string* timestamp;
  GetContext* get_context;

  KeyContext(ColumnFamilyHandle* col_family, const Slice& user_key,
             PinnableSlice* val, std::string* ts, Status* stat)
      : key(&user_key),
        lkey(nullptr),
        column_family(col_family),
        s(stat),
        max_covering_tombstone_seq(0),
        key_exists(false),
        is_blob_index(false),
        cb_arg(nullptr),
        value(val),
        timestamp(ts),
        get_context(nullptr) {}

  KeyContext() = default;
};
// The MultiGetContext class is a container for the sorted list of keys that
// we need to lookup in a batch. Its main purpose is to make batch execution
// easier by allowing various stages of the MultiGet lookups to operate on
// subsets of keys, potentially non-contiguous. In order to accomplish this,
// it defines the following classes -
//
// MultiGetContext::Range
// MultiGetContext::Range::Iterator
// MultiGetContext::Range::IteratorWrapper
//
// Here is an example of how this can be used -
//
// {
//    MultiGetContext ctx(...);
//    MultiGetContext::Range range = ctx.GetMultiGetRange();
//
//    // Iterate to determine some subset of the keys
//    MultiGetContext::Range::Iterator start = range.begin();
//    MultiGetContext::Range::Iterator end = ...;
//
//    // Make a new range with a subset of keys
//    MultiGetContext::Range subrange(range, start, end);
//
//    // Define an auxiliary vector, if needed, to hold additional data for
//    // each key
//    std::array<Foo, MultiGetContext::MAX_BATCH_SIZE> aux;
//
//    // Iterate over the subrange and the auxiliary vector simultaneously
//    MultiGetContext::Range::Iterator iter = subrange.begin();
//    for (; iter != subrange.end(); ++iter) {
//      KeyContext& key = *iter;
//      Foo& aux_key = aux[iter.index()];
//      ...
//    }
// }
class MultiGetContext {
 public:
  // Limit the number of keys in a batch to this number. Benchmarks show that
  // there is negligible benefit for batches exceeding this. Keeping this
  // <= 32 simplifies iteration, as well as reducing the amount of stack
  // allocation that needs to be performed.
  static const int MAX_BATCH_SIZE = 32;

  // A bitmask with more than MAX_BATCH_SIZE bits, so that
  // Mask{1} << MAX_BATCH_SIZE is well defined
  using Mask = uint64_t;
  static_assert(MAX_BATCH_SIZE < sizeof(Mask) * 8);

  MultiGetContext(autovector<KeyContext*, MAX_BATCH_SIZE>* sorted_keys,
                  size_t begin, size_t num_keys, SequenceNumber snapshot,
                  const ReadOptions& read_opts)
      : num_keys_(num_keys),
        value_mask_(0),
        value_size_(0),
        lookup_key_ptr_(reinterpret_cast<LookupKey*>(lookup_key_stack_buf)) {
    assert(num_keys <= MAX_BATCH_SIZE);
    if (num_keys > MAX_LOOKUP_KEYS_ON_STACK) {
      lookup_key_heap_buf.reset(new char[sizeof(LookupKey) * num_keys]);
      lookup_key_ptr_ =
          reinterpret_cast<LookupKey*>(lookup_key_heap_buf.get());
    }
    for (size_t iter = 0; iter != num_keys_; ++iter) {
      // autovector may not be contiguous storage, so make a copy
      sorted_keys_[iter] = (*sorted_keys)[begin + iter];
      sorted_keys_[iter]->lkey = new (&lookup_key_ptr_[iter])
          LookupKey(*sorted_keys_[iter]->key, snapshot, read_opts.timestamp);
      sorted_keys_[iter]->ukey_with_ts = sorted_keys_[iter]->lkey->user_key();
      sorted_keys_[iter]->ukey_without_ts = StripTimestampFromUserKey(
          sorted_keys_[iter]->lkey->user_key(),
          read_opts.timestamp == nullptr ? 0 : read_opts.timestamp->size());
      sorted_keys_[iter]->ikey = sorted_keys_[iter]->lkey->internal_key();
    }
  }

  ~MultiGetContext() {
    for (size_t i = 0; i < num_keys_; ++i) {
      lookup_key_ptr_[i].~LookupKey();
    }
  }
 private:
  static const int MAX_LOOKUP_KEYS_ON_STACK = 16;
  alignas(alignof(LookupKey))
      char lookup_key_stack_buf[sizeof(LookupKey) * MAX_LOOKUP_KEYS_ON_STACK];
  std::array<KeyContext*, MAX_BATCH_SIZE> sorted_keys_;
  size_t num_keys_;
  // Bit i is set once the final value for key i has been found
  Mask value_mask_;
  // Running total of the sizes of values found so far
  uint64_t value_size_;
  std::unique_ptr<char[]> lookup_key_heap_buf;
  LookupKey* lookup_key_ptr_;
 public:
  // MultiGetContext::Range - Specifies a range of keys, by start and end
  // index, from the parent MultiGetContext. Each range contains a bit vector
  // that indicates whether the corresponding keys need to be processed or
  // skipped. A Range object can be copy constructed, and the new object
  // inherits the original Range's bit vector. This is useful for
  // progressively skipping keys as the lookup goes through various stages.
  // For example, when looking up keys in the same SST file, a Range is
  // created excluding keys not belonging to that file. A new Range is then
  // copy constructed and individual keys are skipped based on bloom filter
  // lookup.
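  //
  // For illustration only (`file_range`, `filter_range` and `MayMatch` are
  // hypothetical names, not used elsewhere in this file):
  //
  //   // Keys belonging to one SST file
  //   Range file_range(range, start, end);
  //   // Copy construct, then progressively skip via bloom filter results
  //   Range filter_range(file_range, file_range.begin(), file_range.end());
  //   for (auto iter = filter_range.begin(); iter != filter_range.end();
  //        ++iter) {
  //     if (!MayMatch(iter->ukey_without_ts)) {
  //       filter_range.SkipKey(iter);
  //     }
  //   }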
  class Range {
   public:
    // MultiGetContext::Range::Iterator - A forward iterator over the keys in
    // a Range that are neither skipped nor already resolved to a final
    // value. The latter condition is tracked by MultiGetContext::value_mask_
    class Iterator {
     public:
      // -- iterator traits
      using self_type = Iterator;
      using value_type = KeyContext;
      using reference = KeyContext&;
      using pointer = KeyContext*;
      using difference_type = int;
      using iterator_category = std::forward_iterator_tag;

      Iterator(const Range* range, size_t idx)
          : range_(range), ctx_(range->ctx_), index_(idx) {
        while (index_ < range_->end_ &&
               (Mask{1} << index_) &
                   (range_->ctx_->value_mask_ | range_->skip_mask_))
          index_++;
      }

      Iterator(const Iterator&) = default;
      Iterator& operator=(const Iterator&) = default;

      Iterator& operator++() {
        while (++index_ < range_->end_ &&
               (Mask{1} << index_) &
                   (range_->ctx_->value_mask_ | range_->skip_mask_))
          ;
        return *this;
      }

      bool operator==(Iterator other) const {
        assert(range_->ctx_ == other.range_->ctx_);
        return index_ == other.index_;
      }

      bool operator!=(Iterator other) const {
        assert(range_->ctx_ == other.range_->ctx_);
        return index_ != other.index_;
      }

      KeyContext& operator*() {
        assert(index_ < range_->end_ && index_ >= range_->start_);
        return *(ctx_->sorted_keys_[index_]);
      }

      KeyContext* operator->() {
        assert(index_ < range_->end_ && index_ >= range_->start_);
        return ctx_->sorted_keys_[index_];
      }

      size_t index() { return index_; }

     private:
      friend Range;
      const Range* range_;
      const MultiGetContext* ctx_;
      size_t index_;
    };
    Range(const Range& mget_range, const Iterator& first,
          const Iterator& last) {
      ctx_ = mget_range.ctx_;
      start_ = first.index_;
      end_ = last.index_;
      skip_mask_ = mget_range.skip_mask_;
      assert(start_ < 64);
      assert(end_ < 64);
    }

    Range() = default;

    Iterator begin() const { return Iterator(this, start_); }
    Iterator end() const { return Iterator(this, end_); }
    bool empty() const { return RemainingMask() == 0; }

    void SkipIndex(size_t index) { skip_mask_ |= Mask{1} << index; }
    void SkipKey(const Iterator& iter) { SkipIndex(iter.index_); }

    bool IsKeySkipped(const Iterator& iter) const {
      return skip_mask_ & (Mask{1} << iter.index_);
    }

    // Update the value_mask_ in MultiGetContext so it's
    // immediately reflected in all the Range Iterators
    void MarkKeyDone(Iterator& iter) {
      ctx_->value_mask_ |= (Mask{1} << iter.index_);
    }

    bool CheckKeyDone(Iterator& iter) const {
      return ctx_->value_mask_ & (Mask{1} << iter.index_);
    }

    uint64_t KeysLeft() const { return BitsSetToOne(RemainingMask()); }

    void AddSkipsFrom(const Range& other) {
      assert(ctx_ == other.ctx_);
      skip_mask_ |= other.skip_mask_;
    }

    uint64_t GetValueSize() { return ctx_->value_size_; }

    void AddValueSize(uint64_t value_size) { ctx_->value_size_ += value_size; }

   private:
    friend MultiGetContext;
    MultiGetContext* ctx_;
    size_t start_;
    size_t end_;
    Mask skip_mask_;

    Range(MultiGetContext* ctx, size_t num_keys)
        : ctx_(ctx), start_(0), end_(num_keys), skip_mask_(0) {
      assert(num_keys < 64);
    }

    // Bits for keys in [start_, end_) that are neither skipped nor already
    // resolved to a final value, i.e. the keys still needing processing
    Mask RemainingMask() const {
      return (((Mask{1} << end_) - 1) & ~((Mask{1} << start_) - 1) &
              ~(ctx_->value_mask_ | skip_mask_));
    }
  };

  // Return the initial range that encompasses all the keys in the batch
  Range GetMultiGetRange() { return Range(this, num_keys_); }
};
} // namespace ROCKSDB_NAMESPACE