rocksdb/table/get_context.h
anand76 fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00

141 lines
4.6 KiB
C++

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
#pragma once
#include <string>
#include <db/dbformat.h>
#include "db/merge_context.h"
#include "db/read_callback.h"
#include "rocksdb/env.h"
#include "rocksdb/statistics.h"
#include "rocksdb/types.h"
#include "table/block.h"
namespace rocksdb {
class MergeContext;
class PinnedIteratorsManager;
struct GetContextStats {
uint64_t num_cache_hit = 0;
uint64_t num_cache_index_hit = 0;
uint64_t num_cache_data_hit = 0;
uint64_t num_cache_filter_hit = 0;
uint64_t num_cache_compression_dict_hit = 0;
uint64_t num_cache_index_miss = 0;
uint64_t num_cache_filter_miss = 0;
uint64_t num_cache_data_miss = 0;
uint64_t num_cache_compression_dict_miss = 0;
uint64_t num_cache_bytes_read = 0;
uint64_t num_cache_miss = 0;
uint64_t num_cache_add = 0;
uint64_t num_cache_bytes_write = 0;
uint64_t num_cache_index_add = 0;
uint64_t num_cache_index_bytes_insert = 0;
uint64_t num_cache_data_add = 0;
uint64_t num_cache_data_bytes_insert = 0;
uint64_t num_cache_filter_add = 0;
uint64_t num_cache_filter_bytes_insert = 0;
uint64_t num_cache_compression_dict_add = 0;
uint64_t num_cache_compression_dict_bytes_insert = 0;
};
class GetContext {
public:
enum GetState {
kNotFound,
kFound,
kDeleted,
kCorrupt,
kMerge, // saver contains the current merge result (the operands)
kBlobIndex,
};
GetContextStats get_context_stats_;
GetContext(const Comparator* ucmp, const MergeOperator* merge_operator,
Logger* logger, Statistics* statistics, GetState init_state,
const Slice& user_key, PinnableSlice* value, bool* value_found,
MergeContext* merge_context,
SequenceNumber* max_covering_tombstone_seq, Env* env,
SequenceNumber* seq = nullptr,
PinnedIteratorsManager* _pinned_iters_mgr = nullptr,
ReadCallback* callback = nullptr, bool* is_blob_index = nullptr);
GetContext() = default;
void MarkKeyMayExist();
// Records this key, value, and any meta-data (such as sequence number and
// state) into this GetContext.
//
// If the parsed_key matches the user key that we are looking for, sets
// mathced to true.
//
// Returns True if more keys need to be read (due to merges) or
// False if the complete value has been found.
bool SaveValue(const ParsedInternalKey& parsed_key, const Slice& value,
bool* matched, Cleanable* value_pinner = nullptr);
// Simplified version of the previous function. Should only be used when we
// know that the operation is a Put.
void SaveValue(const Slice& value, SequenceNumber seq);
GetState State() const { return state_; }
SequenceNumber* max_covering_tombstone_seq() {
return max_covering_tombstone_seq_;
}
PinnedIteratorsManager* pinned_iters_mgr() { return pinned_iters_mgr_; }
// If a non-null string is passed, all the SaveValue calls will be
// logged into the string. The operations can then be replayed on
// another GetContext with replayGetContextLog.
void SetReplayLog(std::string* replay_log) { replay_log_ = replay_log; }
// Do we need to fetch the SequenceNumber for this key?
bool NeedToReadSequence() const { return (seq_ != nullptr); }
bool sample() const { return sample_; }
bool CheckCallback(SequenceNumber seq) {
if (callback_) {
return callback_->IsVisible(seq);
}
return true;
}
void ReportCounters();
private:
const Comparator* ucmp_;
const MergeOperator* merge_operator_;
// the merge operations encountered;
Logger* logger_;
Statistics* statistics_;
GetState state_;
Slice user_key_;
PinnableSlice* pinnable_val_;
bool* value_found_; // Is value set correctly? Used by KeyMayExist
MergeContext* merge_context_;
SequenceNumber* max_covering_tombstone_seq_;
Env* env_;
// If a key is found, seq_ will be set to the SequenceNumber of most recent
// write to the key or kMaxSequenceNumber if unknown
SequenceNumber* seq_;
std::string* replay_log_;
// Used to temporarily pin blocks when state_ == GetContext::kMerge
PinnedIteratorsManager* pinned_iters_mgr_;
ReadCallback* callback_;
bool sample_;
bool* is_blob_index_;
};
void replayGetContextLog(const Slice& replay_log, const Slice& user_key,
GetContext* get_context,
Cleanable* value_pinner = nullptr);
} // namespace rocksdb