68a8e6b8fa
Summary:
This diff updates the code to pin the merge operator operands while the merge operation is in progress, which eliminates the memcpy cost. To do that, we need a new public API for FullMerge that replaces std::deque<std::string> with std::vector<Slice>.

This diff is stacked on top of D56493 and D56511.

In this diff we:
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput, which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand

```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s

[master]
readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```

```
[Everything in Memtable | 10K operands | 10 KB each | 10 operands per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s

[master]
readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s
readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s
readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s
readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s
readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s
```

```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]

[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s
readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s

[master]
readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s
readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```

```
[Everything in Block cache | 10K operands | 10 KB each | 10 operands per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions

[FullMergeV2]
readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s
readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s

[master]
readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s
readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: lovro, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57075
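The last point of the summary (FullMergeV2 output may be an existing operand) can be illustrated with a minimal sketch of a "max" style merge. The types `SketchSlice`, `SketchMergeInput` and `SketchMergeOutput` and the function `MaxFullMergeV2` are hypothetical stand-ins, reduced so the example compiles without RocksDB; they are not the real `rocksdb::Slice` or `MergeOperator` types, and this is not necessarily how the built-in "max" operator is written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for rocksdb::Slice: a non-owning view of bytes.
struct SketchSlice {
  const char* data = nullptr;
  std::size_t size = 0;
  SketchSlice() = default;
  explicit SketchSlice(const std::string& s) : data(s.data()), size(s.size()) {}

  // Three-way bytewise comparison: <0, 0 or >0.
  int compare(const SketchSlice& o) const {
    const std::size_t n = size < o.size ? size : o.size;
    const int r = (n == 0) ? 0 : std::memcmp(data, o.data, n);
    if (r != 0) return r;
    return size < o.size ? -1 : (size > o.size ? 1 : 0);
  }
};

// Hypothetical stand-ins for MergeOperationInput / MergeOperationOutput,
// keeping only the fields this sketch needs.
struct SketchMergeInput {
  const SketchSlice* existing_value;  // nullptr: the key had no value
  const std::vector<SketchSlice>& operand_list;
};

struct SketchMergeOutput {
  std::string& new_value;         // used when a fresh result must be built
  SketchSlice& existing_operand;  // used when the result is an input as-is
};

// The result of a "max" merge is always one of the inputs, so the sketch
// only points existing_operand at the winner and never copies into new_value.
bool MaxFullMergeV2(const SketchMergeInput& in, SketchMergeOutput* out) {
  assert(out != nullptr);
  SketchSlice& max = out->existing_operand;
  if (in.existing_value != nullptr) {
    max = *in.existing_value;
  }
  for (const SketchSlice& op : in.operand_list) {
    if (max.data == nullptr || max.compare(op) < 0) {
      max = op;
    }
  }
  return true;
}
```

Because the winning operand is returned by reference rather than copied, the caller has to keep the operand memory alive (pinned) until it has consumed the result, which is the constraint the summary describes.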
230 lines
10 KiB
C++
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under the BSD-style license found in the
// LICENSE file in the root directory of this source tree. An additional grant
// of patent rights can be found in the PATENTS file in the same directory.

#ifndef STORAGE_ROCKSDB_INCLUDE_MERGE_OPERATOR_H_
#define STORAGE_ROCKSDB_INCLUDE_MERGE_OPERATOR_H_

#include <deque>
#include <memory>
#include <string>
#include <vector>

#include "rocksdb/slice.h"

namespace rocksdb {

class Slice;
class Logger;

// The Merge Operator
//
// Essentially, a MergeOperator specifies the SEMANTICS of a merge, which only
// the client knows. It could be numeric addition, list append, string
// concatenation, edit data structure, ... , anything.
// The library, on the other hand, is concerned with the exercise of this
// interface, at the right time (during get, iteration, compaction...)
//
// To use merge, the client needs to provide an object implementing one of
// the following interfaces:
//  a) AssociativeMergeOperator - for most simple semantics (always take
//    two values, and merge them into one value, which is then put back
//    into rocksdb); numeric addition and string concatenation are examples;
//
//  b) MergeOperator - the generic class for all the more abstract / complex
//    operations; one method (FullMergeV2) to merge a Put/Delete value with a
//    merge operand; and another method (PartialMerge) that merges multiple
//    operands together. This is especially useful if your key values have
//    complex structures but you would still like to support client-specific
//    incremental updates.
//
// AssociativeMergeOperator is simpler to implement. MergeOperator is simply
// more powerful.
//
// Refer to rocksdb-merge wiki for more details and example implementations.
//
class MergeOperator {
 public:
  virtual ~MergeOperator() {}

  // Gives the client a way to express the read -> modify -> write semantics
  // key:           (IN)  The key that's associated with this merge operation.
  //                      Client could multiplex the merge operator based on it
  //                      if the key space is partitioned and different subspaces
  //                      refer to different types of data which have different
  //                      merge operation semantics
  // existing_value:(IN)  null indicates that the key does not exist before
  //                      this op
  // operand_list:  (IN)  the sequence of merge operations to apply, front()
  //                      first.
  // new_value:     (OUT) Client is responsible for filling the merge result
  //                      here. The string that new_value is pointing to will
  //                      be empty.
  // logger:        (IN)  Client could use this to log errors during merge.
  //
  // Return true on success.
  // All values passed in will be client-specific values. So if this method
  // returns false, it is because client specified bad data or there was
  // internal corruption. This will be treated as an error by the library.
  //
  // Also make use of the *logger for error messages.
  virtual bool FullMerge(const Slice& key,
                         const Slice* existing_value,
                         const std::deque<std::string>& operand_list,
                         std::string* new_value,
                         Logger* logger) const {
    // deprecated, please use FullMergeV2()
    assert(false);
    return false;
  }

  struct MergeOperationInput {
    explicit MergeOperationInput(const Slice& _key,
                                 const Slice* _existing_value,
                                 const std::vector<Slice>& _operand_list,
                                 Logger* _logger)
        : key(_key),
          existing_value(_existing_value),
          operand_list(_operand_list),
          logger(_logger) {}

    // The key associated with the merge operation.
    const Slice& key;
    // The existing value of the current key, nullptr means that the
    // value does not exist.
    const Slice* existing_value;
    // A list of operands to apply.
    const std::vector<Slice>& operand_list;
    // Logger could be used by client to log any errors that happen during
    // the merge operation.
    Logger* logger;
  };

  struct MergeOperationOutput {
    explicit MergeOperationOutput(std::string& _new_value,
                                  Slice& _existing_operand)
        : new_value(_new_value), existing_operand(_existing_operand) {}

    // Client is responsible for filling the merge result here.
    std::string& new_value;
    // If the merge result is one of the existing operands (or existing_value),
    // client can set this field to the operand (or existing_value) instead of
    // using new_value.
    Slice& existing_operand;
  };

  virtual bool FullMergeV2(const MergeOperationInput& merge_in,
                           MergeOperationOutput* merge_out) const;

  // This function performs merge(left_op, right_op)
  // when both the operands are themselves merge operation types
  // that you would have passed to a DB::Merge() call in the same order
  // (i.e. DB::Merge(key, left_op), followed by DB::Merge(key, right_op)).
  //
  // PartialMerge should combine them into a single merge operation that is
  // saved into *new_value, and then it should return true.
  // *new_value should be constructed such that a call to
  // DB::Merge(key, *new_value) would yield the same result as a call
  // to DB::Merge(key, left_op) followed by DB::Merge(key, right_op).
  //
  // The string that new_value is pointing to will be empty.
  //
  // The default implementation of PartialMergeMulti will use this function
  // as a helper, for backward compatibility. Any subclass of
  // MergeOperator should either implement PartialMerge or PartialMergeMulti,
  // although implementing PartialMergeMulti is suggested as it is in general
  // more efficient to merge multiple operands at a time instead of two
  // operands at a time.
  //
  // If it is impossible or infeasible to combine the two operations,
  // leave new_value unchanged and return false. The library will
  // internally keep track of the operations, and apply them in the
  // correct order once a base-value (a Put/Delete/End-of-Database) is seen.
  //
  // TODO: Presently there is no way to differentiate between error/corruption
  // and simply "return false". For now, the client should simply return
  // false in any case it cannot perform partial-merge, regardless of reason.
  // If there is corruption in the data, handle it in the FullMergeV2() function
  // and return false there. The default implementation of PartialMerge will
  // always return false.
  virtual bool PartialMerge(const Slice& key, const Slice& left_operand,
                            const Slice& right_operand, std::string* new_value,
                            Logger* logger) const {
    return false;
  }

  // This function performs merge when all the operands are themselves merge
  // operation types that you would have passed to a DB::Merge() call in the
  // same order (front() first)
  // (i.e. DB::Merge(key, operand_list[0]), followed by
  //  DB::Merge(key, operand_list[1]), ...)
  //
  // PartialMergeMulti should combine them into a single merge operation that is
  // saved into *new_value, and then it should return true. *new_value should
  // be constructed such that a call to DB::Merge(key, *new_value) would yield
  // the same result as subsequent individual calls to DB::Merge(key, operand)
  // for each operand in operand_list from front() to back().
  //
  // The string that new_value is pointing to will be empty.
  //
  // The PartialMergeMulti function will be called only when the list of
  // operands is long enough. The minimum number of operands that will be
  // passed to the function is specified by the "min_partial_merge_operands"
  // option.
  //
  // In the default implementation, PartialMergeMulti will invoke PartialMerge
  // multiple times, where each time it only merges two operands. Developers
  // should either implement PartialMergeMulti, or implement PartialMerge, which
  // serves as the helper function of the default PartialMergeMulti.
  virtual bool PartialMergeMulti(const Slice& key,
                                 const std::deque<Slice>& operand_list,
                                 std::string* new_value, Logger* logger) const;

  // The name of the MergeOperator. Used to check for MergeOperator
  // mismatches (i.e., a DB created with one MergeOperator is
  // accessed using a different MergeOperator)
  // TODO: the name is currently not stored persistently and thus
  //       no checking is enforced. Client is responsible for providing
  //       consistent MergeOperator between DB opens.
  virtual const char* Name() const = 0;
};

// The simpler, associative merge operator.
class AssociativeMergeOperator : public MergeOperator {
 public:
  virtual ~AssociativeMergeOperator() {}

  // Gives the client a way to express the read -> modify -> write semantics
  // key:           (IN)  The key that's associated with this merge operation.
  // existing_value:(IN)  null indicates the key does not exist before this op
  // value:         (IN)  the value to update/merge the existing_value with
  // new_value:     (OUT) Client is responsible for filling the merge result
  //                      here. The string that new_value is pointing to will
  //                      be empty.
  // logger:        (IN)  Client could use this to log errors during merge.
  //
  // Return true on success.
  // All values passed in will be client-specific values. So if this method
  // returns false, it is because client specified bad data or there was
  // internal corruption. The client should assume that this will be treated
  // as an error by the library.
  virtual bool Merge(const Slice& key,
                     const Slice* existing_value,
                     const Slice& value,
                     std::string* new_value,
                     Logger* logger) const = 0;

 private:
  // Default implementations of the MergeOperator functions
  virtual bool FullMergeV2(const MergeOperationInput& merge_in,
                           MergeOperationOutput* merge_out) const override;

  virtual bool PartialMerge(const Slice& key,
                            const Slice& left_operand,
                            const Slice& right_operand,
                            std::string* new_value,
                            Logger* logger) const override;
};

}  // namespace rocksdb

#endif  // STORAGE_ROCKSDB_INCLUDE_MERGE_OPERATOR_H_
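The header comments say that the default PartialMergeMulti invokes PartialMerge repeatedly, two operands at a time. The following is a minimal sketch of that folding idea only, not the library's implementation. It uses `std::string` in place of `Slice`, and the names `PairwiseMerge`, `FoldPartialMerge` and `AppendWithComma` are hypothetical, introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical pairwise merge callback, loosely modeled on PartialMerge:
// combine left and right into *new_value and return true, or return false
// if the two operands cannot be combined.
using PairwiseMerge = std::function<bool(
    const std::string& left, const std::string& right, std::string* new_value)>;

// Folds operand_list from front() to back() using the pairwise callback.
// On failure *new_value is left unchanged, matching the contract described
// in the header comments.
bool FoldPartialMerge(const std::deque<std::string>& operand_list,
                      const PairwiseMerge& partial_merge,
                      std::string* new_value) {
  assert(new_value != nullptr);
  if (operand_list.size() < 2) {
    return false;  // nothing to combine
  }
  std::string accumulated = operand_list[0];
  for (std::size_t i = 1; i < operand_list.size(); ++i) {
    std::string merged;  // each pairwise call starts from an empty string
    if (!partial_merge(accumulated, operand_list[i], &merged)) {
      return false;
    }
    accumulated = std::move(merged);
  }
  *new_value = std::move(accumulated);
  return true;
}

// Example pairwise merge: comma-separated list append.
bool AppendWithComma(const std::string& left, const std::string& right,
                     std::string* new_value) {
  *new_value = left + "," + right;
  return true;
}
```

Folding pairwise creates one intermediate string per step, which is one reason the header suggests implementing PartialMergeMulti directly when many operands can be combined in a single pass.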