2016-02-09 15:12:00 -08:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2017-07-15 16:03:42 -07:00
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
2015-09-10 14:35:25 -07:00
|
|
|
#pragma once
|
|
|
|
|
|
|
|
#include <algorithm>
|
|
|
|
#include <deque>
|
|
|
|
#include <string>
|
WritePrepared: fix two versions in compaction see different status for released snapshots (#4890)
Summary:
Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios:
Scenario 1:
key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1.
The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key.
Scenario 2:
key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here.
The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890
Differential Revision: D13705538
Pulled By: maysamyabandeh
fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4
2019-01-18 17:20:13 -08:00
|
|
|
#include <unordered_set>
|
2015-09-10 14:35:25 -07:00
|
|
|
#include <vector>
|
|
|
|
|
2019-05-31 11:52:59 -07:00
|
|
|
#include "db/compaction/compaction.h"
|
2019-05-31 15:21:36 -07:00
|
|
|
#include "db/compaction/compaction_iteration_stats.h"
|
2015-09-10 14:35:25 -07:00
|
|
|
#include "db/merge_helper.h"
|
Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>
This diff is stacked on top of D56493 and D56511
In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand
```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s
[master]
readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```
```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s
[master]
readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s
readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s
readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s
readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s
readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]
[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s
readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s
[master]
readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s
readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
[FullMergeV2]
readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s
readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s
[master]
readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s
readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: lovro, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
|
|
|
#include "db/pinned_iterators_manager.h"
|
2018-12-17 17:26:56 -08:00
|
|
|
#include "db/range_del_aggregator.h"
|
2017-10-06 10:26:38 -07:00
|
|
|
#include "db/snapshot_checker.h"
|
2017-05-17 11:32:26 -07:00
|
|
|
#include "options/cf_options.h"
|
2015-09-10 14:35:25 -07:00
|
|
|
#include "rocksdb/compaction_filter.h"
|
|
|
|
|
|
|
|
namespace rocksdb {
|
|
|
|
|
2019-05-03 17:26:20 -07:00
|
|
|
// This callback can be used to refresh the snapshot list from the db. It
|
|
|
|
// includes logics to exponentially decrease the refresh rate to limit the
|
|
|
|
// overhead of refresh.
|
|
|
|
class SnapshotListFetchCallback {
|
|
|
|
public:
|
|
|
|
SnapshotListFetchCallback(Env* env, uint64_t snap_refresh_nanos,
|
|
|
|
size_t every_nth_key = 1024)
|
|
|
|
: timer_(env, /*auto restart*/ true),
|
|
|
|
snap_refresh_nanos_(snap_refresh_nanos),
|
|
|
|
every_nth_key_minus_one_(every_nth_key - 1) {
|
|
|
|
assert(every_nth_key > 0);
|
|
|
|
assert((ceil(log2(every_nth_key)) == floor(log2(every_nth_key))));
|
|
|
|
}
|
|
|
|
// Refresh the snapshot list. snapshots will bre replacted with the new list.
|
|
|
|
// max is the upper bound. Note: this function will acquire the db_mutex_.
|
|
|
|
virtual void Refresh(std::vector<SequenceNumber>* snapshots,
|
|
|
|
SequenceNumber max) = 0;
|
|
|
|
inline bool TimeToRefresh(const size_t key_index) {
|
2019-08-20 11:38:15 -07:00
|
|
|
assert(snap_refresh_nanos_ != 0);
|
2019-05-03 17:26:20 -07:00
|
|
|
// skip the key if key_index % every_nth_key (which is of power 2) is not 0.
|
|
|
|
if ((key_index & every_nth_key_minus_one_) != 0) {
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
const uint64_t elapsed = timer_.ElapsedNanos();
|
|
|
|
auto ret = elapsed > snap_refresh_nanos_;
|
|
|
|
// pre-compute the next time threshold
|
|
|
|
if (ret) {
|
|
|
|
// inc next refresh period exponentially (by x4)
|
|
|
|
auto next_refresh_threshold = snap_refresh_nanos_ << 2;
|
|
|
|
// make sure the shift has not overflown the highest 1 bit
|
|
|
|
snap_refresh_nanos_ =
|
|
|
|
std::max(snap_refresh_nanos_, next_refresh_threshold);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
static constexpr SnapshotListFetchCallback* kDisabled = nullptr;
|
|
|
|
|
|
|
|
virtual ~SnapshotListFetchCallback() {}
|
|
|
|
|
|
|
|
private:
|
|
|
|
// Time since the callback was created
|
|
|
|
StopWatchNano timer_;
|
|
|
|
// The delay before calling ::Refresh. To be increased exponentially.
|
|
|
|
uint64_t snap_refresh_nanos_;
|
|
|
|
// Skip evey nth key. Number n if of power 2. The math will require n-1.
|
|
|
|
const uint64_t every_nth_key_minus_one_;
|
|
|
|
};
|
|
|
|
|
2015-09-10 14:35:25 -07:00
|
|
|
class CompactionIterator {
|
|
|
|
public:
|
2016-12-01 07:00:17 -08:00
|
|
|
// A wrapper around Compaction. Has a much smaller interface, only what
|
|
|
|
// CompactionIterator uses. Tests can override it.
|
|
|
|
class CompactionProxy {
|
|
|
|
public:
|
|
|
|
explicit CompactionProxy(const Compaction* compaction)
|
|
|
|
: compaction_(compaction) {}
|
|
|
|
|
|
|
|
virtual ~CompactionProxy() = default;
|
2018-03-05 13:08:17 -08:00
|
|
|
virtual int level(size_t /*compaction_input_level*/ = 0) const {
|
2016-12-01 07:00:17 -08:00
|
|
|
return compaction_->level();
|
|
|
|
}
|
|
|
|
virtual bool KeyNotExistsBeyondOutputLevel(
|
|
|
|
const Slice& user_key, std::vector<size_t>* level_ptrs) const {
|
|
|
|
return compaction_->KeyNotExistsBeyondOutputLevel(user_key, level_ptrs);
|
|
|
|
}
|
|
|
|
virtual bool bottommost_level() const {
|
|
|
|
return compaction_->bottommost_level();
|
|
|
|
}
|
|
|
|
virtual int number_levels() const { return compaction_->number_levels(); }
|
|
|
|
virtual Slice GetLargestUserKey() const {
|
|
|
|
return compaction_->GetLargestUserKey();
|
|
|
|
}
|
2017-05-17 11:32:26 -07:00
|
|
|
virtual bool allow_ingest_behind() const {
|
|
|
|
return compaction_->immutable_cf_options()->allow_ingest_behind;
|
|
|
|
}
|
Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:43:29 -07:00
|
|
|
virtual bool preserve_deletes() const {
|
|
|
|
return compaction_->immutable_cf_options()->preserve_deletes;
|
|
|
|
}
|
2016-12-01 07:00:17 -08:00
|
|
|
|
|
|
|
protected:
|
|
|
|
CompactionProxy() = default;
|
|
|
|
|
|
|
|
private:
|
|
|
|
const Compaction* compaction_;
|
|
|
|
};
|
|
|
|
|
2015-10-12 15:06:38 -07:00
|
|
|
CompactionIterator(InternalIterator* input, const Comparator* cmp,
|
2015-09-10 14:35:25 -07:00
|
|
|
MergeHelper* merge_helper, SequenceNumber last_sequence,
|
2015-12-08 12:25:48 -08:00
|
|
|
std::vector<SequenceNumber>* snapshots,
|
2017-10-06 10:26:38 -07:00
|
|
|
SequenceNumber earliest_write_conflict_snapshot,
|
|
|
|
const SnapshotChecker* snapshot_checker, Env* env,
|
2018-06-21 21:16:49 -07:00
|
|
|
bool report_detailed_time, bool expect_valid_internal_key,
|
2018-12-17 17:26:56 -08:00
|
|
|
CompactionRangeDelAggregator* range_del_agg,
|
2016-03-24 19:36:39 -07:00
|
|
|
const Compaction* compaction = nullptr,
|
2015-09-10 14:35:25 -07:00
|
|
|
const CompactionFilter* compaction_filter = nullptr,
|
Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:43:29 -07:00
|
|
|
const std::atomic<bool>* shutting_down = nullptr,
|
2019-05-03 17:26:20 -07:00
|
|
|
const SequenceNumber preserve_deletes_seqnum = 0,
|
|
|
|
SnapshotListFetchCallback* snap_list_callback = nullptr);
|
2015-09-10 14:35:25 -07:00
|
|
|
|
2016-12-01 07:00:17 -08:00
|
|
|
// Constructor with custom CompactionProxy, used for tests.
|
|
|
|
CompactionIterator(InternalIterator* input, const Comparator* cmp,
|
|
|
|
MergeHelper* merge_helper, SequenceNumber last_sequence,
|
|
|
|
std::vector<SequenceNumber>* snapshots,
|
2017-10-06 10:26:38 -07:00
|
|
|
SequenceNumber earliest_write_conflict_snapshot,
|
|
|
|
const SnapshotChecker* snapshot_checker, Env* env,
|
2018-06-21 21:16:49 -07:00
|
|
|
bool report_detailed_time, bool expect_valid_internal_key,
|
2018-12-17 17:26:56 -08:00
|
|
|
CompactionRangeDelAggregator* range_del_agg,
|
2016-12-01 07:00:17 -08:00
|
|
|
std::unique_ptr<CompactionProxy> compaction,
|
|
|
|
const CompactionFilter* compaction_filter = nullptr,
|
Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:43:29 -07:00
|
|
|
const std::atomic<bool>* shutting_down = nullptr,
|
2019-05-03 17:26:20 -07:00
|
|
|
const SequenceNumber preserve_deletes_seqnum = 0,
|
|
|
|
SnapshotListFetchCallback* snap_list_callback = nullptr);
|
2016-12-01 07:00:17 -08:00
|
|
|
|
Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>
This diff is stacked on top of D56493 and D56511
In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand
```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s
[master]
readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```
```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s
[master]
readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s
readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s
readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s
readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s
readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]
[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s
readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s
[master]
readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s
readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
[FullMergeV2]
readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s
readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s
[master]
readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s
readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: lovro, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
|
|
|
~CompactionIterator();
|
|
|
|
|
2015-09-10 14:35:25 -07:00
|
|
|
void ResetRecordCounts();
|
|
|
|
|
|
|
|
// Seek to the beginning of the compaction iterator output.
|
|
|
|
//
|
|
|
|
// REQUIRED: Call only once.
|
|
|
|
void SeekToFirst();
|
|
|
|
|
|
|
|
// Produces the next record in the compaction.
|
|
|
|
//
|
|
|
|
// REQUIRED: SeekToFirst() has been called.
|
|
|
|
void Next();
|
|
|
|
|
|
|
|
// Getters
|
|
|
|
const Slice& key() const { return key_; }
|
|
|
|
const Slice& value() const { return value_; }
|
|
|
|
const Status& status() const { return status_; }
|
|
|
|
const ParsedInternalKey& ikey() const { return ikey_; }
|
|
|
|
bool Valid() const { return valid_; }
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
|
|
|
const Slice& user_key() const { return current_user_key_; }
|
2016-11-28 11:44:40 -08:00
|
|
|
const CompactionIterationStats& iter_stats() const { return iter_stats_; }
|
2015-09-10 14:35:25 -07:00
|
|
|
|
|
|
|
private:
|
|
|
|
// Processes the input stream to find the next output
|
|
|
|
void NextFromInput();
|
2019-05-03 17:26:20 -07:00
|
|
|
// Process snapshots_ and assign related variables
|
|
|
|
void ProcessSnapshotList();
|
2015-09-10 14:35:25 -07:00
|
|
|
|
|
|
|
// Do last preparations before presenting the output to the callee. At this
|
|
|
|
// point this only zeroes out the sequence number if possible for better
|
|
|
|
// compression.
|
|
|
|
void PrepareOutput();
|
|
|
|
|
2017-10-06 10:26:38 -07:00
|
|
|
// Invoke compaction filter if needed.
|
|
|
|
void InvokeFilterIfNeeded(bool* need_skip, Slice* skip_until);
|
|
|
|
|
2015-09-10 14:35:25 -07:00
|
|
|
// Given a sequence number, return the sequence number of the
|
|
|
|
// earliest snapshot that this sequence number is visible in.
|
|
|
|
// The snapshots themselves are arranged in ascending order of
|
|
|
|
// sequence numbers.
|
|
|
|
// Employ a sequential search because the total number of
|
|
|
|
// snapshots are typically small.
|
|
|
|
inline SequenceNumber findEarliestVisibleSnapshot(
|
|
|
|
SequenceNumber in, SequenceNumber* prev_snapshot);
|
|
|
|
|
Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:43:29 -07:00
|
|
|
// Checks whether the currently seen ikey_ is needed for
|
|
|
|
// incremental (differential) snapshot and hence can't be dropped
|
|
|
|
// or seqnum be zero-ed out even if all other conditions for it are met.
|
|
|
|
inline bool ikeyNotNeededForIncrementalSnapshot();
|
|
|
|
|
2019-01-16 09:48:10 -08:00
|
|
|
inline bool KeyCommitted(SequenceNumber sequence) {
|
|
|
|
return snapshot_checker_ == nullptr ||
|
|
|
|
snapshot_checker_->CheckInSnapshot(sequence, kMaxSequenceNumber) ==
|
|
|
|
SnapshotCheckerResult::kInSnapshot;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool IsInEarliestSnapshot(SequenceNumber sequence);
|
|
|
|
|
2015-10-12 15:06:38 -07:00
|
|
|
InternalIterator* input_;
|
2015-09-10 14:35:25 -07:00
|
|
|
const Comparator* cmp_;
|
|
|
|
MergeHelper* merge_helper_;
|
2019-05-03 17:26:20 -07:00
|
|
|
std::vector<SequenceNumber>* snapshots_;
|
WritePrepared: fix two versions in compaction see different status for released snapshots (#4890)
Summary:
Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios:
Scenario 1:
key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1.
The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key.
Scenario 2:
key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here.
The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890
Differential Revision: D13705538
Pulled By: maysamyabandeh
fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4
2019-01-18 17:20:13 -08:00
|
|
|
// List of snapshots released during compaction.
|
|
|
|
// findEarliestVisibleSnapshot() find them out from return of
|
|
|
|
// snapshot_checker, and make sure they will not be returned as
|
|
|
|
// earliest visible snapshot of an older value.
|
|
|
|
// See WritePreparedTransactionTest::ReleaseSnapshotDuringCompaction3.
|
|
|
|
std::unordered_set<SequenceNumber> released_snapshots_;
|
2019-01-16 09:48:10 -08:00
|
|
|
std::vector<SequenceNumber>::const_iterator earliest_snapshot_iter_;
|
2015-12-08 12:25:48 -08:00
|
|
|
const SequenceNumber earliest_write_conflict_snapshot_;
|
2017-10-06 10:26:38 -07:00
|
|
|
const SnapshotChecker* const snapshot_checker_;
|
2015-09-10 14:35:25 -07:00
|
|
|
Env* env_;
|
2018-06-21 21:16:49 -07:00
|
|
|
bool report_detailed_time_;
|
2015-09-10 15:16:32 -07:00
|
|
|
bool expect_valid_internal_key_;
|
2018-12-17 17:26:56 -08:00
|
|
|
CompactionRangeDelAggregator* range_del_agg_;
|
2016-12-01 07:00:17 -08:00
|
|
|
std::unique_ptr<CompactionProxy> compaction_;
|
2015-09-10 14:35:25 -07:00
|
|
|
const CompactionFilter* compaction_filter_;
|
2017-01-11 15:01:21 -08:00
|
|
|
const std::atomic<bool>* shutting_down_;
|
Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:43:29 -07:00
|
|
|
const SequenceNumber preserve_deletes_seqnum_;
|
2015-09-10 14:35:25 -07:00
|
|
|
bool bottommost_level_;
|
|
|
|
bool valid_ = false;
|
2016-10-13 10:49:06 -07:00
|
|
|
bool visible_at_tip_;
|
2015-09-10 14:35:25 -07:00
|
|
|
SequenceNumber earliest_snapshot_;
|
|
|
|
SequenceNumber latest_snapshot_;
|
2019-01-16 09:48:10 -08:00
|
|
|
|
2015-09-10 14:35:25 -07:00
|
|
|
// State
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
|
|
|
//
|
|
|
|
// Points to a copy of the current compaction iterator output (current_key_)
|
|
|
|
// if valid_.
|
2015-09-10 14:35:25 -07:00
|
|
|
Slice key_;
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
|
|
|
// Points to the value in the underlying iterator that corresponds to the
|
|
|
|
// current output.
|
2015-09-10 14:35:25 -07:00
|
|
|
Slice value_;
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
|
|
|
// The status is OK unless compaction iterator encounters a merge operand
|
|
|
|
// while not having a merge operator defined.
|
2015-09-10 14:35:25 -07:00
|
|
|
Status status_;
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
|
|
|
// Stores the user key, sequence number and type of the current compaction
|
|
|
|
// iterator output (or current key in the underlying iterator during
|
|
|
|
// NextFromInput()).
|
2015-09-10 14:35:25 -07:00
|
|
|
ParsedInternalKey ikey_;
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
|
|
|
// Stores whether ikey_.user_key is valid. If set to false, the user key is
|
|
|
|
// not compared against the current key in the underlying iterator.
|
2015-09-10 14:35:25 -07:00
|
|
|
bool has_current_user_key_ = false;
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
|
|
|
bool at_next_ = false; // If false, the iterator
|
|
|
|
// Holds a copy of the current compaction iterator output (or current key in
|
|
|
|
// the underlying iterator during NextFromInput()).
|
|
|
|
IterKey current_key_;
|
|
|
|
Slice current_user_key_;
|
2015-09-10 14:35:25 -07:00
|
|
|
SequenceNumber current_user_key_sequence_;
|
|
|
|
SequenceNumber current_user_key_snapshot_;
|
2015-11-04 20:52:22 -08:00
|
|
|
|
|
|
|
// True if the iterator has already returned a record for the current key.
|
|
|
|
bool has_outputted_key_ = false;
|
|
|
|
|
|
|
|
// truncated the value of the next key and output it without applying any
|
|
|
|
// compaction rules. This is used for outputting a put after a single delete.
|
|
|
|
bool clear_and_output_next_key_ = false;
|
|
|
|
|
2015-09-10 14:35:25 -07:00
|
|
|
MergeOutputIterator merge_out_iter_;
|
Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>
This diff is stacked on top of D56493 and D56511
In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand
```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s
[master]
readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```
```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s
[master]
readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s
readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s
readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s
readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s
readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]
[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s
readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s
[master]
readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s
readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
[FullMergeV2]
readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s
readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s
[master]
readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s
readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: lovro, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
|
|
|
// PinnedIteratorsManager used to pin input_ Iterator blocks while reading
|
|
|
|
// merge operands and then releasing them after consuming them.
|
|
|
|
PinnedIteratorsManager pinned_iters_mgr_;
|
2015-09-10 14:35:25 -07:00
|
|
|
std::string compaction_filter_value_;
|
2016-12-01 07:00:17 -08:00
|
|
|
InternalKey compaction_filter_skip_until_;
|
2015-09-10 14:35:25 -07:00
|
|
|
// "level_ptrs" holds indices that remember which file of an associated
|
|
|
|
// level we were last checking during the last call to compaction->
|
|
|
|
// KeyNotExistsBeyondOutputLevel(). This allows future calls to the function
|
|
|
|
// to pick off where it left off since each subcompaction's key range is
|
|
|
|
// increasing so a later call to the function must be looking for a key that
|
|
|
|
// is in or beyond the last file checked during the previous call
|
|
|
|
std::vector<size_t> level_ptrs_;
|
2016-11-28 11:44:40 -08:00
|
|
|
CompactionIterationStats iter_stats_;
|
2017-01-11 15:01:21 -08:00
|
|
|
|
2017-10-06 10:26:38 -07:00
|
|
|
// Used to avoid purging uncommitted values. The application can specify
|
|
|
|
// uncommitted values by providing a SnapshotChecker object.
|
|
|
|
bool current_key_committed_;
|
2019-05-03 17:26:20 -07:00
|
|
|
SnapshotListFetchCallback* snap_list_callback_;
|
|
|
|
// number of distinct keys processed
|
|
|
|
size_t num_keys_ = 0;
|
2017-10-06 10:26:38 -07:00
|
|
|
|
2017-01-11 15:01:21 -08:00
|
|
|
bool IsShuttingDown() {
|
|
|
|
// This is a best-effort facility, so memory_order_relaxed is sufficient.
|
|
|
|
return shutting_down_ && shutting_down_->load(std::memory_order_relaxed);
|
|
|
|
}
|
2015-09-10 14:35:25 -07:00
|
|
|
};
|
|
|
|
} // namespace rocksdb
|