rocksdb/db/db_test_util.h
Peter Dillinger 78a309bf86 New Cache API for gathering statistics (#8225)
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:

* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)

I have enhanced cache_bench to validate this approach:

* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.

To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225

Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.

The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.

Baseline typical output:

```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662

Operation latency (ns):
Count: 80000000 Average: 11223.9494  StdDev: 29.61
Min: 0  Median: 7759.3973  Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[       0,       1 ]       68   0.000%   0.000%
(    2900,    4400 ]       89   0.000%   0.000%
(    4400,    6600 ] 33630240  42.038%  42.038% ########
(    6600,    9900 ] 18129842  22.662%  64.700% #####
(    9900,   14000 ]  7877533   9.847%  74.547% ##
(   14000,   22000 ] 15193238  18.992%  93.539% ####
(   22000,   33000 ]  3037061   3.796%  97.335% #
(   33000,   50000 ]  1626316   2.033%  99.368%
(   50000,   75000 ]   421532   0.527%  99.895%
(   75000,  110000 ]    56910   0.071%  99.966%
(  110000,  170000 ]    16134   0.020%  99.986%
(  170000,  250000 ]     5166   0.006%  99.993%
(  250000,  380000 ]     3017   0.004%  99.996%
(  380000,  570000 ]     1337   0.002%  99.998%
(  570000,  860000 ]      805   0.001%  99.999%
(  860000, 1200000 ]      319   0.000% 100.000%
( 1200000, 1900000 ]      231   0.000% 100.000%
( 1900000, 2900000 ]      100   0.000% 100.000%
( 2900000, 4300000 ]       39   0.000% 100.000%
( 4300000, 6500000 ]       16   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
```

New, gather_stats=false. Median thread ops/sec of 5 runs:

```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458

Operation latency (ns):
Count: 80000000 Average: 11298.1027  StdDev: 42.18
Min: 0  Median: 7722.0822  Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[       0,       1 ]      109   0.000%   0.000%
(    2900,    4400 ]      793   0.001%   0.001%
(    4400,    6600 ] 34054563  42.568%  42.569% #########
(    6600,    9900 ] 17482646  21.853%  64.423% ####
(    9900,   14000 ]  7908180   9.885%  74.308% ##
(   14000,   22000 ] 15032072  18.790%  93.098% ####
(   22000,   33000 ]  3237834   4.047%  97.145% #
(   33000,   50000 ]  1736882   2.171%  99.316%
(   50000,   75000 ]   446851   0.559%  99.875%
(   75000,  110000 ]    68251   0.085%  99.960%
(  110000,  170000 ]    18592   0.023%  99.983%
(  170000,  250000 ]     7200   0.009%  99.992%
(  250000,  380000 ]     3334   0.004%  99.997%
(  380000,  570000 ]     1393   0.002%  99.998%
(  570000,  860000 ]      700   0.001%  99.999%
(  860000, 1200000 ]      293   0.000% 100.000%
( 1200000, 1900000 ]      196   0.000% 100.000%
( 1900000, 2900000 ]       69   0.000% 100.000%
( 2900000, 4300000 ]       32   0.000% 100.000%
( 4300000, 6500000 ]       10   0.000% 100.000%
```

New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:

```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551

Operation latency (ns):
Count: 80000000 Average: 11311.2629  StdDev: 45.28
Min: 0  Median: 7686.5458  Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[       0,       1 ]       71   0.000%   0.000%
(    2900,    4400 ]      291   0.000%   0.000%
(    4400,    6600 ] 34492060  43.115%  43.116% #########
(    6600,    9900 ] 16727328  20.909%  64.025% ####
(    9900,   14000 ]  7845828   9.807%  73.832% ##
(   14000,   22000 ] 15510654  19.388%  93.220% ####
(   22000,   33000 ]  3216533   4.021%  97.241% #
(   33000,   50000 ]  1680859   2.101%  99.342%
(   50000,   75000 ]   439059   0.549%  99.891%
(   75000,  110000 ]    60540   0.076%  99.967%
(  110000,  170000 ]    14649   0.018%  99.985%
(  170000,  250000 ]     5242   0.007%  99.991%
(  250000,  380000 ]     3260   0.004%  99.995%
(  380000,  570000 ]     1599   0.002%  99.997%
(  570000,  860000 ]     1043   0.001%  99.999%
(  860000, 1200000 ]      471   0.001%  99.999%
( 1200000, 1900000 ]      275   0.000% 100.000%
( 1900000, 2900000 ]      143   0.000% 100.000%
( 2900000, 4300000 ]       60   0.000% 100.000%
( 4300000, 6500000 ]       27   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
( 9800000, 14000000 ]        1   0.000% 100.000%

Gather stats latency (us):
Count: 46 Average: 980387.5870  StdDev: 60911.18
Min: 879155  Median: 1033777.7778  Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
(  860000, 1200000 ]       45  97.826%  97.826% ####################
( 1200000, 1900000 ]        1   2.174% 100.000%

Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```

Reviewed By: mrambacher

Differential Revision: D28295742

Pulled By: pdillinger

fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
2021-05-11 16:17:10 -07:00

1262 lines
40 KiB
C++

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#pragma once
#include <fcntl.h>
#include <algorithm>
#include <cinttypes>
#include <map>
#include <set>
#include <string>
#include <thread>
#include <unordered_set>
#include <utility>
#include <vector>
#include "db/db_impl/db_impl.h"
#include "db/dbformat.h"
#include "file/filename.h"
#include "memtable/hash_linklist_rep.h"
#include "rocksdb/cache.h"
#include "rocksdb/compaction_filter.h"
#include "rocksdb/convenience.h"
#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/slice.h"
#include "rocksdb/sst_file_writer.h"
#include "rocksdb/statistics.h"
#include "rocksdb/table.h"
#include "rocksdb/utilities/checkpoint.h"
#include "table/mock_table.h"
#include "table/scoped_arena_iterator.h"
#include "test_util/sync_point.h"
#include "test_util/testharness.h"
#include "util/cast_util.h"
#include "util/compression.h"
#include "util/mutexlock.h"
#include "util/string_util.h"
#include "utilities/merge_operators.h"
namespace ROCKSDB_NAMESPACE {
class MockEnv;
namespace anon {
class AtomicCounter {
public:
explicit AtomicCounter(Env* env = NULL)
: env_(env), cond_count_(&mu_), count_(0) {}
void Increment() {
MutexLock l(&mu_);
count_++;
cond_count_.SignalAll();
}
int Read() {
MutexLock l(&mu_);
return count_;
}
bool WaitFor(int count) {
MutexLock l(&mu_);
uint64_t start = env_->NowMicros();
while (count_ < count) {
uint64_t now = env_->NowMicros();
cond_count_.TimedWait(now + /*1s*/ 1 * 1000 * 1000);
if (env_->NowMicros() - start > /*10s*/ 10 * 1000 * 1000) {
return false;
}
if (count_ < count) {
GTEST_LOG_(WARNING) << "WaitFor is taking more time than usual";
}
}
return true;
}
void Reset() {
MutexLock l(&mu_);
count_ = 0;
cond_count_.SignalAll();
}
private:
Env* env_;
port::Mutex mu_;
port::CondVar cond_count_;
int count_;
};
struct OptionsOverride {
std::shared_ptr<const FilterPolicy> filter_policy = nullptr;
// These will be used only if filter_policy is set
bool partition_filters = false;
uint64_t metadata_block_size = 1024;
// Used as a bit mask of individual enums in which to skip an XF test point
int skip_policy = 0;
};
} // namespace anon
enum SkipPolicy { kSkipNone = 0, kSkipNoSnapshot = 1, kSkipNoPrefix = 2 };
// A hacky skip list mem table that triggers flush after number of entries.
class SpecialMemTableRep : public MemTableRep {
public:
explicit SpecialMemTableRep(Allocator* allocator, MemTableRep* memtable,
int num_entries_flush)
: MemTableRep(allocator),
memtable_(memtable),
num_entries_flush_(num_entries_flush),
num_entries_(0) {}
virtual KeyHandle Allocate(const size_t len, char** buf) override {
return memtable_->Allocate(len, buf);
}
// Insert key into the list.
// REQUIRES: nothing that compares equal to key is currently in the list.
virtual void Insert(KeyHandle handle) override {
num_entries_++;
memtable_->Insert(handle);
}
void InsertConcurrently(KeyHandle handle) override {
num_entries_++;
memtable_->Insert(handle);
}
// Returns true iff an entry that compares equal to key is in the list.
virtual bool Contains(const char* key) const override {
return memtable_->Contains(key);
}
virtual size_t ApproximateMemoryUsage() override {
// Return a high memory usage when number of entries exceeds the threshold
// to trigger a flush.
return (num_entries_ < num_entries_flush_) ? 0 : 1024 * 1024 * 1024;
}
virtual void Get(const LookupKey& k, void* callback_args,
bool (*callback_func)(void* arg,
const char* entry)) override {
memtable_->Get(k, callback_args, callback_func);
}
uint64_t ApproximateNumEntries(const Slice& start_ikey,
const Slice& end_ikey) override {
return memtable_->ApproximateNumEntries(start_ikey, end_ikey);
}
virtual MemTableRep::Iterator* GetIterator(Arena* arena = nullptr) override {
return memtable_->GetIterator(arena);
}
virtual ~SpecialMemTableRep() override {}
private:
std::unique_ptr<MemTableRep> memtable_;
int num_entries_flush_;
int num_entries_;
};
// The factory for the hacky skip list mem table that triggers flush after
// number of entries exceeds a threshold.
class SpecialSkipListFactory : public MemTableRepFactory {
public:
// After number of inserts exceeds `num_entries_flush` in a mem table, trigger
// flush.
explicit SpecialSkipListFactory(int num_entries_flush)
: num_entries_flush_(num_entries_flush) {}
using MemTableRepFactory::CreateMemTableRep;
virtual MemTableRep* CreateMemTableRep(
const MemTableRep::KeyComparator& compare, Allocator* allocator,
const SliceTransform* transform, Logger* /*logger*/) override {
return new SpecialMemTableRep(
allocator, factory_.CreateMemTableRep(compare, allocator, transform, 0),
num_entries_flush_);
}
virtual const char* Name() const override { return "SkipListFactory"; }
bool IsInsertConcurrentlySupported() const override {
return factory_.IsInsertConcurrentlySupported();
}
private:
SkipListFactory factory_;
int num_entries_flush_;
};
// Special Env used to delay background operations
class SpecialEnv : public EnvWrapper {
public:
explicit SpecialEnv(Env* base, bool time_elapse_only_sleep = false);
Status NewWritableFile(const std::string& f, std::unique_ptr<WritableFile>* r,
const EnvOptions& soptions) override {
class SSTableFile : public WritableFile {
private:
SpecialEnv* env_;
std::unique_ptr<WritableFile> base_;
public:
SSTableFile(SpecialEnv* env, std::unique_ptr<WritableFile>&& base)
: env_(env), base_(std::move(base)) {}
Status Append(const Slice& data) override {
if (env_->table_write_callback_) {
(*env_->table_write_callback_)();
}
if (env_->drop_writes_.load(std::memory_order_acquire)) {
// Drop writes on the floor
return Status::OK();
} else if (env_->no_space_.load(std::memory_order_acquire)) {
return Status::NoSpace("No space left on device");
} else {
env_->bytes_written_ += data.size();
return base_->Append(data);
}
}
Status Append(
const Slice& data,
const DataVerificationInfo& /* verification_info */) override {
return Append(data);
}
Status PositionedAppend(const Slice& data, uint64_t offset) override {
if (env_->table_write_callback_) {
(*env_->table_write_callback_)();
}
if (env_->drop_writes_.load(std::memory_order_acquire)) {
// Drop writes on the floor
return Status::OK();
} else if (env_->no_space_.load(std::memory_order_acquire)) {
return Status::NoSpace("No space left on device");
} else {
env_->bytes_written_ += data.size();
return base_->PositionedAppend(data, offset);
}
}
Status PositionedAppend(
const Slice& data, uint64_t offset,
const DataVerificationInfo& /* verification_info */) override {
return PositionedAppend(data, offset);
}
Status Truncate(uint64_t size) override { return base_->Truncate(size); }
Status RangeSync(uint64_t offset, uint64_t nbytes) override {
Status s = base_->RangeSync(offset, nbytes);
#if !(defined NDEBUG) || !defined(OS_WIN)
TEST_SYNC_POINT_CALLBACK("SpecialEnv::SStableFile::RangeSync", &s);
#endif // !(defined NDEBUG) || !defined(OS_WIN)
return s;
}
Status Close() override {
// SyncPoint is not supported in Released Windows Mode.
#if !(defined NDEBUG) || !defined(OS_WIN)
// Check preallocation size
// preallocation size is never passed to base file.
size_t preallocation_size = preallocation_block_size();
TEST_SYNC_POINT_CALLBACK("DBTestWritableFile.GetPreallocationStatus",
&preallocation_size);
#endif // !(defined NDEBUG) || !defined(OS_WIN)
Status s = base_->Close();
#if !(defined NDEBUG) || !defined(OS_WIN)
TEST_SYNC_POINT_CALLBACK("SpecialEnv::SStableFile::Close", &s);
#endif // !(defined NDEBUG) || !defined(OS_WIN)
return s;
}
Status Flush() override { return base_->Flush(); }
Status Sync() override {
++env_->sync_counter_;
while (env_->delay_sstable_sync_.load(std::memory_order_acquire)) {
env_->SleepForMicroseconds(100000);
}
Status s;
if (!env_->skip_fsync_) {
s = base_->Sync();
}
#if !(defined NDEBUG) || !defined(OS_WIN)
TEST_SYNC_POINT_CALLBACK("SpecialEnv::SStableFile::Sync", &s);
#endif // !(defined NDEBUG) || !defined(OS_WIN)
return s;
}
void SetIOPriority(Env::IOPriority pri) override {
base_->SetIOPriority(pri);
}
Env::IOPriority GetIOPriority() override {
return base_->GetIOPriority();
}
bool use_direct_io() const override {
return base_->use_direct_io();
}
Status Allocate(uint64_t offset, uint64_t len) override {
return base_->Allocate(offset, len);
}
};
class ManifestFile : public WritableFile {
public:
ManifestFile(SpecialEnv* env, std::unique_ptr<WritableFile>&& b)
: env_(env), base_(std::move(b)) {}
Status Append(const Slice& data) override {
if (env_->manifest_write_error_.load(std::memory_order_acquire)) {
return Status::IOError("simulated writer error");
} else {
return base_->Append(data);
}
}
Status Append(
const Slice& data,
const DataVerificationInfo& /*verification_info*/) override {
return Append(data);
}
Status Truncate(uint64_t size) override { return base_->Truncate(size); }
Status Close() override { return base_->Close(); }
Status Flush() override { return base_->Flush(); }
Status Sync() override {
++env_->sync_counter_;
if (env_->manifest_sync_error_.load(std::memory_order_acquire)) {
return Status::IOError("simulated sync error");
} else {
if (env_->skip_fsync_) {
return Status::OK();
} else {
return base_->Sync();
}
}
}
uint64_t GetFileSize() override { return base_->GetFileSize(); }
Status Allocate(uint64_t offset, uint64_t len) override {
return base_->Allocate(offset, len);
}
private:
SpecialEnv* env_;
std::unique_ptr<WritableFile> base_;
};
class WalFile : public WritableFile {
public:
WalFile(SpecialEnv* env, std::unique_ptr<WritableFile>&& b)
: env_(env), base_(std::move(b)) {
env_->num_open_wal_file_.fetch_add(1);
}
virtual ~WalFile() { env_->num_open_wal_file_.fetch_add(-1); }
Status Append(const Slice& data) override {
#if !(defined NDEBUG) || !defined(OS_WIN)
TEST_SYNC_POINT("SpecialEnv::WalFile::Append:1");
#endif
Status s;
if (env_->log_write_error_.load(std::memory_order_acquire)) {
s = Status::IOError("simulated writer error");
} else {
int slowdown =
env_->log_write_slowdown_.load(std::memory_order_acquire);
if (slowdown > 0) {
env_->SleepForMicroseconds(slowdown);
}
s = base_->Append(data);
}
#if !(defined NDEBUG) || !defined(OS_WIN)
TEST_SYNC_POINT("SpecialEnv::WalFile::Append:2");
#endif
return s;
}
Status Append(
const Slice& data,
const DataVerificationInfo& /* verification_info */) override {
return Append(data);
}
Status Truncate(uint64_t size) override { return base_->Truncate(size); }
void PrepareWrite(size_t offset, size_t len) override {
base_->PrepareWrite(offset, len);
}
void SetPreallocationBlockSize(size_t size) override {
base_->SetPreallocationBlockSize(size);
}
Status Close() override {
// SyncPoint is not supported in Released Windows Mode.
#if !(defined NDEBUG) || !defined(OS_WIN)
// Check preallocation size
size_t block_size, last_allocated_block;
base_->GetPreallocationStatus(&block_size, &last_allocated_block);
TEST_SYNC_POINT_CALLBACK("DBTestWalFile.GetPreallocationStatus",
&block_size);
#endif // !(defined NDEBUG) || !defined(OS_WIN)
return base_->Close();
}
Status Flush() override { return base_->Flush(); }
Status Sync() override {
++env_->sync_counter_;
if (env_->corrupt_in_sync_) {
Append(std::string(33000, ' '));
return Status::IOError("Ingested Sync Failure");
}
if (env_->skip_fsync_) {
return Status::OK();
} else {
return base_->Sync();
}
}
bool IsSyncThreadSafe() const override {
return env_->is_wal_sync_thread_safe_.load();
}
Status Allocate(uint64_t offset, uint64_t len) override {
return base_->Allocate(offset, len);
}
private:
SpecialEnv* env_;
std::unique_ptr<WritableFile> base_;
};
class OtherFile : public WritableFile {
public:
OtherFile(SpecialEnv* env, std::unique_ptr<WritableFile>&& b)
: env_(env), base_(std::move(b)) {}
Status Append(const Slice& data) override { return base_->Append(data); }
Status Append(
const Slice& data,
const DataVerificationInfo& /*verification_info*/) override {
return Append(data);
}
Status Truncate(uint64_t size) override { return base_->Truncate(size); }
Status Close() override { return base_->Close(); }
Status Flush() override { return base_->Flush(); }
Status Sync() override {
if (env_->skip_fsync_) {
return Status::OK();
} else {
return base_->Sync();
}
}
uint64_t GetFileSize() override { return base_->GetFileSize(); }
Status Allocate(uint64_t offset, uint64_t len) override {
return base_->Allocate(offset, len);
}
private:
SpecialEnv* env_;
std::unique_ptr<WritableFile> base_;
};
if (no_file_overwrite_.load(std::memory_order_acquire) &&
target()->FileExists(f).ok()) {
return Status::NotSupported("SpecialEnv::no_file_overwrite_ is true.");
}
if (non_writeable_rate_.load(std::memory_order_acquire) > 0) {
uint32_t random_number;
{
MutexLock l(&rnd_mutex_);
random_number = rnd_.Uniform(100);
}
if (random_number < non_writeable_rate_.load()) {
return Status::IOError("simulated random write error");
}
}
new_writable_count_++;
if (non_writable_count_.load() > 0) {
non_writable_count_--;
return Status::IOError("simulated write error");
}
EnvOptions optimized = soptions;
if (strstr(f.c_str(), "MANIFEST") != nullptr ||
strstr(f.c_str(), "log") != nullptr) {
optimized.use_mmap_writes = false;
optimized.use_direct_writes = false;
}
Status s = target()->NewWritableFile(f, r, optimized);
if (s.ok()) {
if (strstr(f.c_str(), ".sst") != nullptr) {
r->reset(new SSTableFile(this, std::move(*r)));
} else if (strstr(f.c_str(), "MANIFEST") != nullptr) {
r->reset(new ManifestFile(this, std::move(*r)));
} else if (strstr(f.c_str(), "log") != nullptr) {
r->reset(new WalFile(this, std::move(*r)));
} else {
r->reset(new OtherFile(this, std::move(*r)));
}
}
return s;
}
Status NewRandomAccessFile(const std::string& f,
std::unique_ptr<RandomAccessFile>* r,
const EnvOptions& soptions) override {
class CountingFile : public RandomAccessFile {
public:
CountingFile(std::unique_ptr<RandomAccessFile>&& target,
anon::AtomicCounter* counter,
std::atomic<size_t>* bytes_read)
: target_(std::move(target)),
counter_(counter),
bytes_read_(bytes_read) {}
virtual Status Read(uint64_t offset, size_t n, Slice* result,
char* scratch) const override {
counter_->Increment();
Status s = target_->Read(offset, n, result, scratch);
*bytes_read_ += result->size();
return s;
}
virtual Status Prefetch(uint64_t offset, size_t n) override {
Status s = target_->Prefetch(offset, n);
*bytes_read_ += n;
return s;
}
private:
std::unique_ptr<RandomAccessFile> target_;
anon::AtomicCounter* counter_;
std::atomic<size_t>* bytes_read_;
};
class RandomFailureFile : public RandomAccessFile {
public:
RandomFailureFile(std::unique_ptr<RandomAccessFile>&& target,
std::atomic<uint64_t>* failure_cnt, uint32_t fail_odd)
: target_(std::move(target)),
fail_cnt_(failure_cnt),
fail_odd_(fail_odd) {}
virtual Status Read(uint64_t offset, size_t n, Slice* result,
char* scratch) const override {
if (Random::GetTLSInstance()->OneIn(fail_odd_)) {
fail_cnt_->fetch_add(1);
return Status::IOError("random error");
}
return target_->Read(offset, n, result, scratch);
}
virtual Status Prefetch(uint64_t offset, size_t n) override {
return target_->Prefetch(offset, n);
}
private:
std::unique_ptr<RandomAccessFile> target_;
std::atomic<uint64_t>* fail_cnt_;
uint32_t fail_odd_;
};
Status s = target()->NewRandomAccessFile(f, r, soptions);
random_file_open_counter_++;
if (s.ok()) {
if (count_random_reads_) {
r->reset(new CountingFile(std::move(*r), &random_read_counter_,
&random_read_bytes_counter_));
} else if (rand_reads_fail_odd_ > 0) {
r->reset(new RandomFailureFile(std::move(*r), &num_reads_fails_,
rand_reads_fail_odd_));
}
}
if (s.ok() && soptions.compaction_readahead_size > 0) {
compaction_readahead_size_ = soptions.compaction_readahead_size;
}
return s;
}
virtual Status NewSequentialFile(const std::string& f,
std::unique_ptr<SequentialFile>* r,
const EnvOptions& soptions) override {
class CountingFile : public SequentialFile {
public:
CountingFile(std::unique_ptr<SequentialFile>&& target,
anon::AtomicCounter* counter)
: target_(std::move(target)), counter_(counter) {}
virtual Status Read(size_t n, Slice* result, char* scratch) override {
counter_->Increment();
return target_->Read(n, result, scratch);
}
virtual Status Skip(uint64_t n) override { return target_->Skip(n); }
private:
std::unique_ptr<SequentialFile> target_;
anon::AtomicCounter* counter_;
};
Status s = target()->NewSequentialFile(f, r, soptions);
if (s.ok() && count_sequential_reads_) {
r->reset(new CountingFile(std::move(*r), &sequential_read_counter_));
}
return s;
}
virtual void SleepForMicroseconds(int micros) override {
sleep_counter_.Increment();
if (no_slowdown_ || time_elapse_only_sleep_) {
addon_microseconds_.fetch_add(micros);
}
if (!no_slowdown_) {
target()->SleepForMicroseconds(micros);
}
}
void MockSleepForMicroseconds(int64_t micros) {
sleep_counter_.Increment();
assert(no_slowdown_);
addon_microseconds_.fetch_add(micros);
}
void MockSleepForSeconds(int64_t seconds) {
sleep_counter_.Increment();
assert(no_slowdown_);
addon_microseconds_.fetch_add(seconds * 1000000);
}
virtual Status GetCurrentTime(int64_t* unix_time) override {
Status s;
if (time_elapse_only_sleep_) {
*unix_time = maybe_starting_time_;
} else {
s = target()->GetCurrentTime(unix_time);
}
if (s.ok()) {
// mock microseconds elapsed to seconds of time
*unix_time += addon_microseconds_.load() / 1000000;
}
return s;
}
virtual uint64_t NowCPUNanos() override {
now_cpu_count_.fetch_add(1);
return target()->NowCPUNanos();
}
virtual uint64_t NowNanos() override {
return (time_elapse_only_sleep_ ? 0 : target()->NowNanos()) +
addon_microseconds_.load() * 1000;
}
virtual uint64_t NowMicros() override {
return (time_elapse_only_sleep_ ? 0 : target()->NowMicros()) +
addon_microseconds_.load();
}
virtual Status DeleteFile(const std::string& fname) override {
delete_count_.fetch_add(1);
return target()->DeleteFile(fname);
}
void SetMockSleep(bool enabled = true) { no_slowdown_ = enabled; }
Status NewDirectory(const std::string& name,
std::unique_ptr<Directory>* result) override {
if (!skip_fsync_) {
return target()->NewDirectory(name, result);
} else {
class NoopDirectory : public Directory {
public:
NoopDirectory() {}
~NoopDirectory() {}
Status Fsync() override { return Status::OK(); }
};
result->reset(new NoopDirectory());
return Status::OK();
}
}
// Something to return when mocking current time
const int64_t maybe_starting_time_;
Random rnd_;
port::Mutex rnd_mutex_; // Lock to pretect rnd_
// sstable Sync() calls are blocked while this pointer is non-nullptr.
std::atomic<bool> delay_sstable_sync_;
// Drop writes on the floor while this pointer is non-nullptr.
std::atomic<bool> drop_writes_;
// Simulate no-space errors while this pointer is non-nullptr.
std::atomic<bool> no_space_;
// Simulate non-writable file system while this pointer is non-nullptr
std::atomic<bool> non_writable_;
// Force sync of manifest files to fail while this pointer is non-nullptr
std::atomic<bool> manifest_sync_error_;
// Force write to manifest files to fail while this pointer is non-nullptr
std::atomic<bool> manifest_write_error_;
// Force write to log files to fail while this pointer is non-nullptr
std::atomic<bool> log_write_error_;
// Slow down every log write, in micro-seconds.
std::atomic<int> log_write_slowdown_;
// If true, returns Status::NotSupported for file overwrite.
std::atomic<bool> no_file_overwrite_;
// Number of WAL files that are still open for write.
std::atomic<int> num_open_wal_file_;
bool count_random_reads_;
uint32_t rand_reads_fail_odd_ = 0;
std::atomic<uint64_t> num_reads_fails_;
anon::AtomicCounter random_read_counter_;
std::atomic<size_t> random_read_bytes_counter_;
std::atomic<int> random_file_open_counter_;
bool count_sequential_reads_;
anon::AtomicCounter sequential_read_counter_;
anon::AtomicCounter sleep_counter_;
std::atomic<int64_t> bytes_written_;
std::atomic<int> sync_counter_;
// If true, all fsync to files and directories are skipped.
bool skip_fsync_ = false;
// If true, ingest the corruption to file during sync.
bool corrupt_in_sync_ = false;
std::atomic<uint32_t> non_writeable_rate_;
std::atomic<uint32_t> new_writable_count_;
std::atomic<uint32_t> non_writable_count_;
std::function<void()>* table_write_callback_;
std::atomic<int> now_cpu_count_;
std::atomic<int> delete_count_;
std::atomic<bool> is_wal_sync_thread_safe_{true};
std::atomic<size_t> compaction_readahead_size_{};
private: // accessing these directly is prone to error
friend class DBTestBase;
std::atomic<int64_t> addon_microseconds_{0};
// Do not modify in the env of a running DB (could cause deadlock)
std::atomic<bool> time_elapse_only_sleep_;
bool no_slowdown_;
};
#ifndef ROCKSDB_LITE
class OnFileDeletionListener : public EventListener {
public:
OnFileDeletionListener() : matched_count_(0), expected_file_name_("") {}
void SetExpectedFileName(const std::string file_name) {
expected_file_name_ = file_name;
}
void VerifyMatchedCount(size_t expected_value) {
ASSERT_EQ(matched_count_, expected_value);
}
void OnTableFileDeleted(const TableFileDeletionInfo& info) override {
if (expected_file_name_ != "") {
ASSERT_EQ(expected_file_name_, info.file_path);
expected_file_name_ = "";
matched_count_++;
}
}
private:
size_t matched_count_;
std::string expected_file_name_;
};
class FlushCounterListener : public EventListener {
public:
std::atomic<int> count{0};
std::atomic<FlushReason> expected_flush_reason{FlushReason::kOthers};
void OnFlushBegin(DB* /*db*/, const FlushJobInfo& flush_job_info) override {
count++;
ASSERT_EQ(expected_flush_reason.load(), flush_job_info.flush_reason);
}
};
#endif
// A test merge operator mimics put but also fails if one of merge operands is
// "corrupted".
class TestPutOperator : public MergeOperator {
public:
virtual bool FullMergeV2(const MergeOperationInput& merge_in,
MergeOperationOutput* merge_out) const override {
if (merge_in.existing_value != nullptr &&
*(merge_in.existing_value) == "corrupted") {
return false;
}
for (auto value : merge_in.operand_list) {
if (value == "corrupted") {
return false;
}
}
merge_out->existing_operand = merge_in.operand_list.back();
return true;
}
virtual const char* Name() const override { return "TestPutOperator"; }
};
// A wrapper around Cache that can easily be extended with instrumentation,
// etc.
class CacheWrapper : public Cache {
public:
explicit CacheWrapper(std::shared_ptr<Cache> target)
: target_(std::move(target)) {}
const char* Name() const override { return target_->Name(); }
Status Insert(const Slice& key, void* value, size_t charge,
void (*deleter)(const Slice& key, void* value),
Handle** handle = nullptr,
Priority priority = Priority::LOW) override {
return target_->Insert(key, value, charge, deleter, handle, priority);
}
Handle* Lookup(const Slice& key, Statistics* stats = nullptr) override {
return target_->Lookup(key, stats);
}
bool Ref(Handle* handle) override { return target_->Ref(handle); }
bool Release(Handle* handle, bool force_erase = false) override {
return target_->Release(handle, force_erase);
}
void* Value(Handle* handle) override { return target_->Value(handle); }
void Erase(const Slice& key) override { target_->Erase(key); }
uint64_t NewId() override { return target_->NewId(); }
void SetCapacity(size_t capacity) override { target_->SetCapacity(capacity); }
void SetStrictCapacityLimit(bool strict_capacity_limit) override {
target_->SetStrictCapacityLimit(strict_capacity_limit);
}
bool HasStrictCapacityLimit() const override {
return target_->HasStrictCapacityLimit();
}
size_t GetCapacity() const override { return target_->GetCapacity(); }
size_t GetUsage() const override { return target_->GetUsage(); }
size_t GetUsage(Handle* handle) const override {
return target_->GetUsage(handle);
}
size_t GetPinnedUsage() const override { return target_->GetPinnedUsage(); }
size_t GetCharge(Handle* handle) const override {
return target_->GetCharge(handle);
}
void ApplyToAllCacheEntries(void (*callback)(void*, size_t),
bool thread_safe) override {
target_->ApplyToAllCacheEntries(callback, thread_safe);
}
void ApplyToAllEntries(
const std::function<void(const Slice& key, void* value, size_t charge,
DeleterFn deleter)>& callback,
const ApplyToAllEntriesOptions& opts) override {
target_->ApplyToAllEntries(callback, opts);
}
void EraseUnRefEntries() override { target_->EraseUnRefEntries(); }
protected:
std::shared_ptr<Cache> target_;
};
class DBTestBase : public testing::Test {
public:
// Sequence of option configurations to try
enum OptionConfig : int {
kDefault = 0,
kBlockBasedTableWithPrefixHashIndex = 1,
kBlockBasedTableWithWholeKeyHashIndex = 2,
kPlainTableFirstBytePrefix = 3,
kPlainTableCappedPrefix = 4,
kPlainTableCappedPrefixNonMmap = 5,
kPlainTableAllBytesPrefix = 6,
kVectorRep = 7,
kHashLinkList = 8,
kMergePut = 9,
kFilter = 10,
kFullFilterWithNewTableReaderForCompactions = 11,
kUncompressed = 12,
kNumLevel_3 = 13,
kDBLogDir = 14,
kWalDirAndMmapReads = 15,
kManifestFileSize = 16,
kPerfOptions = 17,
kHashSkipList = 18,
kUniversalCompaction = 19,
kUniversalCompactionMultiLevel = 20,
kCompressedBlockCache = 21,
kInfiniteMaxOpenFiles = 22,
kxxHashChecksum = 23,
kFIFOCompaction = 24,
kOptimizeFiltersForHits = 25,
kRowCache = 26,
kRecycleLogFiles = 27,
kConcurrentSkipList = 28,
kPipelinedWrite = 29,
kConcurrentWALWrites = 30,
kDirectIO,
kLevelSubcompactions,
kBlockBasedTableWithIndexRestartInterval,
kBlockBasedTableWithPartitionedIndex,
kBlockBasedTableWithPartitionedIndexFormat4,
kPartitionedFilterWithNewTableReaderForCompactions,
kUniversalSubcompactions,
kxxHash64Checksum,
kUnorderedWrite,
// This must be the last line
kEnd,
};
public:
std::string dbname_;
std::string alternative_wal_dir_;
std::string alternative_db_log_dir_;
MockEnv* mem_env_;
Env* encrypted_env_;
SpecialEnv* env_;
std::shared_ptr<Env> env_guard_;
DB* db_;
std::vector<ColumnFamilyHandle*> handles_;
int option_config_;
Options last_options_;
// Skip some options, as they may not be applicable to a specific test.
// To add more skip constants, use values 4, 8, 16, etc.
enum OptionSkip {
kNoSkip = 0,
kSkipDeletesFilterFirst = 1,
kSkipUniversalCompaction = 2,
kSkipMergePut = 4,
kSkipPlainTable = 8,
kSkipHashIndex = 16,
kSkipNoSeekToLast = 32,
kSkipFIFOCompaction = 128,
kSkipMmapReads = 256,
};
const int kRangeDelSkipConfigs =
// Plain tables do not support range deletions.
kSkipPlainTable |
// MmapReads disables the iterator pinning that RangeDelAggregator
// requires.
kSkipMmapReads;
// `env_do_fsync` decides whether the special Env would do real
// fsync for files and directories. Skipping fsync can speed up
// tests, but won't cover the exact fsync logic.
DBTestBase(const std::string path, bool env_do_fsync);
~DBTestBase();
static std::string Key(int i) {
char buf[100];
snprintf(buf, sizeof(buf), "key%06d", i);
return std::string(buf);
}
static bool ShouldSkipOptions(int option_config, int skip_mask = kNoSkip);
// Switch to a fresh database with the next option configuration to
// test. Return false if there are no more configurations to test.
bool ChangeOptions(int skip_mask = kNoSkip);
// Switch between different compaction styles.
bool ChangeCompactOptions();
// Switch between different WAL-realted options.
bool ChangeWalOptions();
// Switch between different filter policy
// Jump from kDefault to kFilter to kFullFilter
bool ChangeFilterOptions();
// Switch between different DB options for file ingestion tests.
bool ChangeOptionsForFileIngestionTest();
// Return the current option configuration.
Options CurrentOptions(const anon::OptionsOverride& options_override =
anon::OptionsOverride()) const;
Options CurrentOptions(const Options& default_options,
const anon::OptionsOverride& options_override =
anon::OptionsOverride()) const;
Options GetDefaultOptions() const;
Options GetOptions(int option_config) const {
return GetOptions(option_config, GetDefaultOptions());
}
Options GetOptions(int option_config, const Options& default_options,
const anon::OptionsOverride& options_override =
anon::OptionsOverride()) const;
DBImpl* dbfull() { return static_cast_with_check<DBImpl>(db_); }
void CreateColumnFamilies(const std::vector<std::string>& cfs,
const Options& options);
void CreateAndReopenWithCF(const std::vector<std::string>& cfs,
const Options& options);
void ReopenWithColumnFamilies(const std::vector<std::string>& cfs,
const std::vector<Options>& options);
void ReopenWithColumnFamilies(const std::vector<std::string>& cfs,
const Options& options);
Status TryReopenWithColumnFamilies(const std::vector<std::string>& cfs,
const std::vector<Options>& options);
Status TryReopenWithColumnFamilies(const std::vector<std::string>& cfs,
const Options& options);
void Reopen(const Options& options);
void Close();
void DestroyAndReopen(const Options& options);
void Destroy(const Options& options, bool delete_cf_paths = false);
Status ReadOnlyReopen(const Options& options);
Status TryReopen(const Options& options);
bool IsDirectIOSupported();
bool IsMemoryMappedAccessSupported() const;
Status Flush(int cf = 0);
Status Flush(const std::vector<int>& cf_ids);
Status Put(const Slice& k, const Slice& v, WriteOptions wo = WriteOptions());
Status Put(int cf, const Slice& k, const Slice& v,
WriteOptions wo = WriteOptions());
Status Merge(const Slice& k, const Slice& v,
WriteOptions wo = WriteOptions());
Status Merge(int cf, const Slice& k, const Slice& v,
WriteOptions wo = WriteOptions());
Status Delete(const std::string& k);
Status Delete(int cf, const std::string& k);
Status SingleDelete(const std::string& k);
Status SingleDelete(int cf, const std::string& k);
bool SetPreserveDeletesSequenceNumber(SequenceNumber sn);
std::string Get(const std::string& k, const Snapshot* snapshot = nullptr);
std::string Get(int cf, const std::string& k,
const Snapshot* snapshot = nullptr);
Status Get(const std::string& k, PinnableSlice* v);
std::vector<std::string> MultiGet(std::vector<int> cfs,
const std::vector<std::string>& k,
const Snapshot* snapshot,
const bool batched);
std::vector<std::string> MultiGet(const std::vector<std::string>& k,
const Snapshot* snapshot = nullptr);
uint64_t GetNumSnapshots();
uint64_t GetTimeOldestSnapshots();
uint64_t GetSequenceOldestSnapshots();
// Return a string that contains all key,value pairs in order,
// formatted like "(k1->v1)(k2->v2)".
std::string Contents(int cf = 0);
std::string AllEntriesFor(const Slice& user_key, int cf = 0);
#ifndef ROCKSDB_LITE
int NumSortedRuns(int cf = 0);
uint64_t TotalSize(int cf = 0);
uint64_t SizeAtLevel(int level);
size_t TotalLiveFiles(int cf = 0);
size_t CountLiveFiles();
int NumTableFilesAtLevel(int level, int cf = 0);
double CompressionRatioAtLevel(int level, int cf = 0);
int TotalTableFiles(int cf = 0, int levels = -1);
#endif // ROCKSDB_LITE
std::vector<uint64_t> GetBlobFileNumbers();
// Return spread of files per level
std::string FilesPerLevel(int cf = 0);
size_t CountFiles();
Status CountFiles(size_t* count);
Status Size(const Slice& start, const Slice& limit, uint64_t* size) {
return Size(start, limit, 0, size);
}
Status Size(const Slice& start, const Slice& limit, int cf, uint64_t* size);
void Compact(int cf, const Slice& start, const Slice& limit,
uint32_t target_path_id);
void Compact(int cf, const Slice& start, const Slice& limit);
void Compact(const Slice& start, const Slice& limit);
// Do n memtable compactions, each of which produces an sstable
// covering the range [small,large].
void MakeTables(int n, const std::string& small, const std::string& large,
int cf = 0);
// Prevent pushing of new sstables into deeper levels by adding
// tables that cover a specified range to all levels.
void FillLevels(const std::string& smallest, const std::string& largest,
int cf);
void MoveFilesToLevel(int level, int cf = 0);
#ifndef ROCKSDB_LITE
void DumpFileCounts(const char* label);
#endif // ROCKSDB_LITE
std::string DumpSSTableList();
static void GetSstFiles(Env* env, std::string path,
std::vector<std::string>* files);
int GetSstFileCount(std::string path);
// this will generate non-overlapping files since it keeps increasing key_idx
void GenerateNewFile(Random* rnd, int* key_idx, bool nowait = false);
void GenerateNewFile(int fd, Random* rnd, int* key_idx, bool nowait = false);
static const int kNumKeysByGenerateNewRandomFile;
static const int KNumKeysByGenerateNewFile = 100;
void GenerateNewRandomFile(Random* rnd, bool nowait = false);
std::string IterStatus(Iterator* iter);
Options OptionsForLogIterTest();
std::string DummyString(size_t len, char c = 'a');
void VerifyIterLast(std::string expected_key, int cf = 0);
// Used to test InplaceUpdate
// If previous value is nullptr or delta is > than previous value,
// sets newValue with delta
// If previous value is not empty,
// updates previous value with 'b' string of previous value size - 1.
static UpdateStatus updateInPlaceSmallerSize(char* prevValue,
uint32_t* prevSize, Slice delta,
std::string* newValue);
static UpdateStatus updateInPlaceSmallerVarintSize(char* prevValue,
uint32_t* prevSize,
Slice delta,
std::string* newValue);
static UpdateStatus updateInPlaceLargerSize(char* prevValue,
uint32_t* prevSize, Slice delta,
std::string* newValue);
static UpdateStatus updateInPlaceNoAction(char* prevValue, uint32_t* prevSize,
Slice delta, std::string* newValue);
// Utility method to test InplaceUpdate
void validateNumberOfEntries(int numValues, int cf = 0);
void CopyFile(const std::string& source, const std::string& destination,
uint64_t size = 0);
Status GetAllDataFiles(const FileType file_type,
std::unordered_map<std::string, uint64_t>* sst_files,
uint64_t* total_size = nullptr);
std::vector<std::uint64_t> ListTableFiles(Env* env, const std::string& path);
void VerifyDBFromMap(
std::map<std::string, std::string> true_data,
size_t* total_reads_res = nullptr, bool tailing_iter = false,
std::map<std::string, Status> status = std::map<std::string, Status>());
void VerifyDBInternal(
std::vector<std::pair<std::string, std::string>> true_data);
#ifndef ROCKSDB_LITE
uint64_t GetNumberOfSstFilesForColumnFamily(DB* db,
std::string column_family_name);
#endif // ROCKSDB_LITE
uint64_t TestGetTickerCount(const Options& options, Tickers ticker_type) {
return options.statistics->getTickerCount(ticker_type);
}
uint64_t TestGetAndResetTickerCount(const Options& options,
Tickers ticker_type) {
return options.statistics->getAndResetTickerCount(ticker_type);
}
// Note: reverting this setting within the same test run is not yet
// supported
void SetTimeElapseOnlySleepOnReopen(DBOptions* options);
private: // Prone to error on direct use
void MaybeInstallTimeElapseOnlySleep(const DBOptions& options);
bool time_elapse_only_sleep_on_reopen_ = false;
};
} // namespace ROCKSDB_NAMESPACE