rocksdb/cache/cache_entry_stats.h
Peter Dillinger df5dc73bec Don't hold DB mutex for block cache entry stat scans (#8538)
Summary:
I previously didn't notice that the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions: reads require the block cache and are
traditionally latency-sensitive.

This change does some refactoring to avoid holding the DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests had to be updated because the stats collector is now
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)

This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.
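
For illustration, the intended pattern is roughly the following (a
hedged sketch with simplified names, not the exact diff;
GetPropertyHandleString stands in for the real DBImpl string-property
plumbing):

    if (property_info->need_out_of_mutex) {
      // e.g. block cache entry stats: may trigger a long cache scan,
      // so serve it without taking the DB mutex
      return GetPropertyHandleString(property_info, property, value);
    } else {
      // short, mutex-protected reads keep the old behavior
      InstrumentedMutexLock l(&mutex_);
      return GetPropertyHandleString(property_info, property, value);
    }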

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538

Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.
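
A minimal sketch of that logic (hypothetical test body, assuming a
db_mutex_held flag maintained by the test around mutex acquisition; the
real test wiring differs), using the sync point visible in
cache_entry_stats.h:

    std::atomic<bool> db_mutex_held{false};  // set/cleared around mutex use
    SyncPoint::GetInstance()->SetCallBack(
        "CacheEntryStatsCollector::GetStats:AfterApplyToAllEntries",
        [&](void* /*arg*/) { ASSERT_FALSE(db_mutex_held.load()); });
    SyncPoint::GetInstance()->EnableProcessing();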

Performance test - the known impact of holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.
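
The injected delay (a reconstruction of the benchmark hack, not
committed code) amounts to sleeping just before the scan in
CollectStats:

    // benchmarking hack only: make every scan take ~5 seconds
    clock_->SleepForMicroseconds(5 * 1000 * 1000);
    cache_->ApplyToAllEntries(working_stats_.GetEntryCallback(), {});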

Before (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     433.219 micros/op 2308 ops/sec;    0.1 MB/s ( transactions:78999 aborts:0)
    rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     448.802 micros/op 2228 ops/sec;    0.1 MB/s ( transactions:75999 aborts:0)
    rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323

Notice the 5s P100 write time.

After (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     303.645 micros/op 3293 ops/sec;    0.1 MB/s ( transactions:98999 aborts:0)
    rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     310.383 micros/op 3221 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918

P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     311.365 micros/op 3211 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     308.395 micros/op 3242 ops/sec;    0.1 MB/s ( transactions:97999 aborts:0)
    rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832

No substantial difference.

Reviewed By: siying

Differential Revision: D29738847

Pulled By: pdillinger

fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 14:13:08 -07:00

// Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
#pragma once

#include <array>
#include <cstdint>
#include <memory>
#include <mutex>

#include "cache/cache_helpers.h"
#include "port/lang.h"
#include "rocksdb/cache.h"
#include "rocksdb/status.h"
#include "rocksdb/system_clock.h"
#include "test_util/sync_point.h"
#include "util/coding_lean.h"

namespace ROCKSDB_NAMESPACE {

// A generic helper object for gathering stats about cache entries by
// iterating over them with ApplyToAllEntries. This class essentially
// solves the problem of slowing down a Cache with too many stats
// collectors that could be sharing stat results, such as from multiple
// column families or multiple DBs sharing a Cache. We employ a few
// mitigations:
// * Only one collector for a particular kind of Stats is alive
//   for each Cache. This is guaranteed using the Cache itself to hold
//   the collector.
// * A mutex ensures only one thread is gathering stats for this
//   collector.
// * The most recent gathered stats are saved and simply copied to
//   satisfy requests within a time window (default: 3 minutes) of
//   completion of the most recent stat gathering.
//
// Template parameter Stats must be copyable and trivially constructible,
// as well as...
// concept Stats {
//   // Notification before applying callback to all entries
//   void BeginCollection(Cache*, SystemClock*, uint64_t start_time_micros);
//   // Get the callback to apply to all entries. `callback`
//   // type must be compatible with Cache::ApplyToAllEntries
//   callback GetEntryCallback();
//   // Notification after applying callback to all entries
//   void EndCollection(Cache*, SystemClock*, uint64_t end_time_micros);
//   // Notification that a collection was skipped because of
//   // sufficiently recent saved results.
//   void SkippedCollection();
// }
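//
// As an illustration only (hypothetical example, not part of RocksDB),
// a minimal Stats type satisfying this concept could just count entries:
//
//   struct EntryCountStats {
//     uint64_t entry_count = 0;
//     void BeginCollection(Cache*, SystemClock*, uint64_t) {
//       entry_count = 0;
//     }
//     auto GetEntryCallback() {
//       // signature compatible with Cache::ApplyToAllEntries
//       return [this](const Slice& /*key*/, void* /*value*/,
//                     size_t /*charge*/, Cache::DeleterFn /*deleter*/) {
//         ++entry_count;
//       };
//     }
//     void EndCollection(Cache*, SystemClock*, uint64_t) {}
//     void SkippedCollection() {}
//   };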
template <class Stats>
class CacheEntryStatsCollector {
 public:
  // Gather and save stats if saved stats are too old. (Use GetStats() to
  // read saved stats.)
  //
  // Maximum allowed age for a "hit" on saved results is determined by the
  // two interval parameters. Both set to 0 forces a re-scan. For example
  // with min_interval_seconds=300 and min_interval_factor=100, if the last
  // scan took 10s, we would only rescan ("miss") if the age in seconds of
  // the saved results is > max(300, 100*10).
  // Justification: scans can vary wildly in duration, e.g. from 0.02 sec
  // to as much as 20 seconds, so we want to be able to cap the absolute
  // and relative frequency of scans.
  void CollectStats(int min_interval_seconds, int min_interval_factor) {
    // Waits for any pending reader or writer (collector)
    std::lock_guard<std::mutex> lock(working_mutex_);

    uint64_t max_age_micros =
        static_cast<uint64_t>(std::max(min_interval_seconds, 0)) * 1000000U;

    if (last_end_time_micros_ > last_start_time_micros_ &&
        min_interval_factor > 0) {
      max_age_micros = std::max(
          max_age_micros, min_interval_factor * (last_end_time_micros_ -
                                                 last_start_time_micros_));
    }

    uint64_t start_time_micros = clock_->NowMicros();
    if ((start_time_micros - last_end_time_micros_) > max_age_micros) {
      last_start_time_micros_ = start_time_micros;
      working_stats_.BeginCollection(cache_, clock_, start_time_micros);

      cache_->ApplyToAllEntries(working_stats_.GetEntryCallback(), {});
      TEST_SYNC_POINT_CALLBACK(
          "CacheEntryStatsCollector::GetStats:AfterApplyToAllEntries",
          nullptr);

      uint64_t end_time_micros = clock_->NowMicros();
      last_end_time_micros_ = end_time_micros;
      working_stats_.EndCollection(cache_, clock_, end_time_micros);
    } else {
      working_stats_.SkippedCollection();
    }

    // Save so that we don't need to wait for an outstanding collection in
    // order to make a copy of the last saved stats
    std::lock_guard<std::mutex> lock2(saved_mutex_);
    saved_stats_ = working_stats_;
  }

  // Gets saved stats, regardless of age
  void GetStats(Stats *stats) {
    std::lock_guard<std::mutex> lock(saved_mutex_);
    *stats = saved_stats_;
  }

  Cache *GetCache() const { return cache_; }

  // Gets or creates a shared instance of CacheEntryStatsCollector in the
  // cache itself, and saves into `ptr`. This shared_ptr will hold the
  // entry in cache until all refs are destroyed.
  static Status GetShared(Cache *cache, SystemClock *clock,
                          std::shared_ptr<CacheEntryStatsCollector> *ptr) {
    std::array<uint64_t, 3> cache_key_data{
        {// First 16 bytes == md5 of class name
         0x7eba5a8fb5437c90U, 0x8ca68c9b11655855U,
         // Last 8 bytes based on a function pointer to make unique for each
         // template instantiation
         reinterpret_cast<uint64_t>(&CacheEntryStatsCollector::GetShared)}};
    Slice cache_key = GetSlice(&cache_key_data);

    Cache::Handle *h = cache->Lookup(cache_key);
    if (h == nullptr) {
      // Not yet in cache, but Cache doesn't provide a built-in way to
      // avoid racing insert. So we double-check under a shared mutex,
      // inspired by TableCache.
      STATIC_AVOID_DESTRUCTION(std::mutex, static_mutex);
      std::lock_guard<std::mutex> lock(static_mutex);

      h = cache->Lookup(cache_key);
      if (h == nullptr) {
        auto new_ptr = new CacheEntryStatsCollector(cache, clock);
        // TODO: non-zero charge causes some tests that count block cache
        // usage to go flaky. Fix the problem somehow so we can use an
        // accurate charge.
        size_t charge = 0;
        Status s = cache->Insert(cache_key, new_ptr, charge, Deleter, &h,
                                 Cache::Priority::HIGH);
        if (!s.ok()) {
          assert(h == nullptr);
          delete new_ptr;
          return s;
        }
      }
    }
    // If we reach here, shared entry is in cache with handle `h`.
    assert(cache->GetDeleter(h) == Deleter);
    // Build an aliasing shared_ptr that keeps `ptr` in cache while there
    // are references.
    *ptr = MakeSharedCacheHandleGuard<CacheEntryStatsCollector>(cache, h);
    return Status::OK();
  }
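
  // Illustrative use (hypothetical call site; real callers live in
  // RocksDB's internal stats code), assuming some Stats type MyStats:
  //
  //   std::shared_ptr<CacheEntryStatsCollector<MyStats>> collector;
  //   Status s = CacheEntryStatsCollector<MyStats>::GetShared(
  //       cache, clock, &collector);
  //   if (s.ok()) {
  //     collector->CollectStats(/*min_interval_seconds=*/180,
  //                             /*min_interval_factor=*/10);
  //     MyStats stats;
  //     collector->GetStats(&stats);
  //   }
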
 private:
  explicit CacheEntryStatsCollector(Cache *cache, SystemClock *clock)
      : saved_stats_(),
        working_stats_(),
        last_start_time_micros_(0),
        last_end_time_micros_(/*pessimistic*/ 10000000),
        cache_(cache),
        clock_(clock) {}

  static void Deleter(const Slice &, void *value) {
    delete static_cast<CacheEntryStatsCollector *>(value);
  }

  std::mutex saved_mutex_;
  Stats saved_stats_;

  std::mutex working_mutex_;
  Stats working_stats_;
  uint64_t last_start_time_micros_;
  uint64_t last_end_time_micros_;

  Cache *const cache_;
  SystemClock *const clock_;
};

}  // namespace ROCKSDB_NAMESPACE