0050a73a4f
Summary:
This change standardizes on a new 16-byte cache key format for block cache (incl compressed and secondary) and persistent cache (but not table cache and row cache).

The goal is a really fast cache key with practically ideal stability and uniqueness properties without external dependencies (e.g. from FileSystem). A fixed key size of 16 bytes should enable future optimizations to the concurrent hash table for block cache, which is a heavy CPU user / bottleneck, but there appears to be measurable performance improvement even with no changes to LRUCache.

This change replaces a lot of disjointed and ugly code handling cache keys with calls to a simple, clean new internal API (cache_key.h). (Preserving the old cache key logic under an option would be very ugly and likely negate the performance gain of the new approach. Complete replacement carries some inherent risk, but I think that's acceptable with sufficient analysis and testing.)

The scheme for encoding new cache keys is complicated but explained in cache_key.cc.

Also: EndianSwapValue is moved to math.h to be next to other bit operations. (Explains some new include "math.h".) ReverseBits operation added and unit tests added to hash_test for both.

Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126

Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly because we are no longer relying on filesystem for stable cache keys so table builders & readers need more context info to agree on cache keys. This functionality is so core, a huge number of existing tests exercise the cache key functionality.

### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)

### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime

We use a simple table with skewed address assignment and replacement on address collision to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output with `./cache_bench -stress_cache_key -sck_keep_bits=40`:

```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```

These come from default settings of 2.5M files per day of 32 MB each, and `-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.

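The write rate and correction factor in the output above can be re-derived from the stated defaults. The following is only a sketch of that arithmetic, not code from cache_bench; the constant and function names are made up for illustration, and it assumes "32 MB" means 2\*\*25 bytes (32 MiB), which is what the 2\*\*25 contiguous keys per file imply.

```cpp
#include <cassert>
#include <cmath>

// Defaults quoted in the test plan (names here are illustrative only).
constexpr double kFilesPerDay = 2.5e6;
constexpr double kFileSizeMiB = 32.0;  // assumption: 32 MiB == 2**25 bytes
constexpr int kKeyBits = 128;          // full cache key width
constexpr int kFileOffsetBits = 25;    // 2**25 contiguous keys per file

// Daily write volume in TiB: files/day * MiB/file / (MiB per TiB).
double WriteRateTiBPerDay() {
  return kFilesPerDay * kFileSizeMiB / (1024.0 * 1024.0);
}

// Same volume expressed as MiB per second.
double WriteRateMiBPerSec() {
  return kFilesPerDay * kFileSizeMiB / 86400.0;
}

// Factor by which the simulation is more collision-prone than reality when
// only `keep_bits` bits of the key identify a file: 2**(128 - keep_bits - 25).
double CorrectionFactor(int keep_bits) {
  assert(keep_bits > 0 && keep_bits + kFileOffsetBits <= kKeyBits);
  return std::ldexp(1.0, kKeyBits - keep_bits - kFileOffsetBits);
}
```

With `keep_bits = 40` the exponent is 63, so the factor is 2\*\*63, which is the 9.22337e+18 printed by the tool.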
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day

After enough data, we get a result at the end:

```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```

If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:

```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```

The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:

```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```

I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.

Reviewed By: zhichao-cao

Differential Revision: D33171746

Pulled By: pdillinger

fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
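The "corrected" figures in the test plan can be checked by hand: divide the simulated days by the number of collisions, then scale by 2\*\*(128 - keep_bits - 25). The sketch below only illustrates that calculation; the function names are hypothetical and are not part of cache_bench.

```cpp
#include <cassert>
#include <cmath>

// Days between collisions as observed in the reduced-precision simulation.
double SimulatedDaysBetween(int collisions, double simulated_days) {
  assert(collisions > 0);
  return simulated_days / collisions;
}

// Scale the simulated estimate back up to full 128-bit keys. The simulation
// keeps `keep_bits` bits per file and assumes 2**25 contiguous keys per file,
// so it is 2**(128 - keep_bits - 25) times more collision-prone than reality.
double CorrectedDaysBetween(int collisions, double simulated_days,
                            int keep_bits) {
  const int lost_bits = 128 - keep_bits - 25;
  return std::ldexp(SimulatedDaysBetween(collisions, simulated_days),
                    lost_bits);
}
```

For example, `CorrectedDaysBetween(17, 2 * 90, 40)` is about 9.766e19 days, matching the first result; dividing by a billion machines gives roughly 97 billion days.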
184 lines
6.8 KiB
C++
// Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#pragma once

#include <algorithm>  // std::max
#include <array>
#include <cassert>  // assert
#include <cstdint>
#include <memory>
#include <mutex>

#include "cache/cache_helpers.h"
#include "cache/cache_key.h"
#include "port/lang.h"
#include "rocksdb/cache.h"
#include "rocksdb/status.h"
#include "rocksdb/system_clock.h"
#include "test_util/sync_point.h"
#include "util/coding_lean.h"

namespace ROCKSDB_NAMESPACE {

// A generic helper object for gathering stats about cache entries by
// iterating over them with ApplyToAllEntries. This class essentially
// solves the problem of slowing down a Cache with too many stats
// collectors that could be sharing stat results, such as from multiple
// column families or multiple DBs sharing a Cache. We employ a few
// mitigations:
// * Only one collector for a particular kind of Stats is alive
// for each Cache. This is guaranteed using the Cache itself to hold
// the collector.
// * A mutex ensures only one thread is gathering stats for this
// collector.
// * The most recent gathered stats are saved and simply copied to
// satisfy requests within a time window (default: 3 minutes) of
// completion of the most recent stat gathering.
//
// Template parameter Stats must be copyable and default-constructible,
// as well as...
// concept Stats {
//   // Notification before applying callback to all entries
//   void BeginCollection(Cache*, SystemClock*, uint64_t start_time_micros);
//   // Get the callback to apply to all entries. `callback`
//   // type must be compatible with Cache::ApplyToAllEntries
//   callback GetEntryCallback();
//   // Notification after applying callback to all entries
//   void EndCollection(Cache*, SystemClock*, uint64_t end_time_micros);
//   // Notification that a collection was skipped because of
//   // sufficiently recent saved results.
//   void SkippedCollection();
// }
template <class Stats>
class CacheEntryStatsCollector {
 public:
  // Gather and save stats if saved stats are too old. (Use GetStats() to
  // read saved stats.)
  //
  // Maximum allowed age for a "hit" on saved results is determined by the
  // two interval parameters. Both set to 0 forces a re-scan. For example
  // with min_interval_seconds=300 and min_interval_factor=100, if the last
  // scan took 10s, we would only rescan ("miss") if the age in seconds of
  // the saved results is > max(300, 100*10).
  // Justification: scans can vary wildly in duration, e.g. from 0.02 sec
  // to as much as 20 seconds, so we want to be able to cap the absolute
  // and relative frequency of scans.
  void CollectStats(int min_interval_seconds, int min_interval_factor) {
    // Waits for any pending reader or writer (collector)
    std::lock_guard<std::mutex> lock(working_mutex_);

    uint64_t max_age_micros =
        static_cast<uint64_t>(std::max(min_interval_seconds, 0)) * 1000000U;

    if (last_end_time_micros_ > last_start_time_micros_ &&
        min_interval_factor > 0) {
      max_age_micros = std::max(
          max_age_micros, min_interval_factor * (last_end_time_micros_ -
                                                 last_start_time_micros_));
    }

    uint64_t start_time_micros = clock_->NowMicros();
    if ((start_time_micros - last_end_time_micros_) > max_age_micros) {
      last_start_time_micros_ = start_time_micros;
      working_stats_.BeginCollection(cache_, clock_, start_time_micros);

      cache_->ApplyToAllEntries(working_stats_.GetEntryCallback(), {});
      TEST_SYNC_POINT_CALLBACK(
          "CacheEntryStatsCollector::GetStats:AfterApplyToAllEntries", nullptr);

      uint64_t end_time_micros = clock_->NowMicros();
      last_end_time_micros_ = end_time_micros;
      working_stats_.EndCollection(cache_, clock_, end_time_micros);
    } else {
      working_stats_.SkippedCollection();
    }

    // Save so that we don't need to wait for an outstanding collection in
    // order to make a copy of the last saved stats
    std::lock_guard<std::mutex> lock2(saved_mutex_);
    saved_stats_ = working_stats_;
  }

  // Gets saved stats, regardless of age
  void GetStats(Stats *stats) {
    std::lock_guard<std::mutex> lock(saved_mutex_);
    *stats = saved_stats_;
  }

  Cache *GetCache() const { return cache_; }

  // Gets or creates a shared instance of CacheEntryStatsCollector in the
  // cache itself, and saves into `ptr`. This shared_ptr will hold the
  // entry in cache until all refs are destroyed.
  static Status GetShared(Cache *cache, SystemClock *clock,
                          std::shared_ptr<CacheEntryStatsCollector> *ptr) {
    const Slice &cache_key = GetCacheKey();

    Cache::Handle *h = cache->Lookup(cache_key);
    if (h == nullptr) {
      // Not yet in cache, but Cache doesn't provide a built-in way to
      // avoid racing insert. So we double-check under a shared mutex,
      // inspired by TableCache.
      STATIC_AVOID_DESTRUCTION(std::mutex, static_mutex);
      std::lock_guard<std::mutex> lock(static_mutex);

      h = cache->Lookup(cache_key);
      if (h == nullptr) {
        auto new_ptr = new CacheEntryStatsCollector(cache, clock);
        // TODO: non-zero charge causes some tests that count block cache
        // usage to go flaky. Fix the problem somehow so we can use an
        // accurate charge.
        size_t charge = 0;
        Status s = cache->Insert(cache_key, new_ptr, charge, Deleter, &h,
                                 Cache::Priority::HIGH);
        if (!s.ok()) {
          assert(h == nullptr);
          delete new_ptr;
          return s;
        }
      }
    }
    // If we reach here, shared entry is in cache with handle `h`.
    assert(cache->GetDeleter(h) == Deleter);

    // Build an aliasing shared_ptr that keeps `ptr` in cache while there
    // are references.
    *ptr = MakeSharedCacheHandleGuard<CacheEntryStatsCollector>(cache, h);
    return Status::OK();
  }

 private:
  explicit CacheEntryStatsCollector(Cache *cache, SystemClock *clock)
      : saved_stats_(),
        working_stats_(),
        last_start_time_micros_(0),
        last_end_time_micros_(/*pessimistic*/ 10000000),
        cache_(cache),
        clock_(clock) {}

  static void Deleter(const Slice &, void *value) {
    delete static_cast<CacheEntryStatsCollector *>(value);
  }

  static const Slice &GetCacheKey() {
    // For each template instantiation
    static CacheKey ckey = CacheKey::CreateUniqueForProcessLifetime();
    static Slice ckey_slice = ckey.AsSlice();
    return ckey_slice;
  }

  std::mutex saved_mutex_;
  Stats saved_stats_;

  std::mutex working_mutex_;
  Stats working_stats_;
  uint64_t last_start_time_micros_;
  uint64_t last_end_time_micros_;

  Cache *const cache_;
  SystemClock *const clock_;
};

}  // namespace ROCKSDB_NAMESPACE
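
The max-age rule described in the comment on `CollectStats` can be isolated into a small standalone sketch. `MaxAgeMicros` and `ShouldRescan` are hypothetical helper names introduced here for illustration only; they do not exist in RocksDB, and the sketch additionally assumes a clock that does not run backwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Maximum age (in microseconds) at which saved stats still count as a "hit":
// max(min_interval_seconds, min_interval_factor * duration of the last scan).
uint64_t MaxAgeMicros(int min_interval_seconds, int min_interval_factor,
                      uint64_t last_start_micros, uint64_t last_end_micros) {
  // Negative intervals are clamped to zero before converting to micros.
  uint64_t max_age =
      static_cast<uint64_t>(std::max(min_interval_seconds, 0)) * 1000000U;
  // The relative cap only applies once a scan with positive duration exists.
  if (last_end_micros > last_start_micros && min_interval_factor > 0) {
    uint64_t scan_micros = last_end_micros - last_start_micros;
    max_age = std::max(
        max_age, static_cast<uint64_t>(min_interval_factor) * scan_micros);
  }
  return max_age;
}

// A rescan happens only when the saved results are strictly older than the
// maximum age.
bool ShouldRescan(uint64_t now_micros, uint64_t last_end_micros,
                  uint64_t max_age_micros) {
  assert(now_micros >= last_end_micros);  // sketch-only assumption
  return (now_micros - last_end_micros) > max_age_micros;
}
```

With `min_interval_seconds=300`, `min_interval_factor=100`, and a last scan of 10 seconds, `MaxAgeMicros` returns 1,000 seconds in microseconds, which is the `max(300, 100*10)` case from the comment.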
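
`GetShared` looks up the shared collector, and on a miss repeats the lookup under a static mutex before inserting, so that racing callers do not create two instances. Below is a minimal sketch of that double-checked get-or-create pattern. `ToyCache` and `GetOrCreate` are stand-ins invented for this example, not RocksDB types: the toy cache offers thread-safe `Lookup` and `Insert` but no atomic "get or insert", which is the situation the comment in `GetShared` describes.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Stand-in for a thread-safe cache: each call is atomic on its own, but
// there is no combined lookup-or-insert operation.
class ToyCache {
 public:
  std::shared_ptr<int> Lookup(const std::string& key) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = map_.find(key);
    if (it == map_.end()) {
      return nullptr;
    }
    return it->second;
  }

  void Insert(const std::string& key, std::shared_ptr<int> value) {
    std::lock_guard<std::mutex> lock(mu_);
    map_[key] = std::move(value);
  }

 private:
  std::mutex mu_;
  std::map<std::string, std::shared_ptr<int>> map_;
};

// Returns the entry for `key`, creating it with `initial` at most once even
// if several threads miss at the same time.
std::shared_ptr<int> GetOrCreate(ToyCache& cache, const std::string& key,
                                 int initial) {
  std::shared_ptr<int> value = cache.Lookup(key);  // fast path, no extra lock
  if (value == nullptr) {
    static std::mutex create_mutex;  // serializes creators only
    std::lock_guard<std::mutex> lock(create_mutex);
    value = cache.Lookup(key);  // re-check: another thread may have inserted
    if (value == nullptr) {
      value = std::make_shared<int>(initial);
      cache.Insert(key, value);
    }
  }
  assert(value != nullptr);  // entry exists by the time we get here
  return value;
}
```

Two calls with the same key return the same object, and the second call's `initial` argument is ignored because the entry already exists.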