rocksdb/cache/lru_cache_test.cc
Peter Dillinger 78a309bf86 New Cache API for gathering statistics (#8225)
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:

* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)

I have enhanced cache_bench to validate this approach:

* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.

To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225

Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.

The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.

Baseline typical output:

```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662

Operation latency (ns):
Count: 80000000 Average: 11223.9494  StdDev: 29.61
Min: 0  Median: 7759.3973  Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[       0,       1 ]       68   0.000%   0.000%
(    2900,    4400 ]       89   0.000%   0.000%
(    4400,    6600 ] 33630240  42.038%  42.038% ########
(    6600,    9900 ] 18129842  22.662%  64.700% #####
(    9900,   14000 ]  7877533   9.847%  74.547% ##
(   14000,   22000 ] 15193238  18.992%  93.539% ####
(   22000,   33000 ]  3037061   3.796%  97.335% #
(   33000,   50000 ]  1626316   2.033%  99.368%
(   50000,   75000 ]   421532   0.527%  99.895%
(   75000,  110000 ]    56910   0.071%  99.966%
(  110000,  170000 ]    16134   0.020%  99.986%
(  170000,  250000 ]     5166   0.006%  99.993%
(  250000,  380000 ]     3017   0.004%  99.996%
(  380000,  570000 ]     1337   0.002%  99.998%
(  570000,  860000 ]      805   0.001%  99.999%
(  860000, 1200000 ]      319   0.000% 100.000%
( 1200000, 1900000 ]      231   0.000% 100.000%
( 1900000, 2900000 ]      100   0.000% 100.000%
( 2900000, 4300000 ]       39   0.000% 100.000%
( 4300000, 6500000 ]       16   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
```

New, gather_stats=false. Median thread ops/sec of 5 runs:

```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458

Operation latency (ns):
Count: 80000000 Average: 11298.1027  StdDev: 42.18
Min: 0  Median: 7722.0822  Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[       0,       1 ]      109   0.000%   0.000%
(    2900,    4400 ]      793   0.001%   0.001%
(    4400,    6600 ] 34054563  42.568%  42.569% #########
(    6600,    9900 ] 17482646  21.853%  64.423% ####
(    9900,   14000 ]  7908180   9.885%  74.308% ##
(   14000,   22000 ] 15032072  18.790%  93.098% ####
(   22000,   33000 ]  3237834   4.047%  97.145% #
(   33000,   50000 ]  1736882   2.171%  99.316%
(   50000,   75000 ]   446851   0.559%  99.875%
(   75000,  110000 ]    68251   0.085%  99.960%
(  110000,  170000 ]    18592   0.023%  99.983%
(  170000,  250000 ]     7200   0.009%  99.992%
(  250000,  380000 ]     3334   0.004%  99.997%
(  380000,  570000 ]     1393   0.002%  99.998%
(  570000,  860000 ]      700   0.001%  99.999%
(  860000, 1200000 ]      293   0.000% 100.000%
( 1200000, 1900000 ]      196   0.000% 100.000%
( 1900000, 2900000 ]       69   0.000% 100.000%
( 2900000, 4300000 ]       32   0.000% 100.000%
( 4300000, 6500000 ]       10   0.000% 100.000%
```

New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:

```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551

Operation latency (ns):
Count: 80000000 Average: 11311.2629  StdDev: 45.28
Min: 0  Median: 7686.5458  Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[       0,       1 ]       71   0.000%   0.000%
(    2900,    4400 ]      291   0.000%   0.000%
(    4400,    6600 ] 34492060  43.115%  43.116% #########
(    6600,    9900 ] 16727328  20.909%  64.025% ####
(    9900,   14000 ]  7845828   9.807%  73.832% ##
(   14000,   22000 ] 15510654  19.388%  93.220% ####
(   22000,   33000 ]  3216533   4.021%  97.241% #
(   33000,   50000 ]  1680859   2.101%  99.342%
(   50000,   75000 ]   439059   0.549%  99.891%
(   75000,  110000 ]    60540   0.076%  99.967%
(  110000,  170000 ]    14649   0.018%  99.985%
(  170000,  250000 ]     5242   0.007%  99.991%
(  250000,  380000 ]     3260   0.004%  99.995%
(  380000,  570000 ]     1599   0.002%  99.997%
(  570000,  860000 ]     1043   0.001%  99.999%
(  860000, 1200000 ]      471   0.001%  99.999%
( 1200000, 1900000 ]      275   0.000% 100.000%
( 1900000, 2900000 ]      143   0.000% 100.000%
( 2900000, 4300000 ]       60   0.000% 100.000%
( 4300000, 6500000 ]       27   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
( 9800000, 14000000 ]        1   0.000% 100.000%

Gather stats latency (us):
Count: 46 Average: 980387.5870  StdDev: 60911.18
Min: 879155  Median: 1033777.7778  Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
(  860000, 1200000 ]       45  97.826%  97.826% ####################
( 1200000, 1900000 ]        1   2.174% 100.000%

Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```

Reviewed By: mrambacher

Differential Revision: D28295742

Pulled By: pdillinger

fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
2021-05-11 16:17:10 -07:00

201 lines
6.1 KiB
C++

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
#include "cache/lru_cache.h"
#include <string>
#include <vector>
#include "port/port.h"
#include "test_util/testharness.h"
namespace ROCKSDB_NAMESPACE {
class LRUCacheTest : public testing::Test {
public:
LRUCacheTest() {}
~LRUCacheTest() override { DeleteCache(); }
void DeleteCache() {
if (cache_ != nullptr) {
cache_->~LRUCacheShard();
port::cacheline_aligned_free(cache_);
cache_ = nullptr;
}
}
void NewCache(size_t capacity, double high_pri_pool_ratio = 0.0,
bool use_adaptive_mutex = kDefaultToAdaptiveMutex) {
DeleteCache();
cache_ = reinterpret_cast<LRUCacheShard*>(
port::cacheline_aligned_alloc(sizeof(LRUCacheShard)));
new (cache_)
LRUCacheShard(capacity, false /*strict_capacity_limit*/,
high_pri_pool_ratio, use_adaptive_mutex,
kDontChargeCacheMetadata, 24 /*max_upper_hash_bits*/);
}
void Insert(const std::string& key,
Cache::Priority priority = Cache::Priority::LOW) {
EXPECT_OK(cache_->Insert(key, 0 /*hash*/, nullptr /*value*/, 1 /*charge*/,
nullptr /*deleter*/, nullptr /*handle*/,
priority));
}
void Insert(char key, Cache::Priority priority = Cache::Priority::LOW) {
Insert(std::string(1, key), priority);
}
bool Lookup(const std::string& key) {
auto handle = cache_->Lookup(key, 0 /*hash*/);
if (handle) {
cache_->Release(handle);
return true;
}
return false;
}
bool Lookup(char key) { return Lookup(std::string(1, key)); }
void Erase(const std::string& key) { cache_->Erase(key, 0 /*hash*/); }
void ValidateLRUList(std::vector<std::string> keys,
size_t num_high_pri_pool_keys = 0) {
LRUHandle* lru;
LRUHandle* lru_low_pri;
cache_->TEST_GetLRUList(&lru, &lru_low_pri);
LRUHandle* iter = lru;
bool in_high_pri_pool = false;
size_t high_pri_pool_keys = 0;
if (iter == lru_low_pri) {
in_high_pri_pool = true;
}
for (const auto& key : keys) {
iter = iter->next;
ASSERT_NE(lru, iter);
ASSERT_EQ(key, iter->key().ToString());
ASSERT_EQ(in_high_pri_pool, iter->InHighPriPool());
if (in_high_pri_pool) {
high_pri_pool_keys++;
}
if (iter == lru_low_pri) {
ASSERT_FALSE(in_high_pri_pool);
in_high_pri_pool = true;
}
}
ASSERT_EQ(lru, iter->next);
ASSERT_TRUE(in_high_pri_pool);
ASSERT_EQ(num_high_pri_pool_keys, high_pri_pool_keys);
}
private:
LRUCacheShard* cache_ = nullptr;
};
TEST_F(LRUCacheTest, BasicLRU) {
NewCache(5);
for (char ch = 'a'; ch <= 'e'; ch++) {
Insert(ch);
}
ValidateLRUList({"a", "b", "c", "d", "e"});
for (char ch = 'x'; ch <= 'z'; ch++) {
Insert(ch);
}
ValidateLRUList({"d", "e", "x", "y", "z"});
ASSERT_FALSE(Lookup("b"));
ValidateLRUList({"d", "e", "x", "y", "z"});
ASSERT_TRUE(Lookup("e"));
ValidateLRUList({"d", "x", "y", "z", "e"});
ASSERT_TRUE(Lookup("z"));
ValidateLRUList({"d", "x", "y", "e", "z"});
Erase("x");
ValidateLRUList({"d", "y", "e", "z"});
ASSERT_TRUE(Lookup("d"));
ValidateLRUList({"y", "e", "z", "d"});
Insert("u");
ValidateLRUList({"y", "e", "z", "d", "u"});
Insert("v");
ValidateLRUList({"e", "z", "d", "u", "v"});
}
TEST_F(LRUCacheTest, MidpointInsertion) {
// Allocate 2 cache entries to high-pri pool.
NewCache(5, 0.45);
Insert("a", Cache::Priority::LOW);
Insert("b", Cache::Priority::LOW);
Insert("c", Cache::Priority::LOW);
Insert("x", Cache::Priority::HIGH);
Insert("y", Cache::Priority::HIGH);
ValidateLRUList({"a", "b", "c", "x", "y"}, 2);
// Low-pri entries inserted to the tail of low-pri list (the midpoint).
// After lookup, it will move to the tail of the full list.
Insert("d", Cache::Priority::LOW);
ValidateLRUList({"b", "c", "d", "x", "y"}, 2);
ASSERT_TRUE(Lookup("d"));
ValidateLRUList({"b", "c", "x", "y", "d"}, 2);
// High-pri entries will be inserted to the tail of full list.
Insert("z", Cache::Priority::HIGH);
ValidateLRUList({"c", "x", "y", "d", "z"}, 2);
}
TEST_F(LRUCacheTest, EntriesWithPriority) {
// Allocate 2 cache entries to high-pri pool.
NewCache(5, 0.45);
Insert("a", Cache::Priority::LOW);
Insert("b", Cache::Priority::LOW);
Insert("c", Cache::Priority::LOW);
ValidateLRUList({"a", "b", "c"}, 0);
// Low-pri entries can take high-pri pool capacity if available
Insert("u", Cache::Priority::LOW);
Insert("v", Cache::Priority::LOW);
ValidateLRUList({"a", "b", "c", "u", "v"}, 0);
Insert("X", Cache::Priority::HIGH);
Insert("Y", Cache::Priority::HIGH);
ValidateLRUList({"c", "u", "v", "X", "Y"}, 2);
// High-pri entries can overflow to low-pri pool.
Insert("Z", Cache::Priority::HIGH);
ValidateLRUList({"u", "v", "X", "Y", "Z"}, 2);
// Low-pri entries will be inserted to head of low-pri pool.
Insert("a", Cache::Priority::LOW);
ValidateLRUList({"v", "X", "a", "Y", "Z"}, 2);
// Low-pri entries will be inserted to head of high-pri pool after lookup.
ASSERT_TRUE(Lookup("v"));
ValidateLRUList({"X", "a", "Y", "Z", "v"}, 2);
// High-pri entries will be inserted to the head of the list after lookup.
ASSERT_TRUE(Lookup("X"));
ValidateLRUList({"a", "Y", "Z", "v", "X"}, 2);
ASSERT_TRUE(Lookup("Z"));
ValidateLRUList({"a", "Y", "v", "X", "Z"}, 2);
Erase("Y");
ValidateLRUList({"a", "v", "X", "Z"}, 2);
Erase("X");
ValidateLRUList({"a", "v", "Z"}, 1);
Insert("d", Cache::Priority::LOW);
Insert("e", Cache::Priority::LOW);
ValidateLRUList({"a", "v", "d", "e", "Z"}, 1);
Insert("f", Cache::Priority::LOW);
Insert("g", Cache::Priority::LOW);
ValidateLRUList({"d", "e", "f", "g", "Z"}, 1);
ASSERT_TRUE(Lookup("d"));
ValidateLRUList({"e", "f", "g", "Z", "d"}, 2);
}
} // namespace ROCKSDB_NAMESPACE
int main(int argc, char** argv) {
::testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
}