2016-02-09 15:12:00 -08:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2017-07-15 16:03:42 -07:00
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
2013-10-16 14:59:46 -07:00
|
|
|
//
|
2011-03-18 22:37:00 +00:00
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
|
2013-10-04 22:32:05 -07:00
|
|
|
#pragma once
|
2014-02-28 18:19:07 -08:00
|
|
|
|
2021-08-16 20:36:19 -07:00
|
|
|
#include <cstdint>
|
|
|
|
|
New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).
The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.
This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)
The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.
Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.
Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.
### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)
### Collision probability
I have built a tool, `./cache_bench -stress_cache_key`, to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime
We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```
These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day
After enough data, we get a result at the end:
```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```
If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```
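The "corrected" figures in the outputs above can be reproduced by hand: divide the simulated days by the number of collisions, then multiply by 2\*\*(128 - keep_bits - 25). The following is a minimal sketch of that arithmetic, assuming the 25-bit file size stated earlier. The function names are illustrative only and are not part of `cache_bench`.

```cpp
#include <cmath>

// Factor by which the simulation is more collision-prone than reality when
// only `keep_bits` of the 128-bit cache key identify a file and each file
// covers 2**file_bits contiguous keys (25 in the text above).
inline double SimulationCorrection(int keep_bits, int file_bits = 25) {
  return std::ldexp(1.0, 128 - keep_bits - file_bits);  // 2**(128-keep-file)
}

// Estimated days between collisions after applying the correction factor.
inline double CorrectedDaysBetween(int keep_bits, int collisions,
                                   double simulated_days) {
  const double days_between = simulated_days / collisions;
  return days_between * SimulationCorrection(keep_bits);
}
```

For example, `CorrectedDaysBetween(40, 17, 2 * 90)` evaluates to roughly 9.77e+19, in line with the 40-bit result quoted above.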
The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```
I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
Reviewed By: zhichao-cao
Differential Revision: D33171746
Pulled By: pdillinger
fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-16 17:13:55 -08:00
|
|
|
#include "cache/cache_key.h"
|
Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found)
```
Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones? | avg micros/op | stddev micros/op | avg ops/s | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None | 0.6186 | 0.04637 | 1,625,252.90 | 124,679.41
500 Expanded | 0.6019 | 0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded | 0.6435 | 0.03994 | 1,559,979.40 | 104,090.52
1k Expanded | 0.6034 | 0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded | 0.6261 | 0.03093 | 1,600,457.50 | 79,024.94
5k Expanded | 0.6163 | 0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded | 0.6402 | 0.04002 | 1,567,804.70 | 100,965.55
10k Expanded | 0.6036 | 0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded | 0.6128 | 0.02598 | 1,634,633.40 | 72,161.82
25k Expanded | 0.6198 | 0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded | 0.5478 | 0.0362 | 1,833,059.10 | 121,233.81
50k Expanded | 0.5104 | 0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded | 0.4528 | 0.03387 | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493
Differential Revision: D10842844
Pulled By: abhimadan
fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
2018-10-25 19:25:00 -07:00
|
|
|
#include "db/range_tombstone_fragmenter.h"
|
2019-06-14 17:37:24 -07:00
|
|
|
#include "file/filename.h"
|
2022-01-21 11:36:36 -08:00
|
|
|
#include "rocksdb/slice_transform.h"
|
New stable, fixed-length cache keys (#9126)
2021-12-16 17:13:55 -08:00
|
|
|
#include "rocksdb/table_properties.h"
|
2021-09-29 04:01:57 -07:00
|
|
|
#include "table/block_based/block.h"
|
2019-05-30 14:47:29 -07:00
|
|
|
#include "table/block_based/block_based_table_factory.h"
|
2019-06-06 11:28:54 -07:00
|
|
|
#include "table/block_based/block_type.h"
|
De-template block based table iterator (#6531)
Summary:
Right now the block-based table iterator is used both for iterating over the data of a block-based table and as the index iterator for partitioned indexes. This was initially convenient for introducing a new iterator and block type for the new index format while reducing code changes. However, these two usages do not go together very well. For example, Prev() is never called on the partitioned index iterator, and some other complexity is maintained in block-based iterators that is not needed for the index iterator but that maintainers always need to reason about. Furthermore, the template usage does not follow the Google C++ Style that we follow, and it tangles a large chunk of code together. This commit separates the two iterators. Here is what is done so far:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikely to cause a performance regression, as creating a new partitioned index block is much rarer than creating data blocks.
3. Separate the prefetch logic out into a helper class that both classes call.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 12:17:34 -07:00
|
|
|
#include "table/block_based/cachable_entry.h"
|
2019-05-30 14:47:29 -07:00
|
|
|
#include "table/block_based/filter_block.h"
|
2019-07-23 15:57:43 -07:00
|
|
|
#include "table/block_based/uncompression_dict_reader.h"
|
Improve / clean up meta block code & integrity (#9163)
Summary:
* Checksums are now checked on meta blocks unless specifically
suppressed or not applicable (e.g. plain table). (Was other way around.)
This means a number of cases that were not checking checksums now are,
including direct read TableProperties in Version::GetTableProperties
(fixed in meta_blocks ReadTableProperties), reading any block from
PersistentCache (fixed in BlockFetcher), read TableProperties in
SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open,
maybe more.
* For that to work, I moved the global_seqno+TableProperties checksum
logic to the shared table/ code, because that is used by many utilities
such as SstFileDumper.
* Also for that to work, we have to know when we're dealing with a block
that has a checksum (trailer), so added that capability to Footer based
on magic number, and from there BlockFetcher.
* Knowledge of trailer presence has also fixed a problem where other
table formats were reading blocks including bytes for a non-existent
trailer--and awkwardly kind-of not using them, e.g. no shared code
checking checksums. (BlockFetcher compression type was populated
incorrectly.) Now we only read what is needed.
* Minimized code duplication and differing/incompatible/awkward
abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block
without parsing block handle)
* Moved some meta block handling code from table_properties*.*
* Moved some code specific to block-based table from shared table/ code
to BlockBasedTable class. The checksum stuff means we can't completely
separate it, but things that don't need to be in shared table/ code
should not be.
* Use unique_ptr rather than raw ptr in more places. (Note: you can
std::move from unique_ptr to shared_ptr.)
Without enhancements to GetPropertiesOfAllTablesTest (see below),
net reduction of roughly 100 lines of code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163
Test Plan:
existing tests and
* Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that
checksums are now checked on direct read of table properties by TableCache
(new test would fail before this change)
* Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test
putting table properties under old meta name
* Also generally enhanced that same test to actually test what it was
supposed to be testing already, by kicking things out of table cache when
we don't want them there.
Reviewed By: ajkr, mrambacher
Differential Revision: D32514757
Pulled By: pdillinger
fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9
2021-11-18 11:42:12 -08:00
|
|
|
#include "table/format.h"
|
2021-12-10 08:12:09 -08:00
|
|
|
#include "table/persistent_cache_options.h"
|
2014-11-13 14:39:30 -05:00
|
|
|
#include "table/table_properties_internal.h"
|
2015-12-15 18:20:10 -08:00
|
|
|
#include "table/table_reader.h"
|
2017-02-06 16:29:29 -08:00
|
|
|
#include "table/two_level_iterator.h"
|
2019-06-13 15:39:52 -07:00
|
|
|
#include "trace_replay/block_cache_tracer.h"
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2020-02-20 12:07:53 -08:00
|
|
|
namespace ROCKSDB_NAMESPACE {
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2014-02-28 18:19:07 -08:00
|
|
|
class Cache;
|
|
|
|
class FilterBlockReader;
|
2014-09-08 10:37:05 -07:00
|
|
|
class BlockBasedFilterBlockReader;
|
|
|
|
class FullFilterBlockReader;
|
2012-04-17 08:36:46 -07:00
|
|
|
class Footer;
|
2014-02-28 18:19:07 -08:00
|
|
|
class InternalKeyComparator;
|
|
|
|
class Iterator;
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations and OS-related operations. Most of the APIs return a Status, which does not carry enough metadata about an error, such as whether it is retryable or the scope (i.e., fault domain) of the error, that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, and hints about placement and redundancy.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
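The translation described above can be sketched with toy types. This is only an illustration under stated assumptions: `ToyIOStatus`, `ToyStatus`, `ToySeverity`, and `TranslateIOStatus` are hypothetical stand-ins and are not the real RocksDB classes or the actual `CompositeEnvWrapper` code. The severity of non-retryable errors is left at its default because the summary does not specify it.

```cpp
#include <string>

// Hypothetical stand-ins for illustration only; NOT rocksdb::IOStatus or
// rocksdb::Status.
enum class ToySeverity { kNoError, kSoftError };

struct ToyIOStatus {
  bool ok = true;
  bool retryable = false;  // extra metadata that a plain status lacks
  std::string message;
};

struct ToyStatus {
  bool ok = true;
  ToySeverity severity = ToySeverity::kNoError;
  std::string message;
};

// A retryable IO error is mapped to a soft error so that error-handling
// code can attempt automatic recovery. Other cases keep the default
// severity (an assumption; the summary does not say what they map to).
inline ToyStatus TranslateIOStatus(const ToyIOStatus& io) {
  ToyStatus s;
  s.ok = io.ok;
  s.message = io.message;
  if (!io.ok && io.retryable) {
    s.severity = ToySeverity::kSoftError;
  }
  return s;
}
```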
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:47:08 -08:00
|
|
|
class FSRandomAccessFile;
|
2012-04-17 08:36:46 -07:00
|
|
|
class TableCache;
|
2013-10-30 10:52:33 -07:00
|
|
|
class TableReader;
|
2014-02-28 18:19:07 -08:00
|
|
|
class WritableFile;
|
2014-01-24 10:57:15 -08:00
|
|
|
struct BlockBasedTableOptions;
|
2014-02-28 18:19:07 -08:00
|
|
|
struct EnvOptions;
|
|
|
|
struct ReadOptions;
|
2014-09-29 11:09:09 -07:00
|
|
|
class GetContext;
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2021-09-07 11:31:12 -07:00
|
|
|
using KVPairBlock = std::vector<std::pair<std::string, std::string>>;
|
2016-08-01 14:50:19 -07:00
|
|
|
|
2019-05-24 12:26:58 -07:00
|
|
|
// Reader class for BlockBasedTable format.
|
|
|
|
// For the format of BlockBasedTable refer to
|
|
|
|
// https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format.
|
|
|
|
// This is the default table type. Data is chunked into fixed size blocks and
|
|
|
|
// each block in turn stores entries. When storing data, we can compress and/or
|
|
|
|
// encode data efficiently within a block, which often results in a much smaller
|
|
|
|
// data size compared with the raw data size. As for the record retrieval, we'll
|
|
|
|
// first locate the block where target record may reside, then read the block to
|
|
|
|
// memory, and finally search that record within the block. Of course, to avoid
|
|
|
|
// frequent reads of the same block, we introduced the block cache to keep the
|
|
|
|
// loaded blocks in the memory.
|
2013-10-30 10:52:33 -07:00
|
|
|
class BlockBasedTable : public TableReader {
|
2011-03-18 22:37:00 +00:00
|
|
|
public:
|
2013-10-10 11:43:24 -07:00
|
|
|
static const std::string kFilterBlockPrefix;
|
2014-09-08 10:37:05 -07:00
|
|
|
static const std::string kFullFilterBlockPrefix;
|
2017-03-07 13:48:02 -08:00
|
|
|
static const std::string kPartitionedFilterBlockPrefix;
|
2013-10-10 11:43:24 -07:00
|
|
|
|
2019-08-16 16:40:09 -07:00
|
|
|
// All the below fields control iterator readahead
|
|
|
|
static const size_t kInitAutoReadaheadSize = 8 * 1024;
|
|
|
|
static const int kMinNumFileReadsToStartAutoReadahead = 2;
|
|
|
|
|
Improve / clean up meta block code & integrity (#9163)
2021-11-18 11:42:12 -08:00
|
|
|
// 1-byte compression type + 32-bit checksum
|
|
|
|
static constexpr size_t kBlockTrailerSize = 5;
|
|
|
|
|
2011-03-28 20:43:44 +00:00
|
|
|
// Attempt to open the table that is stored in bytes [0..file_size)
|
|
|
|
// of "file", and read the metadata entries necessary to allow
|
|
|
|
// retrieving data from the table.
|
2011-03-18 22:37:00 +00:00
|
|
|
//
|
2013-10-30 10:52:33 -07:00
|
|
|
// If successful, returns ok and sets "*table_reader" to the newly opened
|
|
|
|
// table. The client should delete "*table_reader" when no longer needed.
|
|
|
|
// If there was an error while initializing the table, sets "*table_reader"
|
|
|
|
// to nullptr and returns a non-ok status.
|
2011-03-18 22:37:00 +00:00
|
|
|
//
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
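The two conditions in the summary above reduce to a simple predicate. The sketch below is an assumption-laden illustration: `ToyFileLoadContext` and `ShouldSkipFilters` are hypothetical names, and the real code propagates a `skip_filters` flag from version_set to block_based_table_reader rather than calling a function like this.

```cpp
// Hypothetical inputs for illustration; not an actual RocksDB structure.
struct ToyFileLoadContext {
  bool optimize_filters_for_hits;  // option named in the summary above
  int file_level;                  // level of the file being loaded
  int bottommost_level;            // deepest level, assumed known to caller
};

// Filters are skipped only when both conditions from the summary hold:
// (1) optimize_filters_for_hits is set and (2) the file is on the
// bottommost level.
inline bool ShouldSkipFilters(const ToyFileLoadContext& ctx) {
  return ctx.optimize_filters_for_hits &&
         ctx.file_level == ctx.bottommost_level;
}
```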
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
2015-12-23 10:15:07 -08:00
|
|
|
// @param file must remain live while this Table is in use.
|
2016-07-20 11:23:31 -07:00
|
|
|
// @param prefetch_index_and_filter_in_cache can be used to disable
|
|
|
|
// prefetching of
|
|
|
|
// index and filter blocks into block cache at startup
|
Skip bottom-level filter block caching when hit-optimized
2015-12-23 10:15:07 -08:00
|
|
|
// @param skip_filters Disables loading/accessing the filter block. Overrides
|
2016-07-20 11:23:31 -07:00
|
|
|
// prefetch_index_and_filter_in_cache, so filter will be skipped if both
|
|
|
|
// are set.
|
2020-05-12 18:21:32 -07:00
|
|
|
// @param force_direct_prefetch if true, always prefetch to RocksDB
|
|
|
|
// buffer, rather than calling RandomAccessFile::Prefetch().
|
2022-01-21 11:36:36 -08:00
|
|
|
static Status Open(
|
|
|
|
const ReadOptions& ro, const ImmutableOptions& ioptions,
|
|
|
|
const EnvOptions& env_options,
|
|
|
|
const BlockBasedTableOptions& table_options,
|
|
|
|
const InternalKeyComparator& internal_key_comparator,
|
|
|
|
std::unique_ptr<RandomAccessFileReader>&& file, uint64_t file_size,
|
|
|
|
std::unique_ptr<TableReader>* table_reader,
|
|
|
|
const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr,
|
|
|
|
bool prefetch_index_and_filter_in_cache = true, bool skip_filters = false,
|
|
|
|
int level = -1, const bool immortal_table = false,
|
|
|
|
const SequenceNumber largest_seqno = 0,
|
|
|
|
bool force_direct_prefetch = false,
|
|
|
|
TailPrefetchStats* tail_prefetch_stats = nullptr,
|
|
|
|
BlockCacheTracer* const block_cache_tracer = nullptr,
|
|
|
|
size_t max_file_size_for_l0_meta_pin = 0,
|
|
|
|
const std::string& cur_db_session_id = "", uint64_t cur_file_num = 0);
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2018-05-21 14:33:55 -07:00
|
|
|
bool PrefixMayMatch(const Slice& internal_key,
|
2018-06-26 15:56:26 -07:00
|
|
|
const ReadOptions& read_options,
|
|
|
|
const SliceTransform* options_prefix_extractor,
|
2019-06-10 15:30:05 -07:00
|
|
|
const bool need_upper_bound_check,
|
|
|
|
BlockCacheLookupContext* lookup_context) const;
|
2013-08-13 14:04:56 -07:00
|
|
|
|
2011-03-18 22:37:00 +00:00
|
|
|
// Returns a new iterator over the table contents.
|
|
|
|
// The result of NewIterator() is initially invalid (caller must
|
|
|
|
// call one of the Seek methods on the iterator before using it).
|
2020-08-03 15:21:56 -07:00
|
|
|
// @param read_options Must outlive the returned iterator.
|
Skip bottom-level filter block caching when hit-optimized
2015-12-23 10:15:07 -08:00
|
|
|
// @param skip_filters Disables loading/accessing the filter block
|
2019-06-20 14:28:22 -07:00
|
|
|
// compaction_readahead_size: its value will only be used if caller =
|
|
|
|
// kCompaction.
|
|
|
|
InternalIterator* NewIterator(const ReadOptions&,
|
|
|
|
const SliceTransform* prefix_extractor,
|
|
|
|
Arena* arena, bool skip_filters,
|
|
|
|
TableReaderCaller caller,
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:37:23 -07:00
|
|
|
size_t compaction_readahead_size = 0,
|
|
|
|
bool allow_unprepared_value = false) override;
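
The commit message above describes deferred value loading: a call site that opts in with `allow_unprepared_value` has to call `PrepareValue()` before `value()`, so that an IO error has somewhere to be reported. A minimal sketch of that calling pattern follows. `DeferredValueIter`, `SketchStatus`, the constructor arguments, and the `bool` return type are hypothetical stand-ins for illustration, not the RocksDB classes.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a status code; not RocksDB's Status class.
enum class SketchStatus { kOk, kIOError };

// Hypothetical iterator that knows its key up front (as if read from the
// index) but loads the value only when asked to.
class DeferredValueIter {
 public:
  DeferredValueIter(std::string key, std::string stored_value, bool io_fails)
      : key_(std::move(key)),
        stored_value_(std::move(stored_value)),
        io_fails_(io_fails) {}

  const std::string& key() const { return key_; }

  // Loads the value on demand. Returns false and records an error status
  // if the simulated block read fails.
  bool PrepareValue() {
    if (prepared_) return true;
    if (io_fails_) {
      status_ = SketchStatus::kIOError;
      return false;
    }
    value_ = stored_value_;
    prepared_ = true;
    return true;
  }

  // Only valid after PrepareValue() has returned true.
  const std::string& value() const {
    assert(prepared_);
    return value_;
  }

  SketchStatus status() const { return status_; }

 private:
  std::string key_;
  std::string stored_value_;  // simulates the data block contents
  std::string value_;
  bool io_fails_;
  bool prepared_ = false;
  SketchStatus status_ = SketchStatus::kOk;
};
```

A caller would check the result of `PrepareValue()` and consult `status()` on failure instead of reading `value()`, which is the error path that a plain `value()` returning a slice cannot express.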
|
2013-10-28 17:54:09 -07:00
|
|
|
|
2018-11-28 15:26:56 -08:00
|
|
|
FragmentedRangeTombstoneIterator* NewRangeTombstoneIterator(
|
2016-08-19 15:10:31 -07:00
|
|
|
const ReadOptions& read_options) override;
|
|
|
|
|
2015-12-23 10:15:07 -08:00
|
|
|
// @param skip_filters Disables loading/accessing the filter block
|
2014-01-27 21:58:46 -08:00
|
|
|
Status Get(const ReadOptions& readOptions, const Slice& key,
|
2018-05-21 14:33:55 -07:00
|
|
|
GetContext* get_context, const SliceTransform* prefix_extractor,
|
|
|
|
bool skip_filters = false) override;
|
2011-03-18 22:37:00 +00:00
|
|
|
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queried, as well as with larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:24:09 -07:00
|
|
|
void MultiGet(const ReadOptions& readOptions,
|
|
|
|
const MultiGetContext::Range* mget_range,
|
|
|
|
const SliceTransform* prefix_extractor,
|
|
|
|
bool skip_filters = false) override;
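
The MultiGet commit message above says the batching implementation groups keys by SST file so that each file is consulted once per batch. A rough sketch of that grouping step, under simplifying assumptions: `SketchFileRange` and `SketchGroupKeysByFile` are hypothetical names, keys are compared bytewise as plain strings, and file ranges are assumed not to overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one SST file's key range (inclusive bounds).
struct SketchFileRange {
  std::string smallest;
  std::string largest;
};

// Groups lookup keys by the index of the file whose range contains them.
// Keys that fall in no file are dropped from the result.
std::map<std::size_t, std::vector<std::string>> SketchGroupKeysByFile(
    const std::vector<SketchFileRange>& files,
    const std::vector<std::string>& keys) {
  std::map<std::size_t, std::vector<std::string>> batches;
  for (const std::string& key : keys) {
    for (std::size_t i = 0; i < files.size(); ++i) {
      assert(files[i].smallest <= files[i].largest);
      if (files[i].smallest <= key && key <= files[i].largest) {
        batches[i].push_back(key);
        break;  // ranges are assumed disjoint, so the first match is the only one
      }
    }
  }
  return batches;
}
```

Each resulting batch could then be handed to a per-file lookup in one call, which is where the "fewer function calls" benefit listed in the commit message would come from.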
|
|
|
|
|
2015-03-02 17:07:03 -08:00
|
|
|
// Pre-fetch the disk blocks that correspond to the key range specified by
|
2016-04-28 17:30:44 +08:00
|
|
|
// (begin, end). The call will return an error status in the event of
|
2015-03-02 17:07:03 -08:00
|
|
|
// IO or iteration error.
|
|
|
|
Status Prefetch(const Slice* begin, const Slice* end) override;
|
|
|
|
|
2011-03-18 22:37:00 +00:00
|
|
|
// Given a key, return an approximate byte offset in the file where
|
|
|
|
// the data for that key begins (or would begin if the key were
|
2019-08-16 14:16:49 -07:00
|
|
|
// present in the file). The returned value is in terms of file
|
2011-03-18 22:37:00 +00:00
|
|
|
// bytes, and so includes effects like compression of the underlying data.
|
|
|
|
// E.g., the approximate offset of the last key in the table will
|
|
|
|
// be close to the file length.
|
2019-06-20 14:28:22 -07:00
|
|
|
uint64_t ApproximateOffsetOf(const Slice& key,
|
|
|
|
TableReaderCaller caller) override;
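
The comment on `ApproximateOffsetOf` describes the result as a file offset near where the key's data begins, approaching the file length for the last key. One way to picture that is a lookup in a sorted per-block index. The sketch below assumes a hypothetical `SketchIndexEntry` holding each block's largest key and start offset, and uses plain bytewise string comparison; it is an illustration of the idea, not the implementation behind this declaration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical index entry: one per data block, sorted by last_key.
struct SketchIndexEntry {
  std::string last_key;   // largest key stored in the block
  uint64_t block_offset;  // file offset at which the block starts
};

// Returns the start offset of the first block that could contain `key`,
// or `data_end_offset` when the key sorts after every block.
uint64_t SketchApproximateOffsetOf(const std::vector<SketchIndexEntry>& index,
                                   uint64_t data_end_offset,
                                   const std::string& key) {
  assert(std::is_sorted(index.begin(), index.end(),
                        [](const SketchIndexEntry& a,
                           const SketchIndexEntry& b) {
                          return a.last_key < b.last_key;
                        }));
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const SketchIndexEntry& entry, const std::string& k) {
        return entry.last_key < k;
      });
  return it == index.end() ? data_end_offset : it->block_offset;
}
```

Because the offsets are positions in the file as stored, they already reflect compression, which matches the "in terms of file bytes" wording of the comment.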
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2019-08-16 14:16:49 -07:00
|
|
|
// Given start and end keys, return the approximate data size in the file
|
|
|
|
// between the keys. The returned value is in terms of file bytes, and so
|
|
|
|
// includes effects like compression of the underlying data.
|
|
|
|
// The start key must not be greater than the end key.
|
|
|
|
uint64_t ApproximateSize(const Slice& start, const Slice& end,
|
|
|
|
TableReaderCaller caller) override;
|
|
|
|
|
Move the index readers out of the block cache (#5298)
Summary:
Currently, when the block cache is used for index blocks as well, it is
not really the index block that is stored in the cache but an
IndexReader object. Since this object is not pure data (it has, for
instance, pointers that might dangle), it's not really sharable. To
avoid the issues around this, the current code uses a dummy unique cache
key for each TableReader to store the IndexReader, and erases the
IndexReader entry when the TableReader is closed. Instead of doing this,
the new code moves the IndexReader out of the cache altogether. In
particular, instead of the TableReader owning, or caching/pinning the
IndexReader based on the customer's settings, the TableReader
unconditionally owns the IndexReader, which in turn owns/caches/pins
the index block (which is itself sharable and thus can be safely put in
the cache without any hacks).
Note: the change has two side effects:
1) Partitions of partitioned indexes no longer affect the read
amplification statistics.
2) Eviction statistics for index blocks are temporarily broken. We plan to fix
this in a separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298
Differential Revision: D15303203
Pulled By: ltamasi
fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
2019-05-30 11:49:36 -07:00
|
|
|
bool TEST_BlockInCache(const BlockHandle& handle) const;
|
|
|
|
|
2013-01-31 15:20:24 -08:00
|
|
|
// Returns true if the block for the specified key is in cache.
|
2014-06-20 10:23:02 +02:00
|
|
|
// REQUIRES: key is in this table && block cache enabled
|
2014-01-27 21:58:46 -08:00
|
|
|
bool TEST_KeyInCache(const ReadOptions& options, const Slice& key);
|
2013-01-31 15:20:24 -08:00
|
|
|
|
2013-06-13 17:25:09 -07:00
|
|
|
// Set up the table for Compaction. Might change some parameters with
|
|
|
|
// posix_fadvise
|
2013-10-28 17:54:09 -07:00
|
|
|
void SetupForCompaction() override;
|
|
|
|
|
2014-02-07 19:26:49 -08:00
|
|
|
std::shared_ptr<const TableProperties> GetTableProperties() const override;
|
2013-05-17 15:53:01 -07:00
|
|
|
|
2014-08-05 11:27:34 -07:00
|
|
|
size_t ApproximateMemoryUsage() const override;
|
|
|
|
|
2014-12-23 13:24:07 -08:00
|
|
|
// Convert the SST file to a human-readable form.
|
Move the filter readers out of the block cache (#5504)
Summary:
Currently, when the block cache is used for the filter block, it is not
really the block itself that is stored in the cache but a FilterBlockReader
object. Since this object is not pure data (it has, for instance, pointers that
might dangle, including in one case a back pointer to the TableReader), it's not
really sharable. To avoid the issues around this, the current code erases the
cache entries when the TableReader is closed (which, BTW, is not sufficient
since a concurrent TableReader might have picked up the object in the meantime).
Instead of doing this, the patch moves the FilterBlockReader out of the cache
altogether, and decouples the filter reader object from the filter block.
In particular, instead of the TableReader owning, or caching/pinning the
FilterBlockReader (based on the customer's settings), with the change the
TableReader unconditionally owns the FilterBlockReader, which in turn
owns/caches/pins the filter block. This change also enables us to reuse the code
paths historically used for data blocks for filters as well.
Note:
Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
Test Plan: make asan_check
Differential Revision: D16036974
Pulled By: ltamasi
fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
2019-07-16 13:11:23 -07:00
|
|
|
Status DumpTable(WritableFile* out_file) override;
|
2014-12-23 13:24:07 -08:00
|
|
|
|
2019-08-16 16:40:09 -07:00
|
|
|
Status VerifyChecksum(const ReadOptions& readOptions,
|
|
|
|
TableReaderCaller caller) override;
|
2017-08-09 15:49:40 -07:00
|
|
|
|
2013-10-28 17:54:09 -07:00
|
|
|
~BlockBasedTable();
|
2013-10-10 11:43:24 -07:00
|
|
|
|
2019-07-16 13:11:23 -07:00
|
|
|
bool TEST_FilterBlockInCache() const;
|
2019-05-30 11:49:36 -07:00
|
|
|
bool TEST_IndexBlockInCache() const;
|
2017-03-03 18:09:43 -08:00
|
|
|
|
2019-05-30 11:49:36 -07:00
|
|
|
// IndexReader is the interface that provides the functionality for index
|
2017-03-03 18:09:43 -08:00
|
|
|
// access.
|
|
|
|
class IndexReader {
|
|
|
|
public:
|
2019-05-30 11:49:36 -07:00
|
|
|
virtual ~IndexReader() = default;
|
|
|
|
|
|
|
|
// Create an iterator for index access. If iter is null, then a new object
|
|
|
|
// is created on the heap, and the callee will have the ownership.
|
|
|
|
// If a non-null iter is passed in, it will be used, and the returned value
|
|
|
|
// is either the same as iter or a new on-heap object that
|
|
|
|
// wraps the passed iter. In the latter case the return value points
|
|
|
|
// to a different object than iter, and the callee has the ownership of the
|
2017-03-03 18:09:43 -08:00
|
|
|
// returned object.
|
Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in a small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:50:35 -07:00
|
|
|
virtual InternalIteratorBase<IndexValue>* NewIterator(
|
2019-05-30 11:49:36 -07:00
|
|
|
const ReadOptions& read_options, bool disable_prefix_seek,
|
2019-06-10 15:30:05 -07:00
|
|
|
IndexBlockIter* iter, GetContext* get_context,
|
|
|
|
BlockCacheLookupContext* lookup_context) = 0;
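
The comment on `NewIterator` says the returned pointer is either the `iter` that was passed in or a different heap object. A small sketch of how a call site might tell the two cases apart is given below. It reads "ownership" as meaning that whoever receives a pointer different from `iter` is responsible for freeing it, which is an interpretation rather than something this header spells out. `SketchIter`, `SketchNewIterator`, and `SketchAdoptIfNew` are hypothetical names.

```cpp
#include <cassert>
#include <memory>

// Hypothetical iterator type standing in for an index iterator.
struct SketchIter {
  bool heap_allocated = false;
};

// Reuses `iter` when one is supplied; otherwise creates a new heap object.
SketchIter* SketchNewIterator(SketchIter* iter) {
  if (iter != nullptr) {
    return iter;
  }
  SketchIter* fresh = new SketchIter();
  fresh->heap_allocated = true;
  return fresh;
}

// Wraps the returned pointer in a unique_ptr only when it is a different
// object from the one passed in, so a reused iterator is never deleted.
std::unique_ptr<SketchIter> SketchAdoptIfNew(SketchIter* returned,
                                             SketchIter* passed) {
  assert(returned != nullptr);
  return std::unique_ptr<SketchIter>(returned != passed ? returned : nullptr);
}
```

With this shape, a call site can keep a stack-allocated iterator for the common case and hold the guard alongside it, so cleanup happens only when a separate object was actually created.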
|
2019-05-30 11:49:36 -07:00
|
|
|
|
2017-03-03 18:09:43 -08:00
|
|
|
// Report an approximation of how much memory has been used other than
|
2019-05-30 11:49:36 -07:00
|
|
|
// memory that was allocated in block cache.
|
2017-03-03 18:09:43 -08:00
|
|
|
virtual size_t ApproximateMemoryUsage() const = 0;
|
2019-05-30 11:49:36 -07:00
|
|
|
// Cache the dependencies of the index reader (e.g. the partitions
|
|
|
|
// of a partitioned index).
|
2020-08-25 18:59:19 -07:00
|
|
|
virtual Status CacheDependencies(const ReadOptions& /*ro*/,
|
|
|
|
bool /* pin */) {
|
|
|
|
return Status::OK();
|
|
|
|
}
|
2017-03-03 18:09:43 -08:00
|
|
|
};
|
2014-02-19 15:38:57 -08:00
|
|
|
|
2019-05-30 11:49:36 -07:00
|
|
|
class IndexReaderCommon;
|
|
|
|
|
New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).
The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.
This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)
The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.
Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.
Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.
### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)
### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime
We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```
These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
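The correction factor follows directly from the bit counts in the paragraph above. A minimal sketch of that arithmetic, assuming a hypothetical helper named `SimulationCorrectionFactor` that is not part of cache_bench:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of cache_bench): how many times more
// collision-prone the simulation is than reality when only `keep_bits` of a
// `key_bits`-bit cache key identify a file and each file is assumed to cover
// 2**offset_bits contiguous keys.
inline double SimulationCorrectionFactor(int key_bits, int keep_bits,
                                         int offset_bits) {
  // 2**(key_bits - keep_bits - offset_bits), computed exactly in a double.
  return std::ldexp(1.0, key_bits - keep_bits - offset_bits);
}
```

With the defaults quoted here, `SimulationCorrectionFactor(128, 40, 25)` is 2\*\*63, which is the 9.22337e+18 multiplier printed in the output above.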
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day
After enough data, we get a result at the end:
```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```
If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```
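The "corrected" figures in these outputs appear to be the simulated days between collisions scaled by 2\*\*(128 - keep_bits - 25). A small sketch of that reading, where `CorrectedDaysBetween` is a hypothetical name and the 25 offset bits are the pessimistic file-size assumption stated earlier:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of cache_bench): estimate the real-world days
// between collisions from a simulation that kept only `keep_bits` of the
// 128-bit key and treated each file as 2**25 contiguous keys.
inline double CorrectedDaysBetween(int collisions, double simulated_days,
                                   int keep_bits) {
  // Average simulated days per observed collision.
  const double est_days_between = simulated_days / collisions;
  // Scale by the number of key bits the simulation dropped.
  return est_days_between * std::ldexp(1.0, 128 - keep_bits - 25);
}
```

For example, `CorrectedDaysBetween(17, 2 * 90.0, 40)` gives roughly 9.766e+19, in line with the 40-bit result, and the 41-bit and 42-bit runs land near 1.04e+20 and 1.09e+20 in the same way.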
The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```
I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
Reviewed By: zhichao-cao
Differential Revision: D33171746
Pulled By: pdillinger
fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-16 17:13:55 -08:00
|
|
|
// Maximum SST file size that uses standard CacheKey encoding scheme.
|
|
|
|
// See GetCacheKey to explain << 2. + 3 is permitted because it is trimmed
|
|
|
|
// off by >> 2 in GetCacheKey.
|
|
|
|
static constexpr uint64_t kMaxFileSizeStandardEncoding =
|
|
|
|
(OffsetableCacheKey::kMaxOffsetStandardEncoding << 2) + 3;
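The comment above says the `+ 3` is harmless because the low two bits are trimmed by `>> 2`. A minimal sketch of that bound, assuming a hypothetical 40-bit value for the maximum offset (the real `kMaxOffsetStandardEncoding` is defined in cache_key.h and is not shown here):

```cpp
#include <cassert>
#include <cstdint>

// Assumed stand-in for OffsetableCacheKey::kMaxOffsetStandardEncoding.
// The 40-bit width is chosen only for illustration.
constexpr uint64_t kHypotheticalMaxOffset = (uint64_t{1} << 40) - 1;

// Same shape as kMaxFileSizeStandardEncoding above.
constexpr uint64_t kHypotheticalMaxFileSize =
    (kHypotheticalMaxOffset << 2) + 3;

// Per the comment above, the low two bits of a block offset are dropped with
// >> 2 before the offset is folded into the cache key.
constexpr uint64_t TrimmedOffset(uint64_t block_offset) {
  return block_offset >> 2;
}

// The largest permitted offset still trims to the maximum encodable value...
static_assert(TrimmedOffset(kHypotheticalMaxFileSize) == kHypotheticalMaxOffset,
              "+ 3 is absorbed by >> 2");
// ...while one byte more would overflow the standard encoding.
static_assert(TrimmedOffset(kHypotheticalMaxFileSize + 1) >
                  kHypotheticalMaxOffset,
              "one more byte exceeds the standard encoding");
```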
|
|
|
|
|
|
|
|
static void SetupBaseCacheKey(const TableProperties* properties,
|
|
|
|
const std::string& cur_db_session_id,
|
|
|
|
uint64_t cur_file_number, uint64_t file_size,
|
|
|
|
OffsetableCacheKey* out_base_cache_key,
|
|
|
|
bool* out_is_stable = nullptr);
|
|
|
|
|
|
|
|
static CacheKey GetCacheKey(const OffsetableCacheKey& base_cache_key,
|
|
|
|
const BlockHandle& handle);
|
2015-12-15 18:20:10 -08:00
|
|
|
|
2021-06-17 21:55:42 -07:00
|
|
|
static void UpdateCacheInsertionMetrics(BlockType block_type,
|
|
|
|
GetContext* get_context, size_t usage,
|
|
|
|
bool redundant,
|
|
|
|
Statistics* const statistics);
|
|
|
|
|
Improve / clean up meta block code & integrity (#9163)
Summary:
* Checksums are now checked on meta blocks unless specifically
suppressed or not applicable (e.g. plain table). (Was other way around.)
This means a number of cases that were not checking checksums now are,
including direct read TableProperties in Version::GetTableProperties
(fixed in meta_blocks ReadTableProperties), reading any block from
PersistentCache (fixed in BlockFetcher), read TableProperties in
SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open,
maybe more.
* For that to work, I moved the global_seqno+TableProperties checksum
logic to the shared table/ code, because that is used by many utilities
such as SstFileDumper.
* Also for that to work, we have to know when we're dealing with a block
that has a checksum (trailer), so added that capability to Footer based
on magic number, and from there BlockFetcher.
* Knowledge of trailer presence has also fixed a problem where other
table formats were reading blocks including bytes for a non-existent
trailer--and awkwardly kind-of not using them, e.g. no shared code
checking checksums. (BlockFetcher compression type was populated
incorrectly.) Now we only read what is needed.
* Minimized code duplication and differing/incompatible/awkward
abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block
without parsing block handle)
* Moved some meta block handling code from table_properties*.*
* Moved some code specific to block-based table from shared table/ code
to BlockBasedTable class. The checksum stuff means we can't completely
separate it, but things that don't need to be in shared table/ code
should not be.
* Use unique_ptr rather than raw ptr in more places. (Note: you can
std::move from unique_ptr to shared_ptr.)
Without enhancements to GetPropertiesOfAllTablesTest (see below),
net reduction of roughly 100 lines of code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163
Test Plan:
existing tests and
* Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that
checksums are now checked on direct read of table properties by TableCache
(new test would fail before this change)
* Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test
putting table properties under old meta name
* Also generally enhanced that same test to actually test what it was
supposed to be testing already, by kicking things out of table cache when
we don't want them there.
Reviewed By: ajkr, mrambacher
Differential Revision: D32514757
Pulled By: pdillinger
fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9
2021-11-18 11:42:12 -08:00
|
|
|
// Get the size to read from storage for a BlockHandle. size_t because we
|
|
|
|
// are about to load into memory.
|
|
|
|
static inline size_t BlockSizeWithTrailer(const BlockHandle& handle) {
|
|
|
|
return static_cast<size_t>(handle.size() + kBlockTrailerSize);
|
|
|
|
}
|
|
|
|
|
|
|
|
// It's the caller's responsibility to make sure that this is
|
|
|
|
// for raw block contents, which contain the compression
|
|
|
|
// byte at the end.
|
|
|
|
static inline CompressionType GetBlockCompressionType(const char* block_data,
|
|
|
|
size_t block_size) {
|
|
|
|
return static_cast<CompressionType>(block_data[block_size]);
|
|
|
|
}
|
|
|
|
static inline CompressionType GetBlockCompressionType(
|
|
|
|
const BlockContents& contents) {
|
|
|
|
assert(contents.is_raw_block);
|
|
|
|
return GetBlockCompressionType(contents.data.data(), contents.data.size());
|
|
|
|
}
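The three helpers above rely on the raw block layout: payload bytes followed by a trailer whose first byte is the compression type. A self-contained sketch of that layout, assuming a 5-byte trailer (1 compression byte plus a 4-byte checksum) and using hypothetical `Sketch*` stand-ins rather than the real RocksDB types:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: trailer = 1-byte compression type + 4-byte checksum.
constexpr size_t kSketchBlockTrailerSize = 5;

// Hypothetical stand-in for CompressionType; values are illustrative.
enum class SketchCompressionType : unsigned char { kNone = 0x0, kSnappy = 0x1 };

// Bytes to read from storage for a block: payload plus trailer.
inline size_t SketchBlockSizeWithTrailer(size_t payload_size) {
  return payload_size + kSketchBlockTrailerSize;
}

// The compression byte sits immediately after the payload, so indexing at
// payload_size is only valid when the buffer really includes the trailer.
inline SketchCompressionType SketchGetCompressionType(const char* block_data,
                                                      size_t payload_size) {
  return static_cast<SketchCompressionType>(
      static_cast<unsigned char>(block_data[payload_size]));
}

// Builds a raw block for demonstration: payload, type byte, dummy checksum.
inline std::string MakeRawBlock(const std::string& payload,
                                SketchCompressionType type) {
  std::string raw = payload;
  raw.push_back(static_cast<char>(type));
  raw.append(4, '\0');  // placeholder checksum bytes, not a real checksum
  return raw;
}
```

This is why the caller must guarantee raw block contents: passing a buffer without a trailer would read one byte past the payload.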
|
|
|
|
|
2016-08-01 14:50:19 -07:00
|
|
|
// Retrieve all key value pairs from data blocks in the table.
|
|
|
|
// The keys retrieved are internal keys.
|
|
|
|
Status GetKVPairsFromDataBlocks(std::vector<KVPairBlock>* kv_pair_blocks);
|
|
|
|
|
2018-02-12 16:57:56 -08:00
|
|
|
struct Rep;
|
|
|
|
|
|
|
|
Rep* get_rep() { return rep_; }
|
2019-05-31 11:37:21 -07:00
|
|
|
const Rep* get_rep() const { return rep_; }
|
2018-02-12 16:57:56 -08:00
|
|
|
|
|
|
|
// input_iter: if it is not null, update this one and return it as Iterator
|
2018-07-12 17:19:57 -07:00
|
|
|
template <typename TBlockIter>
|
2019-06-03 12:31:45 -07:00
|
|
|
TBlockIter* NewDataBlockIterator(
|
2019-06-06 11:28:54 -07:00
|
|
|
const ReadOptions& ro, const BlockHandle& block_handle,
|
Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:50:35 -07:00
|
|
|
TBlockIter* input_iter, BlockType block_type, GetContext* get_context,
|
2019-06-10 15:30:05 -07:00
|
|
|
BlockCacheLookupContext* lookup_context, Status s,
|
2019-06-19 14:07:36 -07:00
|
|
|
FilePrefetchBuffer* prefetch_buffer, bool for_compaction = false) const;
|
2018-02-12 16:57:56 -08:00
|
|
|
|
2019-06-30 20:52:34 -07:00
|
|
|
// input_iter: if it is not null, update this one and return it as Iterator
|
|
|
|
template <typename TBlockIter>
|
|
|
|
TBlockIter* NewDataBlockIterator(const ReadOptions& ro,
|
|
|
|
CachableEntry<Block>& block,
|
|
|
|
TBlockIter* input_iter, Status s) const;
|
|
|
|
|
2018-02-12 16:57:56 -08:00
|
|
|
class PartitionedIndexIteratorState;
|
2017-02-06 16:29:29 -08:00
|
|
|
|
Move the filter readers out of the block cache (#5504)
Summary:
Currently, when the block cache is used for the filter block, it is not
really the block itself that is stored in the cache but a FilterBlockReader
object. Since this object is not pure data (it has, for instance, pointers that
might dangle, including in one case a back pointer to the TableReader), it's not
really sharable. To avoid the issues around this, the current code erases the
cache entries when the TableReader is closed (which, BTW, is not sufficient
since a concurrent TableReader might have picked up the object in the meantime).
Instead of doing this, the patch moves the FilterBlockReader out of the cache
altogether, and decouples the filter reader object from the filter block.
In particular, instead of the TableReader owning, or caching/pinning the
FilterBlockReader (based on the customer's settings), with the change the
TableReader unconditionally owns the FilterBlockReader, which in turn
owns/caches/pins the filter block. This change also enables us to reuse the code
paths historically used for data blocks for filters as well.
Note:
Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
Test Plan: make asan_check
Differential Revision: D16036974
Pulled By: ltamasi
fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
2019-07-16 13:11:23 -07:00
|
|
|
template <typename TBlocklike>
|
|
|
|
friend class FilterBlockReaderCommon;
|
2019-07-23 15:57:43 -07:00
|
|
|
|
2017-03-22 09:11:23 -07:00
|
|
|
friend class PartitionIndexReader;
|
|
|
|
|
2019-07-23 15:57:43 -07:00
|
|
|
friend class UncompressionDictReader;
|
|
|
|
|
2017-03-22 09:11:23 -07:00
|
|
|
protected:
|
2011-03-18 22:37:00 +00:00
|
|
|
Rep* rep_;
|
2019-06-13 15:39:52 -07:00
|
|
|
explicit BlockBasedTable(Rep* rep, BlockCacheTracer* const block_cache_tracer)
|
|
|
|
: rep_(rep), block_cache_tracer_(block_cache_tracer) {}
|
2019-09-11 18:07:12 -07:00
|
|
|
// No copying allowed
|
|
|
|
explicit BlockBasedTable(const TableReader&) = delete;
|
|
|
|
void operator=(const TableReader&) = delete;
|
2017-03-22 09:11:23 -07:00
|
|
|
|
|
|
|
private:
|
2017-08-23 07:48:54 -07:00
|
|
|
friend class MockedBlockBasedTable;
|
2020-06-05 11:06:26 -07:00
|
|
|
friend class BlockBasedTableReaderTestVerifyChecksum_ChecksumMismatch_Test;
|
2019-06-13 15:39:52 -07:00
|
|
|
BlockCacheTracer* const block_cache_tracer_;
|
2018-02-07 15:42:35 -08:00
|
|
|
|
2019-06-06 11:28:54 -07:00
|
|
|
void UpdateCacheHitMetrics(BlockType block_type, GetContext* get_context,
|
|
|
|
size_t usage) const;
|
|
|
|
void UpdateCacheMissMetrics(BlockType block_type,
|
|
|
|
GetContext* get_context) const;
|
2021-06-17 21:55:42 -07:00
|
|
|
|
2021-10-19 15:53:16 -07:00
|
|
|
Cache::Handle* GetEntryFromCache(const CacheTier& cache_tier,
|
|
|
|
Cache* block_cache, const Slice& key,
|
2021-06-18 09:35:03 -07:00
|
|
|
BlockType block_type, const bool wait,
|
2021-05-21 18:28:28 -07:00
|
|
|
GetContext* get_context,
|
|
|
|
const Cache::CacheItemHelper* cache_helper,
|
|
|
|
const Cache::CreateCallback& create_cb,
|
|
|
|
Cache::Priority priority) const;
|
2019-06-03 12:31:45 -07:00
|
|
|
|
2021-10-19 15:53:16 -07:00
|
|
|
template <typename TBlocklike>
|
|
|
|
Status InsertEntryToCache(const CacheTier& cache_tier, Cache* block_cache,
|
|
|
|
const Slice& key,
|
|
|
|
const Cache::CacheItemHelper* cache_helper,
|
|
|
|
std::unique_ptr<TBlocklike>& block_holder,
|
|
|
|
size_t charge, Cache::Handle** cache_handle,
|
|
|
|
Cache::Priority priority) const;
|
|
|
|
|
2019-06-24 20:50:35 -07:00
|
|
|
// Either Block::NewDataIterator() or Block::NewIndexIterator().
|
|
|
|
template <typename TBlockIter>
|
|
|
|
static TBlockIter* InitBlockIterator(const Rep* rep, Block* block,
|
2020-02-25 15:29:17 -08:00
|
|
|
BlockType block_type,
|
2019-06-24 20:50:35 -07:00
|
|
|
TBlockIter* input_iter,
|
|
|
|
bool block_contents_pinned);
|
|
|
|
|
2016-11-05 09:10:51 -07:00
|
|
|
// If block cache enabled (compressed or uncompressed), looks for the block
|
|
|
|
// identified by handle in (1) uncompressed cache, (2) compressed cache, and
|
|
|
|
// then (3) file. If found, inserts into the cache(s) that were searched
|
|
|
|
// unsuccessfully (e.g., if found in file, will add to both uncompressed and
|
|
|
|
// compressed caches if they're enabled).
|
|
|
|
//
|
|
|
|
// @param block_entry value is set to the uncompressed block if found. If
|
|
|
|
// in uncompressed block cache, also sets cache_handle to reference that
|
|
|
|
// block.
|
2019-07-16 13:11:23 -07:00
|
|
|
template <typename TBlocklike>
|
2019-06-03 12:31:45 -07:00
|
|
|
Status MaybeReadBlockAndLoadToCache(
|
|
|
|
FilePrefetchBuffer* prefetch_buffer, const ReadOptions& ro,
|
|
|
|
const BlockHandle& handle, const UncompressionDict& uncompression_dict,
|
2021-06-18 09:35:03 -07:00
|
|
|
const bool wait, CachableEntry<TBlocklike>* block_entry,
|
|
|
|
BlockType block_type, GetContext* get_context,
|
|
|
|
BlockCacheLookupContext* lookup_context, BlockContents* contents) const;
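The lookup order described in the comment above (uncompressed cache, then compressed cache, then file, filling every cache that missed) can be sketched with plain maps. Everything here is a hypothetical stand-in: `SketchTiers`, `SketchLookup`, and the toy "compression" are illustrative only and do not reflect the real Cache API.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

using SketchStore = std::unordered_map<std::string, std::string>;

// Hypothetical stand-ins for the two block caches and the SST file.
struct SketchTiers {
  bool uncompressed_enabled = true;
  bool compressed_enabled = true;
  SketchStore uncompressed_cache;  // key -> uncompressed block
  SketchStore compressed_cache;    // key -> compressed block
  SketchStore file;                // key -> compressed block on storage
};

// Toy compression: a two-character marker prefix.
inline std::string SketchCompress(const std::string& s) { return "z:" + s; }
inline std::string SketchUncompress(const std::string& s) { return s.substr(2); }

// Returns true and sets *out to the uncompressed block if it was found.
inline bool SketchLookup(SketchTiers* t, const std::string& key,
                         std::string* out) {
  // With no cache enabled this path does nothing.
  if (!t->uncompressed_enabled && !t->compressed_enabled) return false;
  // (1) Uncompressed cache.
  if (t->uncompressed_enabled) {
    auto it = t->uncompressed_cache.find(key);
    if (it != t->uncompressed_cache.end()) {
      *out = it->second;
      return true;
    }
  }
  // (2) Compressed cache.
  std::string compressed;
  bool found = false;
  if (t->compressed_enabled) {
    auto it = t->compressed_cache.find(key);
    if (it != t->compressed_cache.end()) {
      compressed = it->second;
      found = true;
    }
  }
  // (3) File; on success, fill the compressed cache that missed.
  if (!found) {
    auto it = t->file.find(key);
    if (it == t->file.end()) return false;
    compressed = it->second;
    if (t->compressed_enabled) t->compressed_cache[key] = compressed;
  }
  // Fill the uncompressed cache that missed in step (1).
  *out = SketchUncompress(compressed);
  if (t->uncompressed_enabled) t->uncompressed_cache[key] = *out;
  return true;
}
```

After one lookup that reaches the file, both caches hold the block, so a second lookup succeeds even if the file entry is gone.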
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2019-05-30 11:49:36 -07:00
|
|
|
// Similar to the above, with one crucial difference: it will retrieve the
|
|
|
|
// block from the file even if there are no caches configured (assuming the
|
|
|
|
// read options allow I/O).
|
2019-07-16 13:11:23 -07:00
|
|
|
template <typename TBlocklike>
|
2019-06-03 12:31:45 -07:00
|
|
|
Status RetrieveBlock(FilePrefetchBuffer* prefetch_buffer,
|
|
|
|
const ReadOptions& ro, const BlockHandle& handle,
|
|
|
|
const UncompressionDict& uncompression_dict,
|
2019-07-16 13:11:23 -07:00
|
|
|
CachableEntry<TBlocklike>* block_entry,
|
|
|
|
BlockType block_type, GetContext* get_context,
|
2019-06-19 14:07:36 -07:00
|
|
|
BlockCacheLookupContext* lookup_context,
|
2021-06-18 09:35:03 -07:00
|
|
|
bool for_compaction, bool use_cache,
|
|
|
|
bool wait_for_cache) const;
|
2019-05-30 11:49:36 -07:00
|
|
|
|
2019-08-22 08:47:36 -07:00
|
|
|
void RetrieveMultipleBlocks(
|
2019-06-30 20:52:34 -07:00
|
|
|
const ReadOptions& options, const MultiGetRange* batch,
|
2019-08-23 08:25:52 -07:00
|
|
|
const autovector<BlockHandle, MultiGetContext::MAX_BATCH_SIZE>* handles,
|
2019-06-30 20:52:34 -07:00
|
|
|
autovector<Status, MultiGetContext::MAX_BATCH_SIZE>* statuses,
|
2019-08-23 08:25:52 -07:00
|
|
|
autovector<CachableEntry<Block>, MultiGetContext::MAX_BATCH_SIZE>*
|
|
|
|
results,
|
2019-06-30 20:52:34 -07:00
|
|
|
char* scratch, const UncompressionDict& uncompression_dict) const;
|
|
|
|
|
2014-02-28 18:19:07 -08:00
|
|
|
// Get the iterator from the index reader.
|
2019-07-23 15:30:59 -07:00
|
|
|
//
|
|
|
|
// If input_iter is not set, return a new Iterator.
|
|
|
|
// If input_iter is set, try to update it and return it as Iterator.
|
|
|
|
// However note that in some cases the returned iterator may be different
|
|
|
|
// from input_iter. In such a case the returned iterator should be freed.
|
2013-11-12 22:46:51 -08:00
|
|
|
//
|
2014-02-28 18:19:07 -08:00
|
|
|
// Note: ErrorIterator with Status::Incomplete shall be returned if all the
|
|
|
|
// following conditions are met:
|
|
|
|
// 1. We enabled table_options.cache_index_and_filter_blocks.
|
|
|
|
// 2. index is not present in block cache.
|
|
|
|
// 3. We disallowed any io to be performed, that is, read_options.read_tier ==
|
|
|
|
// kBlockCacheTier
|
Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
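The deferral described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `LazyBlockEntry` and its `load_block` callback are hypothetical names, and the "block" is just a string. The point it shows is that `key()` can be answered from the index alone, while the block read only happens on the first `value()` call.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for one index entry: the first key of a data block
// plus a callback that performs the (expensive) block read.
class LazyBlockEntry {
 public:
  LazyBlockEntry(std::string first_key, std::function<std::string()> load_block)
      : first_key_(std::move(first_key)), load_block_(std::move(load_block)) {}

  // Answered from the index alone; no block read is needed.
  const std::string& key() const { return first_key_; }

  // Triggers the deferred read the first time it is called, then reuses it.
  const std::string& value() {
    if (!loaded_) {
      block_ = load_block_();
      loaded_ = true;
    }
    return block_;
  }

 private:
  std::string first_key_;
  std::function<std::string()> load_block_;
  bool loaded_ = false;
  std::string block_;
};
```

A merging iterator could order such entries by `key()` and would only pay for a read on the entries that actually reach the top of its heap.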
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:50:35 -07:00
|
|
|
InternalIteratorBase<IndexValue>* NewIndexIterator(
|
2019-06-10 15:30:05 -07:00
|
|
|
const ReadOptions& read_options, bool need_upper_bound_check,
|
|
|
|
IndexBlockIter* input_iter, GetContext* get_context,
|
|
|
|
BlockCacheLookupContext* lookup_context) const;
|
2014-02-28 18:19:07 -08:00
|
|
|
|
|
|
|
// Read a block from the block caches (if set): block_cache and
|
|
|
|
// block_cache_compressed.
|
|
|
|
// On success, Status::OK will be returned and @block will be populated with
|
|
|
|
// pointer to the block as well as its block handle.
|
2019-01-23 18:11:08 -08:00
|
|
|
// @param uncompression_dict Data for presetting the compression library's
|
Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
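The sampling plan described above (pick intervals assuming the maximum file length, then skip any that fall past the real end-of-file) could look roughly like the sketch below. The helper names, the fixed sample length, and the use of `std::mt19937_64` are assumptions made for illustration; they are not taken from this change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: choose num_samples start offsets up front, assuming
// the output file will reach max_file_size bytes.
std::vector<std::uint64_t> PlanSampleOffsets(std::uint64_t max_file_size,
                                             std::uint64_t sample_len,
                                             std::size_t num_samples,
                                             std::uint32_t seed) {
  assert(sample_len <= max_file_size);
  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<std::uint64_t> dist(
      0, max_file_size - sample_len);
  std::vector<std::uint64_t> offsets(num_samples);
  for (std::uint64_t& off : offsets) {
    off = dist(rng);
  }
  return offsets;
}

// Hypothetical helper: keep only the samples that fit entirely inside the
// file that was actually written; the rest are skipped, so the resulting
// dictionary ends up a bit smaller.
std::vector<std::uint64_t> KeepSamplesWithinFile(
    const std::vector<std::uint64_t>& planned, std::uint64_t sample_len,
    std::uint64_t actual_file_size) {
  std::vector<std::uint64_t> kept;
  for (std::uint64_t off : planned) {
    if (off + sample_len <= actual_file_size) {
      kept.push_back(off);
    }
  }
  return kept;
}
```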
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
2016-04-27 17:36:03 -07:00
|
|
|
// dictionary.
|
Move the filter readers out of the block cache (#5504)
Summary:
Currently, when the block cache is used for the filter block, it is not
really the block itself that is stored in the cache but a FilterBlockReader
object. Since this object is not pure data (it has, for instance, pointers that
might dangle, including in one case a back pointer to the TableReader), it's not
really sharable. To avoid the issues around this, the current code erases the
cache entries when the TableReader is closed (which, BTW, is not sufficient
since a concurrent TableReader might have picked up the object in the meantime).
Instead of doing this, the patch moves the FilterBlockReader out of the cache
altogether, and decouples the filter reader object from the filter block.
In particular, instead of the TableReader owning, or caching/pinning the
FilterBlockReader (based on the customer's settings), with the change the
TableReader unconditionally owns the FilterBlockReader, which in turn
owns/caches/pins the filter block. This change also enables us to reuse the code
paths historically used for data blocks for filters as well.
Note:
Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
Test Plan: make asan_check
Differential Revision: D16036974
Pulled By: ltamasi
fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
2019-07-16 13:11:23 -07:00
|
|
|
template <typename TBlocklike>
|
New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).
The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.
This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)
The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.
Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.
Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.
### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)
### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime
We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```
These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day
After enough data, we get a result at the end:
```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```
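The arithmetic behind the "corrected" figure can be reproduced with a few lines. This is only a sketch of the calculation quoted above, not code from `cache_bench`; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Factor by which the truncated simulation over-reports collisions:
// 2^(key_bits - keep_bits - file_key_bits), e.g. 2^(128-40-25) = 2^63.
double CorrectionFactor(int key_bits, int keep_bits, int file_key_bits) {
  return std::ldexp(1.0, key_bits - keep_bits - file_key_bits);
}

// Average number of simulated days between observed collisions.
double SimulatedDaysBetween(double simulated_days, int collisions) {
  return simulated_days / collisions;
}

// Simulated estimate scaled back up to the full 128-bit key.
double CorrectedDaysBetween(double simulated_days, int collisions,
                            int key_bits, int keep_bits, int file_key_bits) {
  return SimulatedDaysBetween(simulated_days, collisions) *
         CorrectionFactor(key_bits, keep_bits, file_key_bits);
}
```

With the numbers above (17 collisions over 2 x 90 days, 40 bits kept, 2\*\*25 keys per file), `SimulatedDaysBetween(180, 17)` is about 10.5882 and `CorrectedDaysBetween(180, 17, 128, 40, 25)` is about 9.76592e+19, matching the reported output.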
If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```
The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```
I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
Reviewed By: zhichao-cao
Differential Revision: D33171746
Pulled By: pdillinger
fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-16 17:13:55 -08:00
|
|
|
Status GetDataBlockFromCache(const Slice& cache_key, Cache* block_cache,
|
|
|
|
Cache* block_cache_compressed,
|
|
|
|
const ReadOptions& read_options,
|
|
|
|
CachableEntry<TBlocklike>* block,
|
|
|
|
const UncompressionDict& uncompression_dict,
|
|
|
|
BlockType block_type, const bool wait,
|
|
|
|
GetContext* get_context) const;
|
2016-04-27 17:36:03 -07:00
|
|
|
|
2014-02-28 18:19:07 -08:00
|
|
|
// Put a raw block (maybe compressed) to the corresponding block caches.
|
|
|
|
// This method will perform decompression against raw_block if needed and then
|
|
|
|
// populate the block caches.
|
|
|
|
// On success, Status::OK will be returned; also @block will be populated with
|
|
|
|
// uncompressed block and its cache handle.
|
2013-11-12 22:46:51 -08:00
|
|
|
//
|
2018-11-13 17:00:49 -08:00
|
|
|
// Allocated memory managed by raw_block_contents will be transferred to
|
|
|
|
// PutDataBlockToCache(). After the call, the object will be invalid.
|
2019-01-23 18:11:08 -08:00
|
|
|
// @param uncompression_dict Data for presetting the compression library's
|
2016-04-27 17:36:03 -07:00
|
|
|
// dictionary.
|
2019-07-16 13:11:23 -07:00
|
|
|
template <typename TBlocklike>
|
2021-12-16 17:13:55 -08:00
|
|
|
Status PutDataBlockToCache(const Slice& cache_key, Cache* block_cache,
|
|
|
|
Cache* block_cache_compressed,
|
2020-02-25 15:29:17 -08:00
|
|
|
CachableEntry<TBlocklike>* cached_block,
|
|
|
|
BlockContents* raw_block_contents,
|
|
|
|
CompressionType raw_block_comp_type,
|
|
|
|
const UncompressionDict& uncompression_dict,
|
|
|
|
MemoryAllocator* memory_allocator,
|
|
|
|
BlockType block_type,
|
|
|
|
GetContext* get_context) const;
|
2013-11-12 22:46:51 -08:00
|
|
|
|
2013-03-21 15:59:47 -07:00
|
|
|
// Calls (*handle_result)(arg, ...) repeatedly, starting with the entry found
|
|
|
|
// after a call to Seek(key), until handle_result returns false.
|
|
|
|
// May not make such a call if filter policy says that key is not present.
|
2012-04-17 08:36:46 -07:00
|
|
|
friend class TableCache;
|
2013-09-01 23:23:40 -07:00
|
|
|
friend class BlockBasedTableBuilder;
|
2012-04-17 08:36:46 -07:00
|
|
|
|
2014-05-15 14:09:03 -07:00
|
|
|
// Create an index reader based on the index type stored in the table.
|
|
|
|
// Optionally, user can pass a preloaded meta_index_iter for the index that
|
|
|
|
// needs to access extra meta blocks for index construction. This parameter
|
|
|
|
// helps avoid re-reading meta index block if caller already created one.
|
2020-06-29 14:51:57 -07:00
|
|
|
Status CreateIndexReader(const ReadOptions& ro,
|
|
|
|
FilePrefetchBuffer* prefetch_buffer,
|
2019-05-30 17:39:43 -07:00
|
|
|
InternalIterator* preloaded_meta_index_iter,
|
|
|
|
bool use_cache, bool prefetch, bool pin,
|
2019-07-16 13:11:23 -07:00
|
|
|
BlockCacheLookupContext* lookup_context,
|
|
|
|
std::unique_ptr<IndexReader>* index_reader);
|
2012-04-17 08:36:46 -07:00
|
|
|
|
2022-01-31 19:45:17 -08:00
|
|
|
bool FullFilterKeyMayMatch(FilterBlockReader* filter, const Slice& user_key,
|
2019-06-10 15:30:05 -07:00
|
|
|
const bool no_io,
|
|
|
|
const SliceTransform* prefix_extractor,
|
2019-07-16 13:11:23 -07:00
|
|
|
GetContext* get_context,
|
2019-06-10 15:30:05 -07:00
|
|
|
BlockCacheLookupContext* lookup_context) const;
|
2015-02-02 17:42:57 -08:00
|
|
|
|
2022-01-31 19:45:17 -08:00
|
|
|
void FullFilterKeysMayMatch(FilterBlockReader* filter, MultiGetRange* range,
|
2019-06-10 15:30:05 -07:00
|
|
|
const bool no_io,
|
|
|
|
const SliceTransform* prefix_extractor,
|
|
|
|
BlockCacheLookupContext* lookup_context) const;
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queried, as well as with larger batch sizes. The main benefits are due to:
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
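The "group keys by SST file, then look up each group in one call" idea can be illustrated with the sketch below. It is not the implementation from this PR: `FileRange` and `GroupKeysByFile` are hypothetical names, files are assumed to have non-overlapping key ranges (as within a single level), and a linear scan over the files is used purely for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one SST file's key range (inclusive bounds).
struct FileRange {
  std::string smallest;
  std::string largest;
};

// Groups the requested keys by the index of the file whose range contains
// them. Keys that fall in no file are dropped. Each resulting vector is one
// batch that could be handed to a per-file lookup in a single call.
std::map<std::size_t, std::vector<std::string>> GroupKeysByFile(
    const std::vector<FileRange>& files,
    const std::vector<std::string>& keys) {
  std::map<std::size_t, std::vector<std::string>> batches;
  for (const std::string& key : keys) {
    for (std::size_t i = 0; i < files.size(); ++i) {
      if (key >= files[i].smallest && key <= files[i].largest) {
        batches[i].push_back(key);
        break;  // ranges are assumed not to overlap
      }
    }
  }
  return batches;
}
```

The closer the requested keys are to each other, the fewer batches this produces, which is where the reduction in per-key function calls comes from.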
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Results in micros/op. Columns are batch sizes: 1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753 - Get
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781 - MultiGet (no batching)
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135 - MultiGet (w/ batching)
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232 - Get
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268 - MultiGet (no batching)
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62 - MultiGet (w/ batching)
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555 - Get
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465 - MultiGet (no batching)
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891 - MultiGet (w/ batching)
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:24:09 -07:00
|
|
|
|
2020-05-12 18:21:32 -07:00
|
|
|
// If force_direct_prefetch is true, always prefetch into the RocksDB
|
|
|
|
// buffer, rather than calling RandomAccessFile::Prefetch().
|
2018-12-07 13:15:09 -08:00
|
|
|
static Status PrefetchTail(
|
2020-06-29 14:51:57 -07:00
|
|
|
const ReadOptions& ro, RandomAccessFileReader* file, uint64_t file_size,
|
2020-05-12 18:21:32 -07:00
|
|
|
bool force_direct_prefetch, TailPrefetchStats* tail_prefetch_stats,
|
|
|
|
const bool prefetch_all, const bool preload_all,
|
2018-12-07 13:15:09 -08:00
|
|
|
std::unique_ptr<FilePrefetchBuffer>* prefetch_buffer);
|
2020-06-29 14:51:57 -07:00
|
|
|
Status ReadMetaIndexBlock(const ReadOptions& ro,
|
|
|
|
FilePrefetchBuffer* prefetch_buffer,
|
2019-11-05 17:17:36 -08:00
|
|
|
std::unique_ptr<Block>* metaindex_block,
|
|
|
|
std::unique_ptr<InternalIterator>* iter);
|
2020-06-29 14:51:57 -07:00
|
|
|
Status ReadPropertiesBlock(const ReadOptions& ro,
|
|
|
|
FilePrefetchBuffer* prefetch_buffer,
|
2019-06-03 12:31:45 -07:00
|
|
|
InternalIterator* meta_iter,
|
|
|
|
const SequenceNumber largest_seqno);
|
2020-06-29 14:51:57 -07:00
|
|
|
Status ReadRangeDelBlock(const ReadOptions& ro,
|
|
|
|
FilePrefetchBuffer* prefetch_buffer,
|
2019-06-03 12:31:45 -07:00
|
|
|
InternalIterator* meta_iter,
|
2019-06-10 15:30:05 -07:00
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
BlockCacheLookupContext* lookup_context);
|
2019-06-03 12:31:45 -07:00
|
|
|
Status PrefetchIndexAndFilterBlocks(
|
2020-06-29 14:51:57 -07:00
|
|
|
const ReadOptions& ro, FilePrefetchBuffer* prefetch_buffer,
|
|
|
|
InternalIterator* meta_iter, BlockBasedTable* new_table,
|
|
|
|
bool prefetch_all, const BlockBasedTableOptions& table_options,
|
|
|
|
const int level, size_t file_size, size_t max_file_size_for_l0_meta_pin,
|
2019-06-10 15:30:05 -07:00
|
|
|
BlockCacheLookupContext* lookup_context);
|
2013-11-12 22:46:51 -08:00
|
|
|
|
2019-06-18 19:00:03 -07:00
|
|
|
static BlockType GetBlockTypeForMetaBlockByName(const Slice& meta_block_name);
|
|
|
|
|
2019-03-26 10:15:43 -07:00
|
|
|
Status VerifyChecksumInMetaBlocks(InternalIteratorBase<Slice>* index_iter);
|
2019-08-16 16:40:09 -07:00
|
|
|
Status VerifyChecksumInBlocks(const ReadOptions& read_options,
|
|
|
|
InternalIteratorBase<IndexValue>* index_iter);
|
2017-08-09 15:49:40 -07:00
|
|
|
|
2013-11-12 22:46:51 -08:00
|
|
|
// Create the filter from the filter block.
|
Move the filter readers out of the block cache (#5504)
Summary:
Currently, when the block cache is used for the filter block, it is not
really the block itself that is stored in the cache but a FilterBlockReader
object. Since this object is not pure data (it has, for instance, pointers that
might dangle, including in one case a back pointer to the TableReader), it's not
really sharable. To avoid the issues around this, the current code erases the
cache entries when the TableReader is closed (which, BTW, is not sufficient
since a concurrent TableReader might have picked up the object in the meantime).
Instead of doing this, the patch moves the FilterBlockReader out of the cache
altogether, and decouples the filter reader object from the filter block.
In particular, instead of the TableReader owning, or caching/pinning the
FilterBlockReader (based on the customer's settings), with the change the
TableReader unconditionally owns the FilterBlockReader, which in turn
owns/caches/pins the filter block. This change also enables us to reuse the code
paths historically used for data blocks for filters as well.
Note:
Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
Test Plan: make asan_check
Differential Revision: D16036974
Pulled By: ltamasi
fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
2019-07-16 13:11:23 -07:00
|
|
|
std::unique_ptr<FilterBlockReader> CreateFilterBlockReader(
|
2020-06-29 14:51:57 -07:00
|
|
|
const ReadOptions& ro, FilePrefetchBuffer* prefetch_buffer,
|
|
|
|
bool use_cache, bool prefetch, bool pin,
|
|
|
|
BlockCacheLookupContext* lookup_context);
|
2013-11-12 22:46:51 -08:00
|
|
|
|
For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes not. This was at its worst with a large portion
of the table file used by filters and a query for a small range that crossed
a table boundary: the size estimate would include the large filter size.
It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function ApproximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.
So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.
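As a rough illustration of the pro-rating rule described above, the sketch below scales the data-block bytes covered by a key range up to the whole file. `ProRatedSize` and its parameter names are hypothetical, chosen for this example only; this is not the RocksDB implementation, and it uses floating point for simplicity (exact only while sizes stay below 2^53 bytes).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: if the range covers X% (by size) of all data blocks,
// report X% of the total file size, so that metadata such as filters and
// indexes is attributed proportionally.
uint64_t ProRatedSize(uint64_t data_bytes_in_range, uint64_t total_data_bytes,
                      uint64_t file_size) {
  // Assumption for this sketch: the data-block section fits inside the file.
  assert(total_data_bytes <= file_size);
  if (total_data_bytes == 0) {
    return 0;  // No data blocks, so nothing to attribute.
  }
  if (data_bytes_in_range >= total_data_bytes) {
    return file_size;  // Range covers every data block.
  }
  const double fraction = static_cast<double>(data_bytes_in_range) /
                          static_cast<double>(total_data_bytes);
  return static_cast<uint64_t>(static_cast<double>(file_size) * fraction);
}
```

For instance, a range covering 250 of 1000 data-block bytes in a 4000-byte file would be reported as 1000 bytes, even though only 250 of those bytes are data blocks.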
Also includes miscellaneous comment improvements / clarifications.
Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784
Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...
[ RUN ] DBTest.ApproximateSizesFilesWithErrorMargin
db/db_test.cc:1562: Failure
Expected: (size) <= (11 * 100), actual: 9478 vs 1100
Other tests updated to reflect consistent accounting of metadata.
Reviewed By: siying
Differential Revision: D21334706
Pulled By: pdillinger
fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
2020-06-02 12:27:59 -07:00
|
|
|
// Size of all data blocks, maybe approximate
|
|
|
|
uint64_t GetApproximateDataSize();
|
|
|
|
|
|
|
|
// Given an iterator, return its offset in the data block section of the file.
|
|
|
|
uint64_t ApproximateDataOffsetOf(
|
|
|
|
const InternalIteratorBase<IndexValue>& index_iter,
|
|
|
|
uint64_t data_size) const;
|
2019-08-16 14:16:49 -07:00
|
|
|
|
2014-12-23 13:24:07 -08:00
|
|
|
// Helper functions for DumpTable()
|
2020-09-04 19:25:20 -07:00
|
|
|
Status DumpIndexBlock(std::ostream& out_stream);
|
|
|
|
Status DumpDataBlocks(std::ostream& out_stream);
|
2016-11-12 09:23:05 -08:00
|
|
|
void DumpKeyValue(const Slice& key, const Slice& value,
|
2020-09-04 19:25:20 -07:00
|
|
|
std::ostream& out_stream);
|
2014-12-23 13:24:07 -08:00
|
|
|
|
2022-01-31 19:45:17 -08:00
|
|
|
// Returns false if prefix_extractor exists and is compatible with that used
|
|
|
|
// in building the table file, otherwise true.
|
2022-01-21 11:36:36 -08:00
|
|
|
bool PrefixExtractorChanged(const SliceTransform* prefix_extractor) const;
|
|
|
|
|
2019-11-11 16:57:49 -08:00
|
|
|
// A cumulative data block file read in MultiGet lower than this size will
|
|
|
|
// use a stack buffer
|
|
|
|
static constexpr size_t kMultiGetReadStackBufSize = 8192;
|
|
|
|
|
2017-03-22 09:11:23 -07:00
|
|
|
friend class PartitionedFilterBlockReader;
|
|
|
|
friend class PartitionedFilterBlockTest;
|
2019-11-11 16:57:49 -08:00
|
|
|
friend class DBBasicTest_MultiGetIOBufferOverrun_Test;
|
2013-10-30 10:52:33 -07:00
|
|
|
};
|
|
|
|
|
2020-04-23 12:26:56 -07:00
|
|
|
// Maintains the state of a two-level iteration on a partitioned index structure.
|
2018-02-12 16:57:56 -08:00
|
|
|
class BlockBasedTable::PartitionedIndexIteratorState
|
|
|
|
: public TwoLevelIteratorState {
|
2017-02-06 16:29:29 -08:00
|
|
|
public:
|
2018-02-12 16:57:56 -08:00
|
|
|
PartitionedIndexIteratorState(
|
2019-05-31 11:37:21 -07:00
|
|
|
const BlockBasedTable* table,
|
Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of the merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can sometimes reduce read amplification substantially.
Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by a log ID and a sequence number within the log. The RocksDB key is the concatenation of the log ID and the sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
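The deferral idea can be sketched in isolation as follows. This is a simplified model under stated assumptions, not RocksDB code: `LazyFileIter`, `ReadSmallest`, and `CountLoadedBlocks` are hypothetical names, a linear minimum search stands in for the merging iterator's heap, and "loading a block" is simulated by setting a flag inside `value()`.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-file iterator after Seek().
struct LazyFileIter {
  std::string first_key;      // First key of the target block, from the index.
  bool block_loaded = false;  // Whether the data block has been read.

  // The key is available from the index alone; no data block read needed.
  const std::string& key() const { return first_key; }

  // Simulates the deferred read: the block is loaded on first access.
  std::string value() {
    block_loaded = true;
    return "value-for-" + first_key;
  }
};

// Returns the value at the smallest key across all files. Only the winning
// file's block is loaded; the other files are compared by index key only.
std::string ReadSmallest(std::vector<LazyFileIter>& files) {
  auto top = std::min_element(
      files.begin(), files.end(),
      [](const LazyFileIter& a, const LazyFileIter& b) {
        return a.key() < b.key();
      });
  return top == files.end() ? std::string() : top->value();
}

// Counts how many files had to read a data block.
std::size_t CountLoadedBlocks(const std::vector<LazyFileIter>& files) {
  return static_cast<std::size_t>(
      std::count_if(files.begin(), files.end(),
                    [](const LazyFileIter& f) { return f.block_loaded; }));
}
```

With three files positioned at different first keys, `ReadSmallest` loads a block from exactly one of them, whereas reading a block per file in Seek() would have loaded three.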
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:50:35 -07:00
|
|
|
std::unordered_map<uint64_t, CachableEntry<Block>>* block_map);
|
|
|
|
InternalIteratorBase<IndexValue>* NewSecondaryIterator(
|
2018-08-09 16:49:45 -07:00
|
|
|
const BlockHandle& index_value) override;
|
2017-02-06 16:29:29 -08:00
|
|
|
|
|
|
|
private:
|
|
|
|
// Don't own table_
|
2019-05-31 11:37:21 -07:00
|
|
|
const BlockBasedTable* table_;
|
2017-08-23 07:48:54 -07:00
|
|
|
std::unordered_map<uint64_t, CachableEntry<Block>>* block_map_;
|
2017-02-06 16:29:29 -08:00
|
|
|
};
|
|
|
|
|
2019-05-24 12:26:58 -07:00
|
|
|
// Stores all the properties associated with a BlockBasedTable.
|
|
|
|
// These are immutable.
|
2017-03-03 18:09:43 -08:00
|
|
|
struct BlockBasedTable::Rep {
|
2021-05-05 13:59:21 -07:00
|
|
|
Rep(const ImmutableOptions& _ioptions, const EnvOptions& _env_options,
|
2017-03-03 18:09:43 -08:00
|
|
|
const BlockBasedTableOptions& _table_opt,
|
2018-06-27 17:09:29 -07:00
|
|
|
const InternalKeyComparator& _internal_comparator, bool skip_filters,
|
2020-06-02 12:27:59 -07:00
|
|
|
uint64_t _file_size, int _level, const bool _immortal_table)
|
2017-03-03 18:09:43 -08:00
|
|
|
: ioptions(_ioptions),
|
|
|
|
env_options(_env_options),
|
|
|
|
table_options(_table_opt),
|
|
|
|
filter_policy(skip_filters ? nullptr : _table_opt.filter_policy.get()),
|
|
|
|
internal_comparator(_internal_comparator),
|
|
|
|
filter_type(FilterType::kNoFilter),
|
2017-12-07 11:50:49 -08:00
|
|
|
index_type(BlockBasedTableOptions::IndexType::kBinarySearch),
|
2017-03-03 18:09:43 -08:00
|
|
|
whole_key_filtering(_table_opt.whole_key_filtering),
|
|
|
|
prefix_filtering(true),
|
2018-06-27 17:09:29 -07:00
|
|
|
global_seqno(kDisableGlobalSequenceNumber),
|
2020-06-02 12:27:59 -07:00
|
|
|
file_size(_file_size),
|
2018-10-24 12:10:59 -07:00
|
|
|
level(_level),
|
2021-05-05 13:59:21 -07:00
|
|
|
immortal_table(_immortal_table) {}
|
2020-08-20 19:16:56 -07:00
|
|
|
~Rep() { status.PermitUncheckedError(); }
|
2021-05-05 13:59:21 -07:00
|
|
|
const ImmutableOptions& ioptions;
|
2017-03-03 18:09:43 -08:00
|
|
|
const EnvOptions& env_options;
|
Fix segfault caused by object premature destruction
Summary:
Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609).
There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change.
To summarize the cause of the problem: upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, which runs the destructor of the `std::shared_ptr` and frees the underlying memory.
Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised.
The previous workaround was to purge all obsolete files upon `ColumnFamilyData` destruction, which forces the release of the table readers of the dropped column family. However, this does not work when the user disables file deletion.
Our solution in this PR is to make a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copying and usage, but is much simpler.
Test plan
```
$ make -j16
$ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
```
Expected behavior:
All tests should pass.
Closes https://github.com/facebook/rocksdb/pull/3898
Differential Revision: D8149421
Pulled By: riversand963
fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
2018-05-25 11:45:12 -07:00
|
|
|
const BlockBasedTableOptions table_options;
|
2017-03-03 18:09:43 -08:00
|
|
|
const FilterPolicy* const filter_policy;
|
|
|
|
const InternalKeyComparator& internal_comparator;
|
|
|
|
Status status;
|
2018-11-09 11:17:34 -08:00
|
|
|
std::unique_ptr<RandomAccessFileReader> file;
|
New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).
The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.
This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)
The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.
Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.
Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.
### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)
### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime
We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```
These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day
After enough data, we get a result at the end:
```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```
If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```
The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```
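The "est ... days between" and "corrected" figures in the outputs above can be reproduced with a short calculation. The sketch below is not part of cache_bench; `EstimateCollisions` and `CollisionEstimate` are hypothetical names, and the constants 128 (key bits) and 25 (bits for contiguous per-file keys) are taken from the description above.

```cpp
#include <cassert>
#include <cmath>

struct CollisionEstimate {
  double days_between;            // Simulated days per collision.
  double corrected_days_between;  // Scaled back to the full 128-bit key.
};

// Hypothetical helper reproducing the arithmetic of the simulation output:
// days_between = simulated_days / collisions, then multiplied by
// 2^(128 - keep_bits - 25) to correct for the bits the simulation dropped.
CollisionEstimate EstimateCollisions(int keep_bits, int collisions,
                                     int simulated_days) {
  assert(collisions > 0);
  const int kKeyBits = 128;        // Full cache key width.
  const int kFileOffsetBits = 25;  // 2**25 contiguous keys per file.
  const double days_between =
      static_cast<double>(simulated_days) / collisions;
  const double correction =
      std::ldexp(1.0, kKeyBits - keep_bits - kFileOffsetBits);
  return {days_between, days_between * correction};
}
```

For example, `EstimateCollisions(40, 17, 2 * 90)` gives about 10.5882 days between collisions and about 9.76592e+19 days after correction, matching the first result line.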
I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
Reviewed By: zhichao-cao
Differential Revision: D33171746
Pulled By: pdillinger
fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-16 17:13:55 -08:00
|
|
|
OffsetableCacheKey base_cache_key;
|
2017-03-03 18:09:43 -08:00
|
|
|
PersistentCacheOptions persistent_cache_options;
|
|
|
|
|
|
|
|
// Footer contains the fixed table information
|
|
|
|
Footer footer;
|
2019-07-16 13:11:23 -07:00
|
|
|
|
2018-11-09 11:17:34 -08:00
|
|
|
std::unique_ptr<IndexReader> index_reader;
|
|
|
|
std::unique_ptr<FilterBlockReader> filter;
|
2019-07-23 15:57:43 -07:00
|
|
|
std::unique_ptr<UncompressionDictReader> uncompression_dict_reader;
|
2017-03-03 18:09:43 -08:00
|
|
|
|
|
|
|
enum class FilterType {
|
|
|
|
kNoFilter,
|
|
|
|
kFullFilter,
|
|
|
|
kBlockFilter,
|
|
|
|
kPartitionedFilter,
|
|
|
|
};
|
|
|
|
FilterType filter_type;
|
|
|
|
BlockHandle filter_handle;
|
2019-01-23 18:11:08 -08:00
|
|
|
BlockHandle compression_dict_handle;
|
2017-03-03 18:09:43 -08:00
|
|
|
|
|
|
|
std::shared_ptr<const TableProperties> table_properties;
|
|
|
|
BlockBasedTableOptions::IndexType index_type;
|
|
|
|
bool whole_key_filtering;
|
|
|
|
bool prefix_filtering;
|
|
|
|
// TODO(kailiu) It is very ugly to use internal key in table, since table
|
|
|
|
// module should not be relying on db module. However to make things easier
|
|
|
|
// and compatible with existing code, we introduce a wrapper that allows
|
|
|
|
// block to extract prefix without knowing if a key is internal or not.
|
2020-01-27 15:41:57 -08:00
|
|
|
// null if no prefix_extractor is passed in when opening the table reader.
|
2018-11-09 11:17:34 -08:00
|
|
|
std::unique_ptr<SliceTransform> internal_prefix_transform;
|
2018-06-26 15:56:26 -07:00
|
|
|
std::shared_ptr<const SliceTransform> table_prefix_extractor;
|
2017-03-03 18:09:43 -08:00
|
|
|
|
Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found)
```
Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones? | avg micros/op | stddev micros/op | avg ops/s | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None | 0.6186 | 0.04637 | 1,625,252.90 | 124,679.41
500 Expanded | 0.6019 | 0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded | 0.6435 | 0.03994 | 1,559,979.40 | 104,090.52
1k Expanded | 0.6034 | 0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded | 0.6261 | 0.03093 | 1,600,457.50 | 79,024.94
5k Expanded | 0.6163 | 0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded | 0.6402 | 0.04002 | 1,567,804.70 | 100,965.55
10k Expanded | 0.6036 | 0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded | 0.6128 | 0.02598 | 1,634,633.40 | 72,161.82
25k Expanded | 0.6198 | 0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded | 0.5478 | 0.0362 | 1,833,059.10 | 121,233.81
50k Expanded | 0.5104 | 0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded | 0.4528 | 0.03387 | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493
Differential Revision: D10842844
Pulled By: abhimadan
fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
2018-10-25 19:25:00 -07:00
|
|
|
std::shared_ptr<const FragmentedRangeTombstoneList> fragmented_range_dels;
|
2017-03-03 18:09:43 -08:00
|
|
|
|
|
|
|
// If global_seqno is used, all keys in this file will have the same
|
|
|
|
// seqno with value `global_seqno`.
|
|
|
|
//
|
|
|
|
// A value of kDisableGlobalSequenceNumber means that this feature is disabled
|
|
|
|
// and every key has its own seqno.
|
|
|
|
SequenceNumber global_seqno;
|
2018-02-07 15:42:35 -08:00
|
|
|
|
2020-06-02 12:27:59 -07:00
|
|
|
// Size of the table file on disk
|
|
|
|
uint64_t file_size;
|
|
|
|
|
2018-10-24 12:10:59 -07:00
|
|
|
// the level when the table is opened, could potentially change when trivial
|
|
|
|
// move is involved
|
|
|
|
int level;
|
|
|
|
|
2018-02-07 15:42:35 -08:00
|
|
|
// If false, blocks in this file are definitely all uncompressed. Knowing this
|
|
|
|
// before reading individual blocks enables certain optimizations.
|
|
|
|
bool blocks_maybe_compressed = true;
|
|
|
|
|
2019-01-23 18:11:08 -08:00
|
|
|
// If true, data blocks in this file are definitely ZSTD compressed. If false
|
|
|
|
// they might not be. When false we skip creating a ZSTD digested
|
|
|
|
// uncompression dictionary. Even if we get a false negative, things should
|
|
|
|
// still work, just not as quickly.
|
|
|
|
bool blocks_definitely_zstd_compressed = false;
|
|
|
|
|
Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of the merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can sometimes reduce read amplification substantially.
Consider the following workload. There are a few data streams (we'll call them "logs"), each consisting of a sequence of blobs (we'll call them "records"). Each record is identified by a log ID and a sequence number within the log. The RocksDB key is the concatenation of the log ID and the sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, the iterator's Seek() reads one block from each sst file. Then a series of Next()s moves through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in a small number of sst files; in this case, the cost of the initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains the first key of each block, we don't have to read the block until it actually makes it to the top of the merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside the value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch the deferred value and get its status. I'll do it in a separate PR.
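The idea of deferring block reads until a file reaches the top of the merge heap can be illustrated with a toy model. This is a minimal sketch, not RocksDB code: `FileSketch`, `Cursor`, and `ScanSketch` are hypothetical names, in-memory vectors stand in for sst blocks, and the scan starts at the smallest key instead of seeking to a target.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical stand-in for one sst file: each block is a sorted run of keys.
// blocks[i][0] plays the role of the first key stored in the index.
struct FileSketch {
  std::vector<std::vector<std::string>> blocks;
  int blocks_read = 0;  // counts simulated block reads
};

struct Cursor {
  FileSketch* file;
  size_t block;
  size_t pos;
  bool loaded;  // false while only the index's first key is known
  const std::string& key() const {
    assert(!file->blocks[block].empty());  // assumption: no empty blocks
    return file->blocks[block][pos];
  }
};

// Merges keys from all files in order and returns the first `limit` keys.
// A block counts as read only when its cursor is popped from the heap top.
std::vector<std::string> ScanSketch(std::vector<FileSketch>& files,
                                    size_t limit) {
  auto greater = [](const Cursor& a, const Cursor& b) {
    return a.key() > b.key();
  };
  std::priority_queue<Cursor, std::vector<Cursor>, decltype(greater)> heap(
      greater);
  for (FileSketch& f : files) {
    // Positioning uses only the first key from the index: no block read yet.
    if (!f.blocks.empty()) heap.push(Cursor{&f, 0, 0, false});
  }
  std::vector<std::string> out;
  while (!heap.empty() && out.size() < limit) {
    Cursor c = heap.top();
    heap.pop();
    if (!c.loaded) {  // deferred load happens here, at the top of the heap
      ++c.file->blocks_read;
      c.loaded = true;
    }
    out.push_back(c.key());
    if (++c.pos == c.file->blocks[c.block].size()) {
      ++c.block;  // next block starts unloaded, known only by its first key
      c.pos = 0;
      c.loaded = false;
    }
    if (c.block < c.file->blocks.size()) heap.push(c);
  }
  return out;
}
```

In this model, a short scan whose results all come from one file leaves `blocks_read` at zero for every other file, which is the saving the commit message describes.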
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:50:35 -07:00
|
|
|
// These describe how index is encoded.
|
|
|
|
bool index_has_first_key = false;
|
|
|
|
bool index_key_includes_seq = true;
|
|
|
|
bool index_value_is_full = true;
|
|
|
|
|
2018-06-27 17:09:29 -07:00
|
|
|
const bool immortal_table;
|
2018-11-13 17:00:49 -08:00
|
|
|
|
2019-06-06 11:28:54 -07:00
|
|
|
SequenceNumber get_global_seqno(BlockType block_type) const {
|
|
|
|
return (block_type == BlockType::kFilter ||
|
|
|
|
block_type == BlockType::kCompressionDictionary)
|
|
|
|
? kDisableGlobalSequenceNumber
|
|
|
|
: global_seqno;
|
2018-11-13 17:00:49 -08:00
|
|
|
}
|
2019-06-14 17:37:24 -07:00
|
|
|
|
|
|
|
uint64_t cf_id_for_tracing() const {
|
2020-02-20 12:07:53 -08:00
|
|
|
return table_properties
|
|
|
|
? table_properties->column_family_id
|
|
|
|
: ROCKSDB_NAMESPACE::TablePropertiesCollectorFactory::Context::
|
|
|
|
kUnknownColumnFamily;
|
2019-06-14 17:37:24 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
Slice cf_name_for_tracing() const {
|
|
|
|
return table_properties ? table_properties->column_family_name
|
|
|
|
: BlockCacheTraceHelper::kUnknownColumnFamilyName;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint32_t level_for_tracing() const { return level >= 0 ? level : UINT32_MAX; }
|
|
|
|
|
|
|
|
uint64_t sst_number_for_tracing() const {
|
|
|
|
return file ? TableFileNameToNumber(file->file_name()) : UINT64_MAX;
|
|
|
|
}
|
2021-04-28 12:52:53 -07:00
|
|
|
void CreateFilePrefetchBuffer(size_t readahead_size,
|
|
|
|
size_t max_readahead_size,
|
|
|
|
std::unique_ptr<FilePrefetchBuffer>* fpb,
|
|
|
|
bool implicit_auto_readahead) const {
|
2021-11-19 17:52:42 -08:00
|
|
|
fpb->reset(new FilePrefetchBuffer(readahead_size, max_readahead_size,
|
|
|
|
!ioptions.allow_mmap_reads /* enable */,
|
|
|
|
false /* track_min_offset */,
|
|
|
|
implicit_auto_readahead));
|
2019-12-18 10:59:21 -08:00
|
|
|
}
|
2020-08-27 18:15:11 -07:00
|
|
|
|
|
|
|
void CreateFilePrefetchBufferIfNotExists(
|
|
|
|
size_t readahead_size, size_t max_readahead_size,
|
2021-04-28 12:52:53 -07:00
|
|
|
std::unique_ptr<FilePrefetchBuffer>* fpb,
|
|
|
|
bool implicit_auto_readahead) const {
|
2020-08-27 18:15:11 -07:00
|
|
|
if (!(*fpb)) {
|
2021-04-28 12:52:53 -07:00
|
|
|
CreateFilePrefetchBuffer(readahead_size, max_readahead_size, fpb,
|
|
|
|
implicit_auto_readahead);
|
2020-08-27 18:15:11 -07:00
|
|
|
}
|
|
|
|
}
|
2017-03-03 18:09:43 -08:00
|
|
|
};
|
2020-09-04 19:25:20 -07:00
|
|
|
|
|
|
|
// This is an adapter class for `WritableFile` to be used for `std::ostream`.
|
|
|
|
// The adapter wraps a `WritableFile`, which can be passed to a `std::ostream`
|
|
|
|
// constructor for storing streaming data.
|
|
|
|
// Note:
|
|
|
|
// * This adapter doesn't provide any buffering, each write is forwarded to
|
|
|
|
// `WritableFile->Append()` directly.
|
|
|
|
// * For a failed write, the user needs to check the status by `ostream.good()`
|
|
|
|
class WritableFileStringStreamAdapter : public std::stringbuf {
|
|
|
|
public:
|
|
|
|
explicit WritableFileStringStreamAdapter(WritableFile* writable_file)
|
|
|
|
: file_(writable_file) {}
|
|
|
|
|
Append all characters not captured by xsputn() in overflow() function (#7991)
Summary:
In the adapter class `WritableFileStringStreamAdapter`, which wraps WritableFile to be used for std::ostream, previously only `std::endl` was considered a special case because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` calls `sputc()`, and if we check the internal implementation of `sputc()`, we see it is
```
int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) { // put a character
return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch));
}
```
As we explicitly disabled buffering, _Pnavail() is always 0. Thus every write not captured by xsputn becomes an overflow.
When running tests on Windows, I found that not only `std::endl` falls into this case: writing an unsigned long long also calls `os.put()`, followed by `sputc()`, and eventually `overflow()`. Therefore, instead of only checking for `std::endl`, we should append any other character as well, and report failure only if the append operation fails.
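The behavior described above can be reproduced with a small standalone stream buffer. This is a sketch under assumptions: `UnbufferedSinkSketch` and `WriteSampleSketch` are hypothetical names, a `std::string` stands in for the `WritableFile`, and the class derives from `std::streambuf` rather than `std::stringbuf`. Which characters arrive through `overflow()` versus `xsputn()` depends on the standard library implementation, so the sketch handles both paths.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical unbuffered sink: it never sets up a put area, so every
// sputc() finds no room and falls through to overflow().
class UnbufferedSinkSketch : public std::streambuf {
 public:
  explicit UnbufferedSinkSketch(std::string* sink) : sink_(sink) {
    assert(sink != nullptr);
  }

 protected:
  // Single characters (e.g. from os.put() or std::endl) arrive here.
  int_type overflow(int_type ch = traits_type::eof()) override {
    if (traits_type::eq_int_type(ch, traits_type::eof())) {
      return traits_type::eof();
    }
    sink_->push_back(traits_type::to_char_type(ch));
    return ch;
  }

  // Bulk writes (e.g. string literals via sputn()) arrive here.
  std::streamsize xsputn(const char* p, std::streamsize n) override {
    sink_->append(p, static_cast<std::size_t>(n));
    return n;
  }

 private:
  std::string* sink_;
};

// Writes a string, an unsigned long long, and std::endl through the sink.
std::string WriteSampleSketch() {
  std::string sink;
  UnbufferedSinkSketch buf(&sink);
  std::ostream os(&buf);
  os << "key=" << 42ULL << std::endl;
  return sink;
}
```

If `overflow()` only handled the newline, any digits routed through `sputc()` would be dropped silently, which is the failure mode the fix addresses.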
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991
Reviewed By: jay-zhuang
Differential Revision: D26615692
Pulled By: ajkr
fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468
2021-02-23 21:42:55 -08:00
|
|
|
// Override overflow() to handle `sputc()`. Some writes do not go
|
|
|
|
// through `xsputn()`: e.g. `std::endl` or an unsigned long long is written by
|
|
|
|
// `os.put()` directly, which calls `sputc()`. Per its internal implementation:
|
|
|
|
// int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) { // put a character
|
|
|
|
// return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) :
|
|
|
|
// overflow(_Traits::to_int_type(_Ch));
|
|
|
|
// }
|
|
|
|
// As we explicitly disabled buffering (_Pnavail() is always 0), every write,
|
|
|
|
// not captured by xsputn(), becomes an overflow here.
|
2020-09-04 19:25:20 -07:00
|
|
|
int overflow(int ch = EOF) override {
|
2021-02-23 21:42:55 -08:00
|
|
|
if (ch != EOF) {
|
|
|
|
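// Copy to a char first: reading the first byte of the int is endian-dependent.
|
|
|
|
char c = static_cast<char>(ch);
|
|
|
|
Status s = file_->Append(Slice(&c, 1));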
|
|
|
|
if (s.ok()) {
|
|
|
|
return ch;
|
|
|
|
}
|
2020-09-04 19:25:20 -07:00
|
|
|
}
|
|
|
|
return EOF;
|
|
|
|
}
|
|
|
|
|
|
|
|
std::streamsize xsputn(char const* p, std::streamsize n) override {
|
|
|
|
Status s = file_->Append(Slice(p, n));
|
|
|
|
if (!s.ok()) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return n;
|
|
|
|
}
|
|
|
|
|
|
|
|
private:
|
|
|
|
WritableFile* file_;
|
|
|
|
};
|
|
|
|
|
2020-02-20 12:07:53 -08:00
|
|
|
} // namespace ROCKSDB_NAMESPACE
|