De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 20:17:34 +01:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
|
|
//
|
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
#include "table/block_based/block_based_table_iterator.h"
|
|
|
|
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
void BlockBasedTableIterator::Seek(const Slice& target) { SeekImpl(&target); }
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::SeekToFirst() { SeekImpl(nullptr); }
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::SeekImpl(const Slice* target) {
|
|
|
|
is_out_of_bound_ = false;
|
|
|
|
is_at_first_key_from_index_ = false;
|
|
|
|
if (target && !CheckPrefixMayMatch(*target, IterDirection::kForward)) {
|
|
|
|
ResetDataIter();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool need_seek_index = true;
|
|
|
|
if (block_iter_points_to_real_block_ && block_iter_.Valid()) {
|
|
|
|
// Reseek.
|
|
|
|
prev_block_offset_ = index_iter_->value().handle.offset();
|
|
|
|
|
|
|
|
if (target) {
|
|
|
|
// We can avoid an index seek if:
|
|
|
|
// 1. The new seek key is larger than the current key
|
|
|
|
// 2. The new seek key is within the upper bound of the block
|
|
|
|
// Since we don't necessarily know the internal key for either
|
|
|
|
// the current key or the upper bound, we check user keys and
|
|
|
|
// exclude the equality case. Considering internal keys can
|
|
|
|
// improve for the boundary cases, but it would complicate the
|
|
|
|
// code.
|
|
|
|
if (user_comparator_.Compare(ExtractUserKey(*target),
|
|
|
|
block_iter_.user_key()) > 0 &&
|
|
|
|
user_comparator_.Compare(ExtractUserKey(*target),
|
|
|
|
index_iter_->user_key()) < 0) {
|
|
|
|
need_seek_index = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (need_seek_index) {
|
|
|
|
if (target) {
|
|
|
|
index_iter_->Seek(*target);
|
|
|
|
} else {
|
|
|
|
index_iter_->SeekToFirst();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!index_iter_->Valid()) {
|
|
|
|
ResetDataIter();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
IndexValue v = index_iter_->value();
|
|
|
|
const bool same_block = block_iter_points_to_real_block_ &&
|
|
|
|
v.handle.offset() == prev_block_offset_;
|
|
|
|
|
|
|
|
if (!v.first_internal_key.empty() && !same_block &&
|
|
|
|
(!target || icomp_.Compare(*target, v.first_internal_key) <= 0) &&
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-16 02:37:23 +02:00
|
|
|
allow_unprepared_value_) {
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 20:17:34 +01:00
|
|
|
// Index contains the first key of the block, and it's >= target.
|
|
|
|
// We can defer reading the block.
|
|
|
|
is_at_first_key_from_index_ = true;
|
|
|
|
// ResetDataIter() will invalidate block_iter_. Thus, there is no need to
|
|
|
|
// call CheckDataBlockWithinUpperBound() to check for iterate_upper_bound
|
|
|
|
// as that will be done later when the data block is actually read.
|
|
|
|
ResetDataIter();
|
|
|
|
} else {
|
|
|
|
// Need to use the data block.
|
|
|
|
if (!same_block) {
|
|
|
|
InitDataBlock();
|
|
|
|
} else {
|
|
|
|
// When the user does a reseek, the iterate_upper_bound might have
|
|
|
|
// changed. CheckDataBlockWithinUpperBound() needs to be called
|
|
|
|
// explicitly if the reseek ends up in the same data block.
|
|
|
|
// If the reseek ends up in a different block, InitDataBlock() will do
|
|
|
|
// the iterator upper bound check.
|
|
|
|
CheckDataBlockWithinUpperBound();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (target) {
|
|
|
|
block_iter_.Seek(*target);
|
|
|
|
} else {
|
|
|
|
block_iter_.SeekToFirst();
|
|
|
|
}
|
|
|
|
FindKeyForward();
|
|
|
|
}
|
|
|
|
|
|
|
|
CheckOutOfBound();
|
|
|
|
|
|
|
|
if (target) {
|
|
|
|
assert(!Valid() || icomp_.Compare(*target, key()) <= 0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::SeekForPrev(const Slice& target) {
|
|
|
|
is_out_of_bound_ = false;
|
|
|
|
is_at_first_key_from_index_ = false;
|
|
|
|
// For now totally disable prefix seek in auto prefix mode because we don't
|
|
|
|
// have logic
|
|
|
|
if (!CheckPrefixMayMatch(target, IterDirection::kBackward)) {
|
|
|
|
ResetDataIter();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
SavePrevIndexValue();
|
|
|
|
|
|
|
|
// Call Seek() rather than SeekForPrev() in the index block, because the
|
|
|
|
// target data block will likely to contain the position for `target`, the
|
|
|
|
// same as Seek(), rather than than before.
|
|
|
|
// For example, if we have three data blocks, each containing two keys:
|
|
|
|
// [2, 4] [6, 8] [10, 12]
|
|
|
|
// (the keys in the index block would be [4, 8, 12])
|
|
|
|
// and the user calls SeekForPrev(7), we need to go to the second block,
|
|
|
|
// just like if they call Seek(7).
|
|
|
|
// The only case where the block is difference is when they seek to a position
|
|
|
|
// in the boundary. For example, if they SeekForPrev(5), we should go to the
|
|
|
|
// first block, rather than the second. However, we don't have the information
|
|
|
|
// to distinguish the two unless we read the second block. In this case, we'll
|
|
|
|
// end up with reading two blocks.
|
|
|
|
index_iter_->Seek(target);
|
|
|
|
|
|
|
|
if (!index_iter_->Valid()) {
|
|
|
|
auto seek_status = index_iter_->status();
|
|
|
|
// Check for IO error
|
|
|
|
if (!seek_status.IsNotFound() && !seek_status.ok()) {
|
|
|
|
ResetDataIter();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
// With prefix index, Seek() returns NotFound if the prefix doesn't exist
|
|
|
|
if (seek_status.IsNotFound()) {
|
|
|
|
// Any key less than the target is fine for prefix seek
|
|
|
|
ResetDataIter();
|
|
|
|
return;
|
|
|
|
} else {
|
|
|
|
index_iter_->SeekToLast();
|
|
|
|
}
|
|
|
|
// Check for IO error
|
|
|
|
if (!index_iter_->Valid()) {
|
|
|
|
ResetDataIter();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
InitDataBlock();
|
|
|
|
|
|
|
|
block_iter_.SeekForPrev(target);
|
|
|
|
|
|
|
|
FindKeyBackward();
|
|
|
|
CheckDataBlockWithinUpperBound();
|
|
|
|
assert(!block_iter_.Valid() ||
|
|
|
|
icomp_.Compare(target, block_iter_.key()) >= 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::SeekToLast() {
|
|
|
|
is_out_of_bound_ = false;
|
|
|
|
is_at_first_key_from_index_ = false;
|
|
|
|
SavePrevIndexValue();
|
|
|
|
index_iter_->SeekToLast();
|
|
|
|
if (!index_iter_->Valid()) {
|
|
|
|
ResetDataIter();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
InitDataBlock();
|
|
|
|
block_iter_.SeekToLast();
|
|
|
|
FindKeyBackward();
|
|
|
|
CheckDataBlockWithinUpperBound();
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::Next() {
|
|
|
|
if (is_at_first_key_from_index_ && !MaterializeCurrentBlock()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
assert(block_iter_points_to_real_block_);
|
|
|
|
block_iter_.Next();
|
|
|
|
FindKeyForward();
|
|
|
|
CheckOutOfBound();
|
|
|
|
}
|
|
|
|
|
|
|
|
bool BlockBasedTableIterator::NextAndGetResult(IterateResult* result) {
|
|
|
|
Next();
|
|
|
|
bool is_valid = Valid();
|
|
|
|
if (is_valid) {
|
|
|
|
result->key = key();
|
2020-08-05 19:42:56 +02:00
|
|
|
result->bound_check_result = UpperBoundCheckResult();
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-16 02:37:23 +02:00
|
|
|
result->value_prepared = !is_at_first_key_from_index_;
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 20:17:34 +01:00
|
|
|
}
|
|
|
|
return is_valid;
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::Prev() {
|
|
|
|
if (is_at_first_key_from_index_) {
|
|
|
|
is_at_first_key_from_index_ = false;
|
|
|
|
|
|
|
|
index_iter_->Prev();
|
|
|
|
if (!index_iter_->Valid()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
InitDataBlock();
|
|
|
|
block_iter_.SeekToLast();
|
|
|
|
} else {
|
|
|
|
assert(block_iter_points_to_real_block_);
|
|
|
|
block_iter_.Prev();
|
|
|
|
}
|
|
|
|
|
|
|
|
FindKeyBackward();
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::InitDataBlock() {
|
|
|
|
BlockHandle data_block_handle = index_iter_->value().handle;
|
|
|
|
if (!block_iter_points_to_real_block_ ||
|
|
|
|
data_block_handle.offset() != prev_block_offset_ ||
|
|
|
|
// if previous attempt of reading the block missed cache, try again
|
|
|
|
block_iter_.status().IsIncomplete()) {
|
|
|
|
if (block_iter_points_to_real_block_) {
|
|
|
|
ResetDataIter();
|
|
|
|
}
|
|
|
|
auto* rep = table_->get_rep();
|
|
|
|
|
|
|
|
bool is_for_compaction =
|
|
|
|
lookup_context_.caller == TableReaderCaller::kCompaction;
|
|
|
|
// Prefetch additional data for range scans (iterators).
|
|
|
|
// Implicit auto readahead:
|
|
|
|
// Enabled after 2 sequential IOs when ReadOptions.readahead_size == 0.
|
|
|
|
// Explicit user requested readahead:
|
|
|
|
// Enabled from the very first IO when ReadOptions.readahead_size is set.
|
2022-03-21 15:12:43 +01:00
|
|
|
block_prefetcher_.PrefetchIfNeeded(
|
|
|
|
rep, data_block_handle, read_options_.readahead_size, is_for_compaction,
|
Set Read rate limiter priority dynamically and pass it to FS (#9996)
Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
### Solution
User, Flush, and Compaction reads share some code path. For this task, we update the rate_limiter_priority in ReadOptions for code paths (e.g. FindTable (mainly in BlockBasedTable::Open()) and various iterators), and eventually update the rate_limiter_priority in IOOptions for FSRandomAccessFile.
**This PR is for the Read path.** The **Read:** dynamic priority for different state are listed as follows:
| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
| Flush (verification read in BuildTable()) | IO_USER | IO_USER | IO_USER |
| Compaction | IO_LOW | IO_USER | IO_USER |
| User | User provided | User provided | User provided |
We will respect the read_options that the user provided and will not set it.
The only sst read for Flush is the verification read in BuildTable(). It claims to be "regard as user read".
**Details**
1. Set read_options.rate_limiter_priority dynamically:
- User: Do not update the read_options. Use the read_options that the user provided.
- Compaction: Update read_options in CompactionJob::ProcessKeyValueCompaction().
- Flush: Update read_options in BuildTable().
2. Pass the rate limiter priority to FSRandomAccessFile functions:
- After calling the FindTable(), read_options is passed through GetTableReader(table_cache.cc), BlockBasedTableFactory::NewTableReader(block_based_table_factory.cc), and BlockBasedTable::Open(). The Open() needs some updates for the ReadOptions variable and the updates are also needed for the called functions, including PrefetchTail(), PrepareIOOptions(), ReadFooterFromFile(), ReadMetaIndexblock(), ReadPropertiesBlock(), PrefetchIndexAndFilterBlocks(), and ReadRangeDelBlock().
- In RandomAccessFileReader, the functions to be updated include Read(), MultiRead(), ReadAsync(), and Prefetch().
- Update the downstream functions of NewIndexIterator(), NewDataBlockIterator(), and BlockBasedTableIterator().
### Test Plans
Add unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9996
Reviewed By: anand1976
Differential Revision: D36452483
Pulled By: gitbw95
fbshipit-source-id: 60978204a4f849bb9261cb78d9bc1cb56d6008cf
2022-05-19 04:41:44 +02:00
|
|
|
read_options_.async_io, read_options_.rate_limiter_priority);
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 20:17:34 +01:00
|
|
|
Status s;
|
|
|
|
table_->NewDataBlockIterator<DataBlockIter>(
|
|
|
|
read_options_, data_block_handle, &block_iter_, BlockType::kData,
|
|
|
|
/*get_context=*/nullptr, &lookup_context_, s,
|
|
|
|
block_prefetcher_.prefetch_buffer(),
|
|
|
|
/*for_compaction=*/is_for_compaction);
|
|
|
|
block_iter_points_to_real_block_ = true;
|
|
|
|
CheckDataBlockWithinUpperBound();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
bool BlockBasedTableIterator::MaterializeCurrentBlock() {
|
|
|
|
assert(is_at_first_key_from_index_);
|
|
|
|
assert(!block_iter_points_to_real_block_);
|
|
|
|
assert(index_iter_->Valid());
|
|
|
|
|
|
|
|
is_at_first_key_from_index_ = false;
|
|
|
|
InitDataBlock();
|
|
|
|
assert(block_iter_points_to_real_block_);
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-16 02:37:23 +02:00
|
|
|
|
|
|
|
if (!block_iter_.status().ok()) {
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 20:17:34 +01:00
|
|
|
block_iter_.SeekToFirst();
|
|
|
|
|
|
|
|
if (!block_iter_.Valid() ||
|
|
|
|
icomp_.Compare(block_iter_.key(),
|
|
|
|
index_iter_->value().first_internal_key) != 0) {
|
|
|
|
block_iter_.Invalidate(Status::Corruption(
|
|
|
|
"first key in index doesn't match first key in block"));
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::FindKeyForward() {
|
|
|
|
// This method's code is kept short to make it likely to be inlined.
|
|
|
|
|
|
|
|
assert(!is_out_of_bound_);
|
|
|
|
assert(block_iter_points_to_real_block_);
|
|
|
|
|
|
|
|
if (!block_iter_.Valid()) {
|
|
|
|
// This is the only call site of FindBlockForward(), but it's extracted into
|
|
|
|
// a separate method to keep FindKeyForward() short and likely to be
|
|
|
|
// inlined. When transitioning to a different block, we call
|
|
|
|
// FindBlockForward(), which is much longer and is probably not inlined.
|
|
|
|
FindBlockForward();
|
|
|
|
} else {
|
|
|
|
// This is the fast path that avoids a function call.
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::FindBlockForward() {
|
|
|
|
// TODO the while loop inherits from two-level-iterator. We don't know
|
|
|
|
// whether a block can be empty so it can be replaced by an "if".
|
|
|
|
do {
|
|
|
|
if (!block_iter_.status().ok()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
// Whether next data block is out of upper bound, if there is one.
|
|
|
|
const bool next_block_is_out_of_bound =
|
|
|
|
read_options_.iterate_upper_bound != nullptr &&
|
2020-08-04 20:28:02 +02:00
|
|
|
block_iter_points_to_real_block_ &&
|
|
|
|
block_upper_bound_check_ == BlockUpperBound::kUpperBoundInCurBlock;
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 20:17:34 +01:00
|
|
|
assert(!next_block_is_out_of_bound ||
|
|
|
|
user_comparator_.CompareWithoutTimestamp(
|
|
|
|
*read_options_.iterate_upper_bound, /*a_has_ts=*/false,
|
|
|
|
index_iter_->user_key(), /*b_has_ts=*/true) <= 0);
|
|
|
|
ResetDataIter();
|
|
|
|
index_iter_->Next();
|
|
|
|
if (next_block_is_out_of_bound) {
|
|
|
|
// The next block is out of bound. No need to read it.
|
|
|
|
TEST_SYNC_POINT_CALLBACK("BlockBasedTableIterator:out_of_bound", nullptr);
|
|
|
|
// We need to make sure this is not the last data block before setting
|
|
|
|
// is_out_of_bound_, since the index key for the last data block can be
|
|
|
|
// larger than smallest key of the next file on the same level.
|
|
|
|
if (index_iter_->Valid()) {
|
|
|
|
is_out_of_bound_ = true;
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!index_iter_->Valid()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
IndexValue v = index_iter_->value();
|
|
|
|
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-16 02:37:23 +02:00
|
|
|
if (!v.first_internal_key.empty() && allow_unprepared_value_) {
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 20:17:34 +01:00
|
|
|
// Index contains the first key of the block. Defer reading the block.
|
|
|
|
is_at_first_key_from_index_ = true;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
InitDataBlock();
|
|
|
|
block_iter_.SeekToFirst();
|
|
|
|
} while (!block_iter_.Valid());
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::FindKeyBackward() {
|
|
|
|
while (!block_iter_.Valid()) {
|
|
|
|
if (!block_iter_.status().ok()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
ResetDataIter();
|
|
|
|
index_iter_->Prev();
|
|
|
|
|
|
|
|
if (index_iter_->Valid()) {
|
|
|
|
InitDataBlock();
|
|
|
|
block_iter_.SeekToLast();
|
|
|
|
} else {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// We could have check lower bound here too, but we opt not to do it for
|
|
|
|
// code simplicity.
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::CheckOutOfBound() {
|
2020-08-04 20:28:02 +02:00
|
|
|
if (read_options_.iterate_upper_bound != nullptr &&
|
|
|
|
block_upper_bound_check_ != BlockUpperBound::kUpperBoundBeyondCurBlock &&
|
|
|
|
Valid()) {
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 20:17:34 +01:00
|
|
|
is_out_of_bound_ =
|
|
|
|
user_comparator_.CompareWithoutTimestamp(
|
|
|
|
*read_options_.iterate_upper_bound, /*a_has_ts=*/false, user_key(),
|
|
|
|
/*b_has_ts=*/true) <= 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::CheckDataBlockWithinUpperBound() {
|
|
|
|
if (read_options_.iterate_upper_bound != nullptr &&
|
|
|
|
block_iter_points_to_real_block_) {
|
2020-08-04 20:28:02 +02:00
|
|
|
block_upper_bound_check_ = (user_comparator_.CompareWithoutTimestamp(
|
|
|
|
*read_options_.iterate_upper_bound,
|
|
|
|
/*a_has_ts=*/false, index_iter_->user_key(),
|
|
|
|
/*b_has_ts=*/true) > 0)
|
|
|
|
? BlockUpperBound::kUpperBoundBeyondCurBlock
|
|
|
|
: BlockUpperBound::kUpperBoundInCurBlock;
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 20:17:34 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|