5bf9a7d5ee
Summary:
Somewhat confusingly, index and filter partition blocks are never owned by table readers, even with cache_index_and_filter_blocks=false. They still go into block cache (possibly pinned by table reader) if there is a block cache. If no block cache, they are only loaded transiently on demand.

This PR primarily clarifies the options APIs and some internal code comments.

Also, this closes a hypothetical data corruption vulnerability where some but not all index partitions are pinned. I haven't been able to reproduce a case where it can happen (the failure seems to propagate to abort table open) but it's worth patching nonetheless.

Fixes https://github.com/facebook/rocksdb/issues/8979

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9068

Test Plan: existing tests :-/ I could cover the new code using sync points, but then I'd have to very carefully relax my `assert(false)`

Reviewed By: ajkr

Differential Revision: D31898284

Pulled By: pdillinger

fbshipit-source-id: f2511a7d3a36bc04b627935d8e6cfea6422f98be
55 lines
2.4 KiB
C++
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#pragma once
#include "table/block_based/index_reader_common.h"

namespace ROCKSDB_NAMESPACE {
// Index that allows binary search lookup in a two-level index structure.
class PartitionIndexReader : public BlockBasedTable::IndexReaderCommon {
 public:
  // Read the partition index from the file and create an instance for
  // `PartitionIndexReader`.
  // On success, index_reader will be populated; otherwise it will remain
  // unmodified.
  static Status Create(const BlockBasedTable* table, const ReadOptions& ro,
                       FilePrefetchBuffer* prefetch_buffer, bool use_cache,
                       bool prefetch, bool pin,
                       BlockCacheLookupContext* lookup_context,
                       std::unique_ptr<IndexReader>* index_reader);

  // return a two-level iterator: first level is on the partition index
  InternalIteratorBase<IndexValue>* NewIterator(
      const ReadOptions& read_options, bool /* disable_prefix_seek */,
      IndexBlockIter* iter, GetContext* get_context,
      BlockCacheLookupContext* lookup_context) override;

  Status CacheDependencies(const ReadOptions& ro, bool pin) override;
  size_t ApproximateMemoryUsage() const override {
    size_t usage = ApproximateIndexBlockMemoryUsage();
#ifdef ROCKSDB_MALLOC_USABLE_SIZE
    usage += malloc_usable_size(const_cast<PartitionIndexReader*>(this));
#else
    usage += sizeof(*this);
#endif  // ROCKSDB_MALLOC_USABLE_SIZE
    // TODO(myabandeh): more accurate estimate of partition_map_ mem usage
    return usage;
  }

 private:
  PartitionIndexReader(const BlockBasedTable* t,
                       CachableEntry<Block>&& index_block)
      : IndexReaderCommon(t, std::move(index_block)) {}

  // For partition blocks pinned in cache. This is expected to be "all or
  // none" so that !partition_map_.empty() can use an iterator expecting
  // all partitions to be saved here.
  std::unordered_map<uint64_t, CachableEntry<Block>> partition_map_;
};
}  // namespace ROCKSDB_NAMESPACE
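
The "all or none" expectation on `partition_map_` can be illustrated with a small standalone sketch. This is not the RocksDB implementation: `PinnedBlock`, `LoadFn`, and `PinAllOrNone` are hypothetical names invented for illustration, and the sketch uses only the standard library (C++17). It shows one way to guarantee that a non-empty map always holds every partition, by staging entries and publishing them only if every load succeeds.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pinned cache entry; not a RocksDB type.
struct PinnedBlock {
  std::string contents;
};

// Hypothetical loader: returns the block stored at `offset`, or
// std::nullopt if the block could not be read or pinned.
using LoadFn = std::function<std::optional<PinnedBlock>(uint64_t offset)>;

// Tries to pin every partition listed in `offsets`.
// On success, `partition_map` holds exactly one entry per offset.
// On any failure, `partition_map` is left empty, so a caller can treat
// !partition_map->empty() as "every partition is present".
bool PinAllOrNone(const std::vector<uint64_t>& offsets, const LoadFn& load,
                  std::unordered_map<uint64_t, PinnedBlock>* partition_map) {
  // Stage results locally so a partial set is never visible to callers.
  std::unordered_map<uint64_t, PinnedBlock> staged;
  for (uint64_t offset : offsets) {
    std::optional<PinnedBlock> block = load(offset);
    if (!block.has_value()) {
      // One partition failed: drop everything rather than keep a subset.
      partition_map->clear();
      return false;
    }
    staged.emplace(offset, std::move(*block));
  }
  *partition_map = std::move(staged);
  return true;
}
```

Staging into a local map keeps the failure path simple: an iterator that assumes every partition is present in a non-empty map can never observe a half-filled one.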