e45673dece
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix).

But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
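
The opt-in contract described in the summary can be sketched roughly as follows. This is a toy model, not RocksDB code: `ToyDeferredIterator`, its `Entry` struct, the `load_fails` flag, the `loads()` counter and `CollectValues()` are all hypothetical names made up for illustration. Only the `PrepareValue()` / `value()` / `allow_unprepared_value` vocabulary comes from the commit message, and the real signatures and error reporting may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an internal iterator whose value may live in a
// block that has not been read yet. Illustrative only, not the RocksDB API.
class ToyDeferredIterator {
 public:
  struct Entry {
    std::string key;
    std::string value;
    bool load_fails;  // simulates an IO error while loading the value
  };

  ToyDeferredIterator(std::vector<Entry> entries, bool allow_unprepared_value)
      : entries_(std::move(entries)),
        allow_unprepared_value_(allow_unprepared_value) {}

  void SeekToFirst() { pos_ = 0; AfterMove(); }
  void Next() { assert(Valid()); ++pos_; AfterMove(); }
  bool Valid() const { return ok_ && pos_ < entries_.size(); }
  bool ok() const { return ok_; }
  const std::string& key() const { assert(Valid()); return entries_[pos_].key; }

  // Loads the value if it has not been loaded yet. Returns false on failure,
  // after which the iterator is no longer valid and ok() reports the error.
  bool PrepareValue() {
    assert(Valid());
    if (prepared_) return true;
    if (entries_[pos_].load_fails) { ok_ = false; return false; }
    ++loads_;
    prepared_ = true;
    return true;
  }

  // Only legal once the value has been prepared.
  const std::string& value() const {
    assert(Valid() && prepared_);
    return entries_[pos_].value;
  }

  std::size_t loads() const { return loads_; }

 private:
  void AfterMove() {
    prepared_ = false;
    // Callers that did not opt in get the value loaded eagerly, so they can
    // keep calling value() without any extra error handling.
    if (!allow_unprepared_value_ && Valid()) PrepareValue();
  }

  std::vector<Entry> entries_;
  bool allow_unprepared_value_;
  std::size_t pos_ = 0;
  bool prepared_ = false;
  bool ok_ = true;
  std::size_t loads_ = 0;
};

// A call site that opted in: it must call PrepareValue() before value().
bool CollectValues(ToyDeferredIterator& it, std::vector<std::string>* out) {
  for (it.SeekToFirst(); it.Valid(); it.Next()) {
    if (!it.PrepareValue()) return false;
    out->push_back(it.value());
  }
  return it.ok();
}
```

In this sketch a caller that only inspects `key()` never triggers a load, while a caller constructed with `allow_unprepared_value = false` sees the old behaviour where the value is ready as soon as the iterator is positioned.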
102 lines
3.4 KiB
C++
//  Copyright (c) 2011-present, Facebook, Inc.  All rights reserved.
//  This source code is licensed under both the GPLv2 (found in the
//  COPYING file in the root directory) and Apache 2.0 License
//  (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#pragma once
#ifndef ROCKSDB_LITE
#include <string>
#include <memory>
#include <utility>
#include <vector>

#include "db/dbformat.h"
#include "file/random_access_file_reader.h"
#include "options/cf_options.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "table/table_reader.h"

namespace ROCKSDB_NAMESPACE {

class Arena;
class TableReader;

class CuckooTableReader: public TableReader {
 public:
  CuckooTableReader(const ImmutableCFOptions& ioptions,
                    std::unique_ptr<RandomAccessFileReader>&& file,
                    uint64_t file_size, const Comparator* user_comparator,
                    uint64_t (*get_slice_hash)(const Slice&, uint32_t,
                                               uint64_t));
  ~CuckooTableReader() {}

  std::shared_ptr<const TableProperties> GetTableProperties() const override {
    return table_props_;
  }

  Status status() const { return status_; }

  Status Get(const ReadOptions& readOptions, const Slice& key,
             GetContext* get_context, const SliceTransform* prefix_extractor,
             bool skip_filters = false) override;

  // Returns a new iterator over table contents
  // compaction_readahead_size: its value will only be used if for_compaction =
  // true
  InternalIterator* NewIterator(const ReadOptions&,
                                const SliceTransform* prefix_extractor,
                                Arena* arena, bool skip_filters,
                                TableReaderCaller caller,
                                size_t compaction_readahead_size = 0,
                                bool allow_unprepared_value = false) override;
  void Prepare(const Slice& target) override;

  // Report an approximation of how much memory has been used.
  size_t ApproximateMemoryUsage() const override;

  // Following methods are not implemented for Cuckoo Table Reader
  uint64_t ApproximateOffsetOf(const Slice& /*key*/,
                               TableReaderCaller /*caller*/) override {
    return 0;
  }

  uint64_t ApproximateSize(const Slice& /*start*/, const Slice& /*end*/,
                           TableReaderCaller /*caller*/) override {
    return 0;
  }

  void SetupForCompaction() override {}
  // End of methods not implemented.

 private:
  friend class CuckooTableIterator;
  void LoadAllKeys(std::vector<std::pair<Slice, uint32_t>>* key_to_bucket_id);
  std::unique_ptr<RandomAccessFileReader> file_;
  Slice file_data_;
  bool is_last_level_;
  bool identity_as_first_hash_;
  bool use_module_hash_;
  std::shared_ptr<const TableProperties> table_props_;
  Status status_;
  uint32_t num_hash_func_;
  std::string unused_key_;
  uint32_t key_length_;
  uint32_t user_key_length_;
  uint32_t value_length_;
  uint32_t bucket_length_;
  uint32_t cuckoo_block_size_;
  uint32_t cuckoo_block_bytes_minus_one_;
  uint64_t table_size_;
  const Comparator* ucomp_;
  uint64_t (*get_slice_hash_)(const Slice& s, uint32_t index,
                              uint64_t max_num_buckets);
};

}  // namespace ROCKSDB_NAMESPACE
#endif  // ROCKSDB_LITE