rocksdb/table/internal_iterator.h
Mike Kolupaev b4d7209428 Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.

This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289

Differential Revision: D15256423

Pulled By: al13n321

fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:54:04 -07:00

163 lines
5.9 KiB
C++

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
#pragma once
#include <string>
#include "db/dbformat.h"
#include "rocksdb/comparator.h"
#include "rocksdb/iterator.h"
#include "rocksdb/status.h"
#include "table/format.h"
namespace rocksdb {
class PinnedIteratorsManager;
template <class TValue>
class InternalIteratorBase : public Cleanable {
public:
InternalIteratorBase() : is_mutable_(true) {}
InternalIteratorBase(bool _is_mutable) : is_mutable_(_is_mutable) {}
virtual ~InternalIteratorBase() {}
// An iterator is either positioned at a key/value pair, or
// not valid. This method returns true iff the iterator is valid.
// Always returns false if !status().ok().
virtual bool Valid() const = 0;
// Position at the first key in the source. The iterator is Valid()
// after this call iff the source is not empty.
virtual void SeekToFirst() = 0;
// Position at the last key in the source. The iterator is
// Valid() after this call iff the source is not empty.
virtual void SeekToLast() = 0;
// Position at the first key in the source that at or past target
// The iterator is Valid() after this call iff the source contains
// an entry that comes at or past target.
// All Seek*() methods clear any error status() that the iterator had prior to
// the call; after the seek, status() indicates only the error (if any) that
// happened during the seek, not any past errors.
virtual void Seek(const Slice& target) = 0;
// Position at the first key in the source that at or before target
// The iterator is Valid() after this call iff the source contains
// an entry that comes at or before target.
virtual void SeekForPrev(const Slice& target) = 0;
// Moves to the next entry in the source. After this call, Valid() is
// true iff the iterator was not positioned at the last entry in the source.
// REQUIRES: Valid()
virtual void Next() = 0;
virtual bool NextAndGetResult(Slice* ret_key) {
Next();
bool is_valid = Valid();
if (is_valid) {
*ret_key = key();
}
return is_valid;
}
// Moves to the previous entry in the source. After this call, Valid() is
// true iff the iterator was not positioned at the first entry in source.
// REQUIRES: Valid()
virtual void Prev() = 0;
// Return the key for the current entry. The underlying storage for
// the returned slice is valid only until the next modification of
// the iterator.
// REQUIRES: Valid()
virtual Slice key() const = 0;
// Return user key for the current entry.
// REQUIRES: Valid()
virtual Slice user_key() const { return ExtractUserKey(key()); }
// Return the value for the current entry. The underlying storage for
// the returned slice is valid only until the next modification of
// the iterator.
// REQUIRES: Valid()
virtual TValue value() const = 0;
// If an error has occurred, return it. Else return an ok status.
// If non-blocking IO is requested and this operation cannot be
// satisfied without doing some IO, then this returns Status::Incomplete().
virtual Status status() const = 0;
// True if the iterator is invalidated because it reached a key that is above
// the iterator upper bound. Used by LevelIterator to decide whether it should
// stop or move on to the next file.
// Important: if iterator reached the end of the file without encountering any
// keys above the upper bound, IsOutOfBound() must return false.
virtual bool IsOutOfBound() { return false; }
// Pass the PinnedIteratorsManager to the Iterator, most Iterators dont
// communicate with PinnedIteratorsManager so default implementation is no-op
// but for Iterators that need to communicate with PinnedIteratorsManager
// they will implement this function and use the passed pointer to communicate
// with PinnedIteratorsManager.
virtual void SetPinnedItersMgr(PinnedIteratorsManager* /*pinned_iters_mgr*/) {
}
// If true, this means that the Slice returned by key() is valid as long as
// PinnedIteratorsManager::ReleasePinnedData is not called and the
// Iterator is not deleted.
//
// IsKeyPinned() is guaranteed to always return true if
// - Iterator is created with ReadOptions::pin_data = true
// - DB tables were created with BlockBasedTableOptions::use_delta_encoding
// set to false.
virtual bool IsKeyPinned() const { return false; }
// If true, this means that the Slice returned by value() is valid as long as
// PinnedIteratorsManager::ReleasePinnedData is not called and the
// Iterator is not deleted.
virtual bool IsValuePinned() const { return false; }
virtual Status GetProperty(std::string /*prop_name*/, std::string* /*prop*/) {
return Status::NotSupported("");
}
bool is_mutable() const { return is_mutable_; }
protected:
void SeekForPrevImpl(const Slice& target, const Comparator* cmp) {
Seek(target);
if (!Valid()) {
SeekToLast();
}
while (Valid() && cmp->Compare(target, key()) < 0) {
Prev();
}
}
bool is_mutable_;
private:
// No copying allowed
InternalIteratorBase(const InternalIteratorBase&) = delete;
InternalIteratorBase& operator=(const InternalIteratorBase&) = delete;
};
using InternalIterator = InternalIteratorBase<Slice>;
// Return an empty iterator (yields nothing).
template <class TValue = Slice>
extern InternalIteratorBase<TValue>* NewEmptyInternalIterator();
// Return an empty iterator with the specified status.
template <class TValue = Slice>
extern InternalIteratorBase<TValue>* NewErrorInternalIterator(
const Status& status);
// Return an empty iterator with the specified status, allocated arena.
template <class TValue = Slice>
extern InternalIteratorBase<TValue>* NewErrorInternalIterator(
const Status& status, Arena* arena);
} // namespace rocksdb