babe56ddba
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.

`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.

There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).

The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424

Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
  - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
  - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`

Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
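The chunk size reported in the test plan (about 0.1MB per chunk at 1MB/s with a 100ms refill interval) follows from simple arithmetic: a limiter that refills every interval can grant at most `rate * interval` bytes at once, so a larger read has to be split. The sketch below illustrates that arithmetic only. `BytesPerRefill` and `SplitRead` are hypothetical helper names invented for this example; they are not part of RocksDB, and the real limiter may compute its burst size differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a RocksDB API): the number of bytes a rate limiter
// could grant per refill interval, assuming burst = rate * interval.
inline uint64_t BytesPerRefill(uint64_t rate_bytes_per_sec,
                               uint64_t refill_period_us) {
  assert(refill_period_us > 0);
  return rate_bytes_per_sec * refill_period_us / 1000000;
}

// Hypothetical helper (not a RocksDB API): split a read of `n` bytes into
// chunk lengths that never exceed `burst` bytes each.
inline std::vector<size_t> SplitRead(size_t n, size_t burst) {
  assert(burst > 0);
  std::vector<size_t> chunks;
  while (n > 0) {
    const size_t len = std::min(n, burst);
    chunks.push_back(len);
    n -= len;
  }
  return chunks;
}
```

With the benchmark settings above, `BytesPerRefill(1048576, 100000)` gives 104857 bytes, and a 1MiB read would be issued as eleven chunks, the last one only a few bytes long. Each chunk would wait for one refill, which is consistent with the roughly 100ms spacing seen in strace.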
176 lines
6.5 KiB
C++
//  Copyright (c) 2011-present, Facebook, Inc.  All rights reserved.
//  This source code is licensed under both the GPLv2 (found in the
//  COPYING file in the root directory) and Apache 2.0 License
//  (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#pragma once
#include <algorithm>  // std::for_each
#include <atomic>
#include <memory>  // std::unique_ptr, std::shared_ptr
#include <sstream>
#include <string>
#include <vector>  // std::vector

#include "env/file_system_tracer.h"
#include "port/port.h"
#include "rocksdb/file_system.h"
#include "rocksdb/listener.h"
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"
#include "util/aligned_buffer.h"

namespace ROCKSDB_NAMESPACE {
class Statistics;
class HistogramImpl;
class SystemClock;

using AlignedBuf = std::unique_ptr<char[]>;

// Align the request r according to alignment and return the aligned result.
FSReadRequest Align(const FSReadRequest& r, size_t alignment);

// Try to merge src to dest if they have overlap.
//
// Each request represents an inclusive interval [offset, offset + len].
// If the intervals have overlap, update offset and len to represent the
// merged interval, and return true.
// Otherwise, do nothing and return false.
bool TryMerge(FSReadRequest* dest, const FSReadRequest& src);

// RandomAccessFileReader is a wrapper on top of Env::RandomAccessFile. It is
// responsible for:
// - Handling Buffered and Direct reads appropriately.
// - Rate limiting reads according to the caller-provided Env::IOPriority
//   (compaction reads, for example, pass Env::IO_LOW).
// - Notifying any interested listeners on the completion of a read.
// - Updating IO stats.
class RandomAccessFileReader {
 private:
#ifndef ROCKSDB_LITE
  void NotifyOnFileReadFinish(
      uint64_t offset, size_t length,
      const FileOperationInfo::StartTimePoint& start_ts,
      const FileOperationInfo::FinishTimePoint& finish_ts,
      const Status& status) const {
    FileOperationInfo info(FileOperationType::kRead, file_name_, start_ts,
                           finish_ts, status);
    info.offset = offset;
    info.length = length;

    for (auto& listener : listeners_) {
      listener->OnFileReadFinish(info);
    }
    info.status.PermitUncheckedError();
  }

  void NotifyOnIOError(const IOStatus& io_status, FileOperationType operation,
                       const std::string& file_path, size_t length,
                       uint64_t offset) const {
    if (listeners_.empty()) {
      return;
    }
    IOErrorInfo io_error_info(io_status, operation, file_path, length, offset);

    for (auto& listener : listeners_) {
      listener->OnIOError(io_error_info);
    }
    io_status.PermitUncheckedError();
  }

#endif  // ROCKSDB_LITE

  bool ShouldNotifyListeners() const { return !listeners_.empty(); }

  FSRandomAccessFilePtr file_;
  std::string file_name_;
  SystemClock* clock_;
  Statistics* stats_;
  uint32_t hist_type_;
  HistogramImpl* file_read_hist_;
  RateLimiter* rate_limiter_;
  std::vector<std::shared_ptr<EventListener>> listeners_;
  Temperature file_temperature_;

 public:
  explicit RandomAccessFileReader(
      std::unique_ptr<FSRandomAccessFile>&& raf, const std::string& _file_name,
      SystemClock* clock = nullptr,
      const std::shared_ptr<IOTracer>& io_tracer = nullptr,
      Statistics* stats = nullptr, uint32_t hist_type = 0,
      HistogramImpl* file_read_hist = nullptr,
      RateLimiter* rate_limiter = nullptr,
      const std::vector<std::shared_ptr<EventListener>>& listeners = {},
      Temperature file_temperature = Temperature::kUnknown)
      : file_(std::move(raf), io_tracer, _file_name),
        file_name_(std::move(_file_name)),
        clock_(clock),
        stats_(stats),
        hist_type_(hist_type),
        file_read_hist_(file_read_hist),
        rate_limiter_(rate_limiter),
        listeners_(),
        file_temperature_(file_temperature) {
#ifndef ROCKSDB_LITE
    std::for_each(listeners.begin(), listeners.end(),
                  [this](const std::shared_ptr<EventListener>& e) {
                    if (e->ShouldBeNotifiedOnFileIO()) {
                      listeners_.emplace_back(e);
                    }
                  });
#else  // !ROCKSDB_LITE
    (void)listeners;
#endif
  }

  static IOStatus Create(const std::shared_ptr<FileSystem>& fs,
                         const std::string& fname, const FileOptions& file_opts,
                         std::unique_ptr<RandomAccessFileReader>* reader,
                         IODebugContext* dbg);
  RandomAccessFileReader(const RandomAccessFileReader&) = delete;
  RandomAccessFileReader& operator=(const RandomAccessFileReader&) = delete;

  // In non-direct IO mode,
  // 1. if using mmap, result is stored in a buffer other than scratch;
  // 2. if not using mmap, result is stored in the buffer starting from scratch.
  //
  // In direct IO mode, an aligned buffer is allocated internally.
  // 1. If aligned_buf is null, then results are copied to the buffer
  //    starting from scratch;
  // 2. Otherwise, scratch is not used and can be null, the aligned_buf owns
  //    the internally allocated buffer on return, and the result refers to a
  //    region in aligned_buf.
  //
  // `rate_limiter_priority` is used to charge the internal rate limiter when
  // enabled. The special value `Env::IO_TOTAL` makes this operation bypass the
  // rate limiter.
  IOStatus Read(const IOOptions& opts, uint64_t offset, size_t n, Slice* result,
                char* scratch, AlignedBuf* aligned_buf,
                Env::IOPriority rate_limiter_priority) const;

  // REQUIRES:
  // num_reqs > 0, reqs do not overlap, and offsets in reqs are increasing.
  // In non-direct IO mode, aligned_buf should be null;
  // In direct IO mode, aligned_buf stores the aligned buffer allocated inside
  // MultiRead, the result Slices in reqs refer to aligned_buf.
  //
  // `rate_limiter_priority` will be used to charge the internal rate limiter.
  // It is not yet supported so the client must provide the special value
  // `Env::IO_TOTAL` to bypass the rate limiter.
  IOStatus MultiRead(const IOOptions& opts, FSReadRequest* reqs,
                     size_t num_reqs, AlignedBuf* aligned_buf,
                     Env::IOPriority rate_limiter_priority) const;

  IOStatus Prefetch(uint64_t offset, size_t n) const {
    return file_->Prefetch(offset, n, IOOptions(), nullptr);
  }

  FSRandomAccessFile* file() { return file_.get(); }

  const std::string& file_name() const { return file_name_; }

  bool use_direct_io() const { return file_->use_direct_io(); }

  IOStatus PrepareIOOptions(const ReadOptions& ro, IOOptions& opts);
};
}  // namespace ROCKSDB_NAMESPACE
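The header only declares `Align()` and `TryMerge()`, so their documented contracts may be easier to follow with a small stand-alone illustration. The sketch below uses a hypothetical `Req` struct with just `offset` and `len` (the real `FSReadRequest` has more fields), and the functions `AlignReq` and `TryMergeReq` are assumptions written to match the comments above, not the actual RocksDB implementation. It also assumes the alignment is a power of two.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for FSReadRequest, keeping only the fields the
// comments on Align() and TryMerge() talk about.
struct Req {
  uint64_t offset;
  size_t len;
};

// Sketch of aligning a request: round the start down and the end up to a
// multiple of `alignment` (assumed here to be a power of two).
inline Req AlignReq(const Req& r, size_t alignment) {
  assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
  const uint64_t mask = static_cast<uint64_t>(alignment) - 1;
  const uint64_t start = r.offset & ~mask;
  const uint64_t end = (r.offset + r.len + mask) & ~mask;
  return Req{start, static_cast<size_t>(end - start)};
}

// Sketch of merging: treat each request as the inclusive interval
// [offset, offset + len], so requests that merely touch also merge.
// On overlap, widen `dest` to cover both and return true; otherwise leave
// `dest` unchanged and return false.
inline bool TryMergeReq(Req* dest, const Req& src) {
  const uint64_t dest_end = dest->offset + dest->len;
  const uint64_t src_end = src.offset + src.len;
  if (dest_end < src.offset || src_end < dest->offset) {
    return false;
  }
  const uint64_t start = std::min(dest->offset, src.offset);
  const uint64_t end = std::max(dest_end, src_end);
  dest->offset = start;
  dest->len = static_cast<size_t>(end - start);
  return true;
}
```

For example, a 100-byte request at offset 5000 aligned to 4096 would become one 4096-byte request at offset 4096. Two aligned requests that end and begin at the same offset would merge into one, which is why a direct IO path benefits from merging after alignment.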