// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a thin layer for better portability. Less platform-dependent code should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Envs in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#pragma once
#include <algorithm>
#include <atomic>
RocksDB Trace Analyzer (#4091)
Summary:
A framework of trace analyzing for RocksDB
After collecting a trace with the tool from [PR #3837](https://github.com/facebook/rocksdb/pull/3837), users can use the Trace Analyzer to interpret, analyze, and characterize the collected workload.
**Input:**
1. trace file
2. Whole keys space file
**Statistics:**
1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family.
2. Key hotness (access count) of each one
3. Key space separation based on given prefix
4. Key size distribution
5. Value size distribution if applicable
6. Top K accessed keys
7. QPS statistics including the average QPS and peak QPS
8. Top K accessed prefix
9. Query correlation analysis: outputs the number of X after Y and the corresponding average time intervals
**Output:**
1. key access heat map (either in the accessed key space or whole key space)
2. trace sequence file (interpret the raw trace file to line base text file for future use)
3. Time series (the key space ID and its access time)
4. Key access count distribution
5. Key size distribution
6. Value size distribution (in each intervals)
7. whole key space separation by the prefix
8. Accessed key space separation by the prefix
9. QPS of each operation and each column family
10. Top K QPS and their accessed prefix range
**Test:**
1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge
2. Generated the trace and analyze the trace
**Implemented but not tested (due to the limitation of trace_replay):**
1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing
2. Analyzing the number of Key found by Get
**Future Work:**
1. Support execution time analysis of each request
2. Support cache hit situation and block read situation of Get
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091
Differential Revision: D9256157
Pulled By: zhichao-cao
fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6
2018-08-13 20:32:04 +02:00
#include <memory>
#include <sstream>
#include <string>
#include <vector>
#include "port/port.h"
#include "rocksdb/env.h"
#include "rocksdb/listener.h"
#include "rocksdb/rate_limiter.h"
#include "util/aligned_buffer.h"
#include "util/sync_point.h"

namespace rocksdb {

class Statistics;
Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench, print out "rocksdb.dbstats" by hand, and make sure it prints as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
2015-08-13 23:35:54 +02:00
class HistogramImpl;

std::unique_ptr<RandomAccessFile> NewReadaheadRandomAccessFile(
    std::unique_ptr<RandomAccessFile>&& file, size_t readahead_size);
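
The declaration above only names the readahead wrapper; its implementation is not shown here. As a rough illustration of the general idea, the sketch below serves small reads from a buffer that is filled in larger chunks. `InMemoryFile` and `ReadaheadReader` are hypothetical stand-ins written with the standard library only. They are not RocksDB types, and the real wrapper may handle alignment, direct I/O, and thread safety differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a random-access file. It counts how many times
// the underlying "device" is read so the effect of readahead is visible.
class InMemoryFile {
 public:
  explicit InMemoryFile(std::string data) : data_(std::move(data)) {}

  // Copies up to n bytes starting at offset into *out; returns bytes read.
  size_t Read(size_t offset, size_t n, std::string* out) {
    ++read_calls_;
    out->clear();
    if (offset >= data_.size()) return 0;
    *out = data_.substr(offset, n);
    return out->size();
  }

  int read_calls() const { return read_calls_; }

 private:
  std::string data_;
  int read_calls_ = 0;
};

// Hypothetical readahead wrapper: on a buffer miss it fetches at least
// readahead_size bytes, then answers later reads from the buffer when the
// requested range is fully covered.
class ReadaheadReader {
 public:
  ReadaheadReader(InMemoryFile* file, size_t readahead_size)
      : file_(file), readahead_size_(readahead_size) {
    assert(file_ != nullptr);
  }

  std::string Read(size_t offset, size_t n) {
    // Hit: the whole range [offset, offset + n) is already buffered.
    if (offset >= buf_offset_ && offset + n <= buf_offset_ + buf_.size()) {
      return buf_.substr(offset - buf_offset_, n);
    }
    // Miss: refill the buffer starting at offset.
    file_->Read(offset, std::max(n, readahead_size_), &buf_);
    buf_offset_ = offset;
    return buf_.substr(0, n);
  }

 private:
  InMemoryFile* file_;
  size_t readahead_size_;
  size_t buf_offset_ = 0;
  std::string buf_;
};
```

With a readahead size of 8, two consecutive 4-byte reads starting at offset 0 reach the underlying file only once in this sketch.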

class SequentialFileReader {
 private:
  std::unique_ptr<SequentialFile> file_;
  std::string file_name_;
  std::atomic<size_t> offset_;  // read offset

 public:
  explicit SequentialFileReader(std::unique_ptr<SequentialFile>&& _file,
                                const std::string& _file_name)
      : file_(std::move(_file)), file_name_(_file_name), offset_(0) {}

  SequentialFileReader(SequentialFileReader&& o) ROCKSDB_NOEXCEPT {
    *this = std::move(o);
  }

  SequentialFileReader& operator=(SequentialFileReader&& o) ROCKSDB_NOEXCEPT {
    file_ = std::move(o.file_);
    return *this;
  }

  SequentialFileReader(const SequentialFileReader&) = delete;
  SequentialFileReader& operator=(const SequentialFileReader&) = delete;

  Status Read(size_t n, Slice* result, char* scratch);

  Status Skip(uint64_t n);

  void Rewind();

  SequentialFile* file() { return file_.get(); }

  std::string file_name() { return file_name_; }

  bool use_direct_io() const { return file_->use_direct_io(); }
};
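
`SequentialFileReader` pairs a sequential file with an atomic read offset. The sketch below shows one way a wrapper could keep such an offset in step with reads and skips. `StringSource` and `OffsetTrackingReader` are hypothetical names used only for illustration; what the real `Read`, `Skip`, and `Rewind` do is defined in the implementation file, which is not part of this header.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical sequential source backed by a string.
class StringSource {
 public:
  explicit StringSource(std::string data) : data_(std::move(data)) {}

  // Reads up to n bytes from the current position and advances past them.
  std::string Read(size_t n) {
    assert(pos_ <= data_.size());
    std::string out = data_.substr(pos_, n);
    pos_ += out.size();
    return out;
  }

  // Advances by up to n bytes; returns how many bytes were skipped.
  size_t Skip(size_t n) {
    assert(pos_ <= data_.size());
    size_t skipped = std::min(n, data_.size() - pos_);
    pos_ += skipped;
    return skipped;
  }

 private:
  std::string data_;
  size_t pos_ = 0;
};

// Hypothetical wrapper that owns the source and tracks how far it has
// advanced, so callers can query the offset without touching the source.
class OffsetTrackingReader {
 public:
  explicit OffsetTrackingReader(StringSource source)
      : source_(std::move(source)), offset_(0) {}

  std::string Read(size_t n) {
    std::string out = source_.Read(n);
    offset_.fetch_add(out.size());
    return out;
  }

  void Skip(size_t n) { offset_.fetch_add(source_.Skip(n)); }

  size_t offset() const { return offset_.load(); }

 private:
  StringSource source_;
  std::atomic<size_t> offset_;
};
```

The offset advances by the number of bytes actually consumed, not the number requested, so a short read near the end of the data leaves the offset at the data size.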

class RandomAccessFileReader {
 private:
#ifndef ROCKSDB_LITE
  void NotifyOnFileReadFinish(uint64_t offset, size_t length,
                              const FileOperationInfo::TimePoint& start_ts,
                              const FileOperationInfo::TimePoint& finish_ts,
                              const Status& status) const {
    FileOperationInfo info(file_name_, start_ts, finish_ts);
    info.offset = offset;
    info.length = length;
    info.status = status;

    for (auto& listener : listeners_) {
      listener->OnFileReadFinish(info);
    }
  }
#endif  // ROCKSDB_LITE

  bool ShouldNotifyListeners() const { return !listeners_.empty(); }

  std::unique_ptr<RandomAccessFile> file_;
  std::string file_name_;
  Env* env_;
  Statistics* stats_;
  uint32_t hist_type_;
  HistogramImpl* file_read_hist_;
  RateLimiter* rate_limiter_;
  bool for_compaction_;
  std::vector<std::shared_ptr<EventListener>> listeners_;

 public:
  explicit RandomAccessFileReader(
      std::unique_ptr<RandomAccessFile>&& raf, std::string _file_name,
      Env* env = nullptr, Statistics* stats = nullptr, uint32_t hist_type = 0,
      HistogramImpl* file_read_hist = nullptr,
      RateLimiter* rate_limiter = nullptr, bool for_compaction = false,
      const std::vector<std::shared_ptr<EventListener>>& listeners = {})
      : file_(std::move(raf)),
        file_name_(std::move(_file_name)),
        env_(env),
        stats_(stats),
        hist_type_(hist_type),
        file_read_hist_(file_read_hist),
        rate_limiter_(rate_limiter),
        for_compaction_(for_compaction),
        listeners_() {
#ifndef ROCKSDB_LITE
    std::for_each(listeners.begin(), listeners.end(),
                  [this](const std::shared_ptr<EventListener>& e) {
                    if (e->ShouldBeNotifiedOnFileIO()) {
                      listeners_.emplace_back(e);
                    }
                  });
#else  // !ROCKSDB_LITE
    (void)listeners;
#endif
  }

  RandomAccessFileReader(RandomAccessFileReader&& o) ROCKSDB_NOEXCEPT {
    *this = std::move(o);
  }

  RandomAccessFileReader& operator=(RandomAccessFileReader&& o)
      ROCKSDB_NOEXCEPT {
    file_ = std::move(o.file_);
    env_ = std::move(o.env_);
    stats_ = std::move(o.stats_);
    hist_type_ = std::move(o.hist_type_);
    file_read_hist_ = std::move(o.file_read_hist_);
    rate_limiter_ = std::move(o.rate_limiter_);
    for_compaction_ = std::move(o.for_compaction_);
    return *this;
  }

  RandomAccessFileReader(const RandomAccessFileReader&) = delete;
  RandomAccessFileReader& operator=(const RandomAccessFileReader&) = delete;

  Status Read(uint64_t offset, size_t n, Slice* result, char* scratch) const;

  Status Prefetch(uint64_t offset, size_t n) const {
    return file_->Prefetch(offset, n);
  }

  RandomAccessFile* file() { return file_.get(); }

  std::string file_name() const { return file_name_; }

  bool use_direct_io() const { return file_->use_direct_io(); }
};
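
The constructor above keeps only the listeners that opt in to file I/O notifications, and `NotifyOnFileReadFinish` reports the offset, length, and timing of a read to each of them. The sketch below mirrors that pattern with hypothetical stand-in types (`IoInfo`, `IoListener`, `CountingListener`, `NotifyingReader`); none of them are RocksDB classes. Taking timestamps only when at least one listener is registered is an assumption of this sketch, not something the header confirms.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record describing one finished read.
struct IoInfo {
  std::string path;
  uint64_t offset = 0;
  size_t length = 0;
  std::chrono::steady_clock::duration elapsed{};
};

// Hypothetical listener interface; listeners are ignored unless they opt in.
class IoListener {
 public:
  virtual ~IoListener() = default;
  virtual bool WantsFileIo() const { return false; }
  virtual void OnReadFinish(const IoInfo& /*info*/) {}
};

// Example listener that records how often it was called and the last info.
class CountingListener : public IoListener {
 public:
  explicit CountingListener(bool opt_in) : opt_in_(opt_in) {}
  bool WantsFileIo() const override { return opt_in_; }
  void OnReadFinish(const IoInfo& info) override {
    ++calls;
    last = info;
  }

  int calls = 0;
  IoInfo last;

 private:
  bool opt_in_;
};

// Hypothetical reader over in-memory data that notifies opted-in listeners.
class NotifyingReader {
 public:
  NotifyingReader(std::string path, std::string data,
                  const std::vector<std::shared_ptr<IoListener>>& listeners)
      : path_(std::move(path)), data_(std::move(data)) {
    // Filter once at construction so the read path only checks for emptiness.
    for (const auto& l : listeners) {
      if (l && l->WantsFileIo()) listeners_.push_back(l);
    }
  }

  std::string Read(uint64_t offset, size_t n) const {
    const bool notify = !listeners_.empty();
    std::chrono::steady_clock::time_point start{};
    if (notify) start = std::chrono::steady_clock::now();

    std::string out;
    if (offset < data_.size()) {
      out = data_.substr(static_cast<size_t>(offset), n);
    }

    if (notify) {
      IoInfo info;
      info.path = path_;
      info.offset = offset;
      info.length = out.size();
      info.elapsed = std::chrono::steady_clock::now() - start;
      for (const auto& l : listeners_) l->OnReadFinish(info);
    }
    return out;
  }

 private:
  std::string path_;
  std::string data_;
  std::vector<std::shared_ptr<IoListener>> listeners_;
};
```

Filtering at construction keeps the per-read cost for the common no-listener case down to a single emptiness check.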

// Use posix write to write data to a file.
class WritableFileWriter {
 private:
#ifndef ROCKSDB_LITE
  void NotifyOnFileWriteFinish(uint64_t offset, size_t length,
                               const FileOperationInfo::TimePoint& start_ts,
                               const FileOperationInfo::TimePoint& finish_ts,
                               const Status& status) {
    FileOperationInfo info(file_name_, start_ts, finish_ts);
    info.offset = offset;
    info.length = length;
    info.status = status;

    for (auto& listener : listeners_) {
      listener->OnFileWriteFinish(info);
    }
  }
#endif  // ROCKSDB_LITE

  bool ShouldNotifyListeners() const { return !listeners_.empty(); }

  std::unique_ptr<WritableFile> writable_file_;
  std::string file_name_;
  AlignedBuffer buf_;
  size_t max_buffer_size_;
  // Actually written data size can be used for truncate
  // not counting padding data
  uint64_t filesize_;
#ifndef ROCKSDB_LITE
  // This is necessary when we use unbuffered access
  // and writes must happen on aligned offsets
  // so we need to go back and write that page again
  uint64_t next_write_offset_;
#endif  // ROCKSDB_LITE
  bool pending_sync_;
  uint64_t last_sync_size_;
  uint64_t bytes_per_sync_;
  RateLimiter* rate_limiter_;
  Statistics* stats_;
  std::vector<std::shared_ptr<EventListener>> listeners_;

 public:
  WritableFileWriter(
      std::unique_ptr<WritableFile>&& file, const std::string& _file_name,
      const EnvOptions& options, Statistics* stats = nullptr,
      const std::vector<std::shared_ptr<EventListener>>& listeners = {})
      : writable_file_(std::move(file)),
        file_name_(_file_name),
        buf_(),
        max_buffer_size_(options.writable_file_max_buffer_size),
        filesize_(0),
#ifndef ROCKSDB_LITE
        next_write_offset_(0),
#endif  // ROCKSDB_LITE
        pending_sync_(false),
        last_sync_size_(0),
        bytes_per_sync_(options.bytes_per_sync),
        rate_limiter_(options.rate_limiter),
        stats_(stats),
        listeners_() {
    TEST_SYNC_POINT_CALLBACK("WritableFileWriter::WritableFileWriter:0",
                             reinterpret_cast<void*>(max_buffer_size_));
    buf_.Alignment(writable_file_->GetRequiredBufferAlignment());
    buf_.AllocateNewBuffer(std::min((size_t)65536, max_buffer_size_));
#ifndef ROCKSDB_LITE
    std::for_each(listeners.begin(), listeners.end(),
                  [this](const std::shared_ptr<EventListener>& e) {
                    if (e->ShouldBeNotifiedOnFileIO()) {
                      listeners_.emplace_back(e);
                    }
                  });
#else  // !ROCKSDB_LITE
    (void)listeners;
#endif
  }

  WritableFileWriter(const WritableFileWriter&) = delete;

  WritableFileWriter& operator=(const WritableFileWriter&) = delete;

  ~WritableFileWriter() { Close(); }

  std::string file_name() const { return file_name_; }

  Status Append(const Slice& data);

  Status Pad(const size_t pad_bytes);

  Status Flush();

  Status Close();

  Status Sync(bool use_fsync);

[wal changes 3/3] method in DB to sync WAL without blocking writers
Summary:
Subj. We really need this feature.
Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.
Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.
Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba
Differential Revision: https://reviews.facebook.net/D40905
2015-08-05 15:06:39 +02:00
|
|
|
// Sync only the data that was already Flush()ed. Safe to call concurrently
|
|
|
|
// with Append() and Flush(). If !writable_file_->IsSyncThreadSafe(),
|
|
|
|
// returns NotSupported status.
|
|
|
|
Status SyncWithoutFlush(bool use_fsync);
|
|
|
|
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
|
|
|
uint64_t GetFileSize() { return filesize_; }
|
|
|
|
|
|
|
|
Status InvalidateCache(size_t offset, size_t length) {
|
|
|
|
return writable_file_->InvalidateCache(offset, length);
|
|
|
|
}
|
|
|
|
|
|
|
|
WritableFile* writable_file() const { return writable_file_.get(); }
|
|
|
|
|
2017-01-13 21:01:08 +01:00
|
|
|
bool use_direct_io() { return writable_file_->use_direct_io(); }
|
2017-01-12 01:42:07 +01:00
|
|
|
|
2018-05-14 19:53:32 +02:00
|
|
|
bool TEST_BufferIsEmpty() { return buf_.CurrentSize() == 0; }
|
|
|
|
|
2015-07-18 01:16:11 +02:00
|
|
|
private:
|
2015-09-11 18:57:02 +02:00
|
|
|
  // Used when OS buffering is OFF and we are writing
|
2016-12-22 21:51:29 +01:00
|
|
|
// DMA such as in Direct I/O mode
|
2017-02-16 19:25:06 +01:00
|
|
|
#ifndef ROCKSDB_LITE
|
2016-12-22 21:51:29 +01:00
|
|
|
Status WriteDirect();
|
2017-02-16 19:25:06 +01:00
|
|
|
#endif // !ROCKSDB_LITE
|
2015-09-11 18:57:02 +02:00
|
|
|
// Normal write
|
|
|
|
Status WriteBuffered(const char* data, size_t size);
|
2015-11-11 02:03:42 +01:00
|
|
|
Status RangeSync(uint64_t offset, uint64_t nbytes);
|
2015-08-05 15:06:39 +02:00
|
|
|
Status SyncInternal(bool use_fsync);
|
2015-07-18 01:16:11 +02:00
|
|
|
};
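The difference between `Sync()` and `SyncWithoutFlush()` documented in the class above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyWriter` and `ToyStatus` are hypothetical names, the "file" is an in-memory string, `Sync()` is assumed to flush before syncing, and no real concurrency is modelled. It is not the RocksDB implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for rocksdb::Status, reduced to the two outcomes used here.
enum class ToyStatus { kOk, kNotSupported };

// Hypothetical toy writer: models buffered, flushed, and synced bytes only.
class ToyWriter {
 public:
  explicit ToyWriter(bool sync_thread_safe)
      : sync_thread_safe_(sync_thread_safe) {}

  // Data lands in the in-memory write buffer first.
  void Append(const std::string& data) { buffer_ += data; }

  // Moves buffered data to the (simulated) file.
  ToyStatus Flush() {
    flushed_ += buffer_;
    buffer_.clear();
    return ToyStatus::kOk;
  }

  // Assumed behaviour: flush everything, then make all of it durable.
  ToyStatus Sync() {
    Flush();
    synced_bytes_ = flushed_.size();
    return ToyStatus::kOk;
  }

  // Makes durable only what was already flushed; the buffer is left untouched.
  ToyStatus SyncWithoutFlush() {
    if (!sync_thread_safe_) return ToyStatus::kNotSupported;
    synced_bytes_ = flushed_.size();
    return ToyStatus::kOk;
  }

  std::size_t synced_bytes() const { return synced_bytes_; }
  std::size_t buffered_bytes() const { return buffer_.size(); }

 private:
  bool sync_thread_safe_;
  std::string buffer_;
  std::string flushed_;
  std::size_t synced_bytes_ = 0;
};
```

In this model, appending "abc", flushing, appending "de", and then calling `SyncWithoutFlush()` leaves three bytes synced and two still buffered; a later `Sync()` covers all five.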
|
2015-10-16 23:33:47 +02:00
|
|
|
|
Improve direct IO range scan performance with readahead (#3884)
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.
**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.
**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure not to re-read from the device partial chunks of data that were already available in the buffer.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.
**Constraints:**
- Similar to #3282, this is currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.
**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: Seekrandom with a large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```
```
Before:
seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement, averaged over 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884
Differential Revision: D8082143
Pulled By: sagar0
fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
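The readahead growth described in the summary above (start at 8 KB, double on each additional sequential IO, cap at 256 KB) can be sketched as follows. `NextReadaheadSize` and `ReadaheadSchedule` are hypothetical helpers written for illustration; they are not functions in RocksDB, and the real logic lives inside `FilePrefetchBuffer`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the readahead size to use for the next sequential IO.
// Doubles `current` and caps the result at `max_size`.
inline std::size_t NextReadaheadSize(std::size_t current, std::size_t max_size) {
  // The header comment requires max_readahead_size >= readahead_size.
  assert(current <= max_size);
  // Comparing against max_size / 2 avoids overflow when doubling.
  if (current > max_size / 2) return max_size;
  return current * 2;
}

// Hypothetical helper: the readahead sizes used by `num_ios` consecutive IOs.
inline std::vector<std::size_t> ReadaheadSchedule(std::size_t initial,
                                                  std::size_t max_size,
                                                  int num_ios) {
  std::vector<std::size_t> sizes;
  std::size_t current = initial;
  for (int i = 0; i < num_ios; ++i) {
    sizes.push_back(current);
    current = NextReadaheadSize(current, max_size);
  }
  return sizes;
}
```

With an initial size of 8 KB and a cap of 256 KB, the schedule reaches the cap on the sixth IO and stays there, which is why long scans need far fewer IOs than with a fixed small read size.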
2018-06-21 20:02:49 +02:00
|
|
|
// FilePrefetchBuffer can automatically do the readahead if file_reader,
|
|
|
|
// readahead_size, and max_readahead_size are passed in.
|
|
|
|
// max_readahead_size should be greater than or equal to readahead_size.
|
|
|
|
// readahead_size will be doubled on every IO, until max_readahead_size.
|
2017-08-11 20:59:13 +02:00
|
|
|
class FilePrefetchBuffer {
|
|
|
|
public:
|
2018-07-20 23:31:27 +02:00
|
|
|
// If `track_min_offset` is true, track minimum offset ever read.
|
2018-06-21 20:02:49 +02:00
|
|
|
FilePrefetchBuffer(RandomAccessFileReader* file_reader = nullptr,
|
2018-07-20 23:31:27 +02:00
|
|
|
size_t readadhead_size = 0, size_t max_readahead_size = 0,
|
|
|
|
bool enable = true, bool track_min_offset = false)
|
2018-06-21 20:02:49 +02:00
|
|
|
: buffer_offset_(0),
|
|
|
|
file_reader_(file_reader),
|
|
|
|
readahead_size_(readadhead_size),
|
2018-07-20 23:31:27 +02:00
|
|
|
max_readahead_size_(max_readahead_size),
|
|
|
|
min_offset_read_(port::kMaxSizet),
|
|
|
|
enable_(enable),
|
|
|
|
track_min_offset_(track_min_offset) {}
|
2017-08-11 20:59:13 +02:00
|
|
|
Status Prefetch(RandomAccessFileReader* reader, uint64_t offset, size_t n);
|
2018-06-21 20:02:49 +02:00
|
|
|
bool TryReadFromCache(uint64_t offset, size_t n, Slice* result);
|
2017-08-11 20:59:13 +02:00
|
|
|
|
2018-07-20 23:31:27 +02:00
|
|
|
  // The minimum `offset` ever passed to TryReadFromCache(). Only tracked
|
|
|
|
// if track_min_offset = true.
|
|
|
|
size_t min_offset_read() const { return min_offset_read_; }
|
|
|
|
|
2017-08-11 20:59:13 +02:00
|
|
|
private:
|
|
|
|
AlignedBuffer buffer_;
|
|
|
|
uint64_t buffer_offset_;
|
2018-06-21 20:02:49 +02:00
|
|
|
RandomAccessFileReader* file_reader_;
|
|
|
|
size_t readahead_size_;
|
|
|
|
size_t max_readahead_size_;
|
2018-07-20 23:31:27 +02:00
|
|
|
// The minimum `offset` ever passed to TryReadFromCache().
|
|
|
|
size_t min_offset_read_;
|
|
|
|
  // If false, TryReadFromCache() always returns false, and we only take stats
|
|
|
|
// for track_min_offset_ if track_min_offset_ = true
|
|
|
|
bool enable_;
|
|
|
|
// If true, track minimum `offset` ever passed to TryReadFromCache(), which
|
|
|
|
// can be fetched from min_offset_read().
|
|
|
|
bool track_min_offset_;
|
2017-08-11 20:59:13 +02:00
|
|
|
};
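How `enable` and `track_min_offset` interact, as documented in the member comments above, can be shown with a small model. This is a sketch only: `ToyPrefetchBuffer` and its `Fill()` method are hypothetical, and the cache is a plain `std::string` rather than an `AlignedBuffer` filled through a `RandomAccessFileReader`.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical toy model of the enable / track_min_offset flags.
class ToyPrefetchBuffer {
 public:
  ToyPrefetchBuffer(bool enable, bool track_min_offset)
      : enable_(enable), track_min_offset_(track_min_offset) {}

  // Stand-in for Prefetch(): caches `data` as the bytes starting at `offset`.
  void Fill(std::uint64_t offset, const std::string& data) {
    buffer_offset_ = offset;
    buffer_ = data;
  }

  // Returns true and sets *result only if [offset, offset + n) is cached.
  bool TryReadFromCache(std::uint64_t offset, std::size_t n,
                        std::string* result) {
    // The minimum offset is recorded even when the cache itself is disabled.
    if (track_min_offset_ && offset < min_offset_read_) {
      min_offset_read_ = static_cast<std::size_t>(offset);
    }
    if (!enable_ || offset < buffer_offset_) return false;
    const std::uint64_t rel = offset - buffer_offset_;
    if (rel > buffer_.size() || n > buffer_.size() - rel) return false;
    *result = buffer_.substr(static_cast<std::size_t>(rel), n);
    return true;
  }

  std::size_t min_offset_read() const { return min_offset_read_; }

 private:
  std::string buffer_;
  std::uint64_t buffer_offset_ = 0;
  // Starts at the largest value so that any real offset lowers it.
  std::size_t min_offset_read_ = std::numeric_limits<std::size_t>::max();
  bool enable_;
  bool track_min_offset_;
};
```

A buffer built with `enable = false, track_min_offset = true` never serves a read but still reports the smallest offset it was asked for, which matches the "only take stats" wording in the comment on `enable_`.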
|
|
|
|
|
2015-10-16 23:33:47 +02:00
|
|
|
extern Status NewWritableFile(Env* env, const std::string& fname,
|
2018-11-09 20:17:34 +01:00
|
|
|
std::unique_ptr<WritableFile>* result,
|
2015-10-16 23:33:47 +02:00
|
|
|
const EnvOptions& options);
|
2018-08-13 20:32:04 +02:00
|
|
|
bool ReadOneLine(std::istringstream* iss, SequentialFile* seq_file,
|
|
|
|
std::string* output, bool* has_data, Status* result);
|
|
|
|
|
2015-07-18 01:16:11 +02:00
|
|
|
} // namespace rocksdb
|