Improve performance of long range scans with readahead

Summary:
This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead and prefetching additional data on each disk IO. Prefetching is automatically enabled once more than 2 IOs have been issued for the same table file during an iteration. The readahead size starts at 8 KB and is doubled on each additional sequential IO, up to a maximum of 256 KB. This cuts down the number of IOs needed to complete the range scan.
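
As a rough illustration (a standalone sketch, not the patch itself), the growth schedule works out as follows, assuming one prefetch per additional sequential IO:
```
#include <algorithm>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t kInitReadaheadSize = 8 * 1024;   // 8 KB starting size
  const size_t kMaxReadaheadSize = 256 * 1024;  // 256 KB cap
  size_t readahead_size = kInitReadaheadSize;
  for (int io = 1; io <= 8; ++io) {
    // Clamp to the cap, issue the prefetch, then double for the next IO,
    // mirroring the order of operations in the patch below.
    size_t issued = std::min(kMaxReadaheadSize, readahead_size);
    std::printf("prefetch #%d: %zu KB\n", io, issued / 1024);
    readahead_size = issued * 2;
  }
  // Prints: 8, 16, 32, 64, 128, 256, 256, 256 KB.
  return 0;
}
```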

Constraints:
- The prefetched data is stored by the OS in the page cache, so this currently works only for non-direct-I/O use cases, i.e. applications that go through the page cache. (Direct I/O support will be enabled in a later PR.)
- This is currently enabled only when ReadOptions.readahead_size = 0 (which is the default value); see the usage sketch below.
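
For reference, a minimal sketch of the intended use case, assuming a pre-populated database at a placeholder path (the path and seek key below are illustrative, not from this change): the iterator is opened with a default-constructed ReadOptions, so readahead_size stays 0 and the automatic readahead is eligible to kick in during the scan.
```
#include <cassert>
#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  // Placeholder path; the DB is assumed to already hold a large key range.
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/readahead_demo", &db);
  assert(s.ok());

  // Leave readahead_size at its default of 0 so the automatic readahead
  // described above is eligible to kick in after the first couple of IOs.
  rocksdb::ReadOptions read_options;
  assert(read_options.readahead_size == 0);

  rocksdb::Iterator* it = db->NewIterator(read_options);
  for (it->Seek("start_key"); it->Valid(); it->Next()) {
    // Long range scan: sequential block reads on the same table file
    // trigger progressively larger prefetches (8 KB doubling up to 256 KB).
  }
  assert(it->status().ok());
  delete it;
  delete db;
  return 0;
}
```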

Thanks to siying for the original idea and implementation.

**Benchmarks:**
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: seekrandom with a large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

Page cache was cleared before each experiment with the command:
```
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
```
```
Before:
seekrandom   :   34020.945 micros/op 29 ops/sec;   32.5 MB/s (1636 of 1999 found)
With this change:
seekrandom   :    8726.912 micros/op 114 ops/sec;  126.8 MB/s (5702 of 6999 found)
```
~3.9x performance improvement (29 → 114 ops/sec).

Also verified with strace and gdb that the readahead size increases as expected:
```
strace -e readahead -f -T -t -p <db_bench process pid>
```
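
On Linux, the POSIX Prefetch implementation is expected to be serviced by the readahead(2) syscall, which is why the trace filters on it. A minimal illustration of that syscall (the file name and sizes are placeholders, not taken from the patch):
```
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <unistd.h>

int main() {
  // Placeholder file name; in RocksDB this would be an already-open .sst file.
  int fd = open("somefile.sst", O_RDONLY);
  if (fd < 0) {
    return 1;
  }
  // Ask the kernel to populate the page cache with 8 KB starting at offset 0;
  // the data lands in the OS page cache, matching the constraint noted above.
  readahead(fd, /*offset=*/0, /*count=*/8 * 1024);
  close(fd);
  return 0;
}
```
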
Closes https://github.com/facebook/rocksdb/pull/3282

Differential Revision: D6586477

Pulled By: sagar0

fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c
Author: Sagar Vemuri
Date: 2018-01-25 21:34:35 -08:00
Commit: df23e80e5f (parent: 1fe7db5fb8)
3 changed files with 34 additions and 0 deletions

```
@@ -9,6 +9,7 @@
 * Add a new histogram stat called rocksdb.db.flush.micros for memtable flush.
 * Add "--use_txn" option to use transactional API in db_stress.
 * Disable onboard cache for compaction output in Windows platform.
+* Improve the performance of iterators doing long range scans by using readahead.
 ### Bug Fixes
 * Fix a stack-use-after-scope bug in ForwardIterator.
```

```
@@ -1594,6 +1594,9 @@ BlockBasedTable::BlockEntryIteratorState::BlockEntryIteratorState(
       is_index_(is_index),
       block_map_(block_map) {}
 
+const size_t BlockBasedTable::BlockEntryIteratorState::kMaxReadaheadSize =
+    256 * 1024;
+
 InternalIterator*
 BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator(
     const Slice& index_value) {
@@ -1618,6 +1621,28 @@ BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator(
           &rep->internal_comparator, nullptr, true, rep->ioptions.statistics);
     }
   }
+
+  // Automatically prefetch additional data when a range scan (iterator) does
+  // more than 2 sequential IOs. This is enabled only when
+  // ReadOptions.readahead_size is 0.
+  if (read_options_.readahead_size == 0) {
+    if (num_file_reads_ < 2) {
+      num_file_reads_++;
+    } else if (handle.offset() + static_cast<size_t>(handle.size()) +
+                   kBlockTrailerSize >
+               readahead_limit_) {
+      num_file_reads_++;
+      // Do not readahead more than kMaxReadaheadSize.
+      readahead_size_ =
+          std::min(BlockBasedTable::BlockEntryIteratorState::kMaxReadaheadSize,
+                   readahead_size_);
+      table_->rep_->file->Prefetch(handle.offset(), readahead_size_);
+      readahead_limit_ = handle.offset() + readahead_size_;
+      // Keep exponentially increasing readahead size until kMaxReadaheadSize.
+      readahead_size_ *= 2;
+    }
+  }
+
   return NewDataBlockIterator(rep, read_options_, handle,
                               /* input_iter */ nullptr, is_index_,
                               /* get_context */ nullptr, s);
```

```
@@ -376,6 +376,14 @@ class BlockBasedTable::BlockEntryIteratorState : public TwoLevelIteratorState {
   bool is_index_;
   std::unordered_map<uint64_t, CachableEntry<Block>>* block_map_;
   port::RWMutex cleaner_mu;
+
+  static const size_t kInitReadaheadSize = 8 * 1024;
+  // Found that 256 KB readahead size provides the best performance, based on
+  // experiments.
+  static const size_t kMaxReadaheadSize;
+  size_t readahead_size_ = kInitReadaheadSize;
+  size_t readahead_limit_ = 0;
+  int num_file_reads_ = 0;
 };
 
 // CachableEntry represents the entries that *may* be fetched from block cache.
```