7103559f49

Summary: This PR extends the improvements in #3282 to also work when using Direct IO. We see a **4.5X performance improvement** in the seekrandom benchmark doing long range scans, when using direct reads, on flash.

**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead: additional data is prefetched on each disk IO and stored in a local buffer. Prefetching is enabled automatically once more than 2 IOs are noticed for the same table file during iteration. The readahead size starts at 8 KB and is increased exponentially on each additional sequential IO, up to a maximum of 256 KB. This cuts down the number of IOs needed to complete the range scan.

**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure that partial chunks of data already available in the buffer are not re-read from the device.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.

**Constraints:**
- Similar to #3282, this is currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on the heap, this could increase memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes care not to interfere with them.

**Benchmarks:**
I used the same benchmark as used in #3282.

Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```

Do a long range scan: seekrandom with a large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

```
Before:
seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found)
```

~4.5X perf improvement. Averaged over 3 runs.

Closes https://github.com/facebook/rocksdb/pull/3884

Differential Revision: D8082143

Pulled By: sagar0

fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
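The readahead sizing policy described in the commit message can be sketched roughly as below. This is a minimal illustration and not the RocksDB implementation: the class `ReadaheadPolicy`, the method `OnFileRead`, and the constant names are hypothetical, and "exponentially increased" is assumed here to mean doubling on each sequential IO.

```cpp
#include <algorithm>
#include <cstddef>

// Values taken from the commit message; the names are illustrative only.
constexpr std::size_t kInitReadahead = 8 * 1024;    // starting size: 8 KB
constexpr std::size_t kMaxReadahead = 256 * 1024;   // upper bound: 256 KB
constexpr int kReadsBeforePrefetch = 2;             // enable after more than 2 IOs

// Hypothetical helper that tracks IOs issued against one table file and
// decides how many extra bytes to prefetch along with each of them.
class ReadaheadPolicy {
 public:
  // Call once per IO on the same table file. Returns the readahead size to
  // use for this IO, or 0 while readahead is not yet enabled.
  std::size_t OnFileRead() {
    ++num_reads_;
    if (num_reads_ <= kReadsBeforePrefetch) {
      return 0;  // too few IOs so far to treat this as a long scan
    }
    const std::size_t current = next_size_;
    // Assumption: exponential growth is modeled as doubling, capped at the max.
    next_size_ = std::min(kMaxReadahead, next_size_ * 2);
    return current;
  }

 private:
  int num_reads_ = 0;
  std::size_t next_size_ = kInitReadahead;
};
```

Under these assumptions, the first two reads of a file prefetch nothing, the third prefetches 8 KB, and each following read doubles that until it settles at 256 KB, so a long scan quickly reaches large IOs while short lookups pay no extra cost.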
||
---|---|---|
.. | ||
aligned_buffer.h | ||
allocator.h | ||
arena_test.cc | ||
arena.cc | ||
arena.h | ||
auto_roll_logger_test.cc | ||
auto_roll_logger.cc | ||
auto_roll_logger.h | ||
autovector_test.cc | ||
autovector.h | ||
bloom_test.cc | ||
bloom.cc | ||
build_version.cc.in | ||
build_version.h | ||
cast_util.h | ||
channel.h | ||
coding_test.cc | ||
coding.cc | ||
coding.h | ||
compaction_job_stats_impl.cc | ||
comparator.cc | ||
compression_context_cache.cc | ||
compression_context_cache.h | ||
compression.h | ||
concurrent_arena.cc | ||
concurrent_arena.h | ||
core_local.h | ||
crc32c_ppc_asm.S | ||
crc32c_ppc_constants.h | ||
crc32c_ppc.c | ||
crc32c_ppc.h | ||
crc32c_test.cc | ||
crc32c.cc | ||
crc32c.h | ||
delete_scheduler_test.cc | ||
delete_scheduler.cc | ||
delete_scheduler.h | ||
duplicate_detector.h | ||
dynamic_bloom_test.cc | ||
dynamic_bloom.cc | ||
dynamic_bloom.h | ||
event_logger_test.cc | ||
event_logger.cc | ||
event_logger.h | ||
fault_injection_test_env.cc | ||
fault_injection_test_env.h | ||
file_reader_writer_test.cc | ||
file_reader_writer.cc | ||
file_reader_writer.h | ||
file_util.cc | ||
file_util.h | ||
filelock_test.cc | ||
filename.cc | ||
filename.h | ||
filter_policy.cc | ||
gflags_compat.h | ||
hash_map.h | ||
hash_test.cc | ||
hash.cc | ||
hash.h | ||
heap_test.cc | ||
heap.h | ||
kv_map.h | ||
log_buffer.cc | ||
log_buffer.h | ||
log_write_bench.cc | ||
logging.h | ||
memory_usage.h | ||
murmurhash.cc | ||
murmurhash.h | ||
mutexlock.h | ||
ppc-opcode.h | ||
random.cc | ||
random.h | ||
rate_limiter_test.cc | ||
rate_limiter.cc | ||
rate_limiter.h | ||
set_comparator.h | ||
slice_transform_test.cc | ||
slice.cc | ||
sst_file_manager_impl.cc | ||
sst_file_manager_impl.h | ||
status_message.cc | ||
status.cc | ||
stderr_logger.h | ||
stop_watch.h | ||
string_util.cc | ||
string_util.h | ||
sync_point_impl.cc | ||
sync_point_impl.h | ||
sync_point.cc | ||
sync_point.h | ||
testharness.cc | ||
testharness.h | ||
testutil.cc | ||
testutil.h | ||
thread_list_test.cc | ||
thread_local_test.cc | ||
thread_local.cc | ||
thread_local.h | ||
thread_operation.h | ||
threadpool_imp.cc | ||
threadpool_imp.h | ||
timer_queue_test.cc | ||
timer_queue.h | ||
transaction_test_util.cc | ||
transaction_test_util.h | ||
xxhash.cc | ||
xxhash.h |