rocksdb/table
Sagar Vemuri 7103559f49 Improve direct IO range scan performance with readahead (#3884)
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.

**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.

**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure not to re-read partial chunks of data that were already available in the buffer, from device again.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.

**Constraints:**
- Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.

**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```

Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

```
Before:
seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement. Taken on an average of 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884

Differential Revision: D8082143

Pulled By: sagar0

fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
2018-06-21 11:13:08 -07:00
..
adaptive_table_factory.cc Comment out unused variables 2018-03-05 13:13:41 -08:00
adaptive_table_factory.h Comment out unused variables 2018-03-05 13:13:41 -08:00
block_based_filter_block_test.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
block_based_filter_block.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
block_based_filter_block.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
block_based_table_builder.cc run make format for PR 3838 (#3954) 2018-06-05 12:58:02 -07:00
block_based_table_builder.h run make format for PR 3838 (#3954) 2018-06-05 12:58:02 -07:00
block_based_table_factory.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
block_based_table_factory.h Align SST file data blocks to avoid spanning multiple pages 2018-03-26 20:26:10 -07:00
block_based_table_reader.cc Improve direct IO range scan performance with readahead (#3884) 2018-06-21 11:13:08 -07:00
block_based_table_reader.h Improve direct IO range scan performance with readahead (#3884) 2018-06-21 11:13:08 -07:00
block_builder.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
block_builder.h Change RocksDB License 2017-07-15 16:11:23 -07:00
block_fetcher.cc run make format for PR 3838 (#3954) 2018-06-05 12:58:02 -07:00
block_fetcher.h Fix BlockFetcher ASAN error 2017-12-12 12:12:38 -08:00
block_prefix_index.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
block_prefix_index.h Change RocksDB License 2017-07-15 16:11:23 -07:00
block_test.cc Exclude seq from index keys 2018-05-25 18:42:43 -07:00
block.cc Should only decode restart points for uncompressed blocks (#3996) 2018-06-15 19:26:58 -07:00
block.h Fix performance regression in Get() for block-based tables (#3953) 2018-06-05 11:43:16 -07:00
bloom_block.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
bloom_block.h Change RocksDB License 2017-07-15 16:11:23 -07:00
cleanable_test.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
cuckoo_table_builder_test.cc Should only decode restart points for uncompressed blocks (#3996) 2018-06-15 19:26:58 -07:00
cuckoo_table_builder.cc Enable MSVC W4 with a few exceptions. Fix warnings and bugs 2017-10-19 10:57:12 -07:00
cuckoo_table_builder.h Change RocksDB License 2017-07-15 16:11:23 -07:00
cuckoo_table_factory.cc Comment out unused variables 2018-03-05 13:13:41 -08:00
cuckoo_table_factory.h comment unused parameters to turn on -Wunused-parameter flag 2018-04-12 17:59:16 -07:00
cuckoo_table_reader_test.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
cuckoo_table_reader.cc Should only decode restart points for uncompressed blocks (#3996) 2018-06-15 19:26:58 -07:00
cuckoo_table_reader.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
filter_block.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
flush_block_policy.cc Align SST file data blocks to avoid spanning multiple pages 2018-03-26 20:26:10 -07:00
format.cc run make format for PR 3838 (#3954) 2018-06-05 12:58:02 -07:00
format.h run make format for PR 3838 (#3954) 2018-06-05 12:58:02 -07:00
full_filter_bits_builder.h Skip duplicate bloom keys when whole_key and prefix are mixed 2018-04-24 10:58:16 -07:00
full_filter_block_test.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
full_filter_block.cc Fix the bug with duplicate prefix in partition filters (#4024) 2018-06-19 14:12:46 -07:00
full_filter_block.h Fix the bug with duplicate prefix in partition filters (#4024) 2018-06-19 14:12:46 -07:00
get_context.cc comment unused parameters to turn on -Wunused-parameter flag 2018-04-12 17:59:16 -07:00
get_context.h Stats for false positive rate of full filtesr 2018-04-05 15:58:48 -07:00
index_builder.cc Extend format 3 to partitioned index/filters (#3958) 2018-06-06 16:58:16 -07:00
index_builder.h Extend format 3 to partitioned index/filters (#3958) 2018-06-06 16:58:16 -07:00
internal_iterator.h Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs 2018-05-17 02:56:56 -07:00
iter_heap.h Make InternalKeyComparator final and directly use it in merging iterator 2017-09-11 12:04:21 -07:00
iterator_wrapper.h Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs 2018-05-17 02:56:56 -07:00
iterator.cc Comment out unused variables 2018-03-05 13:13:41 -08:00
merger_test.cc Make InternalKeyComparator final and directly use it in merging iterator 2017-09-11 12:04:21 -07:00
merging_iterator.cc Fix regression bug of Prev() with upper bound (#3989) 2018-06-12 16:57:36 -07:00
merging_iterator.h fix DBImpl::NewInternalIterator super-version leak on failure 2017-10-11 14:57:43 -07:00
meta_blocks.cc Should only decode restart points for uncompressed blocks (#3996) 2018-06-15 19:26:58 -07:00
meta_blocks.h Should only decode restart points for uncompressed blocks (#3996) 2018-06-15 19:26:58 -07:00
mock_table.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
mock_table.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
partitioned_filter_block_test.cc Fix the bug with duplicate prefix in partition filters (#4024) 2018-06-19 14:12:46 -07:00
partitioned_filter_block.cc Fix the bug with duplicate prefix in partition filters (#4024) 2018-06-19 14:12:46 -07:00
partitioned_filter_block.h Extend format 3 to partitioned index/filters (#3958) 2018-06-06 16:58:16 -07:00
persistent_cache_helper.cc comment unused parameters to turn on -Wunused-parameter flag 2018-04-12 17:59:16 -07:00
persistent_cache_helper.h Change RocksDB License 2017-07-15 16:11:23 -07:00
persistent_cache_options.h Change RocksDB License 2017-07-15 16:11:23 -07:00
plain_table_builder.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
plain_table_builder.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
plain_table_factory.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
plain_table_factory.h Comment out unused variables 2018-03-05 13:13:41 -08:00
plain_table_index.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
plain_table_index.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
plain_table_key_coding.cc Comment out unused variables 2018-03-05 13:13:41 -08:00
plain_table_key_coding.h Change RocksDB License 2017-07-15 16:11:23 -07:00
plain_table_reader.cc Should only decode restart points for uncompressed blocks (#3996) 2018-06-15 19:26:58 -07:00
plain_table_reader.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
scoped_arena_iterator.h Change RocksDB License 2017-07-15 16:11:23 -07:00
sst_file_writer_collectors.h Comment out unused variables 2018-03-05 13:13:41 -08:00
sst_file_writer.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
table_builder.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
table_properties_internal.h Eliminate some redundant block reads. 2018-01-10 17:11:58 -08:00
table_properties.cc Exclude seq from index keys 2018-05-25 18:42:43 -07:00
table_reader_bench.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
table_reader.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
table_test.cc Should only decode restart points for uncompressed blocks (#3996) 2018-06-15 19:26:58 -07:00
two_level_iterator.cc Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs 2018-05-17 02:56:56 -07:00
two_level_iterator.h fix memory leak in two_level_iterator 2018-04-15 17:26:26 -07:00