rocksdb/db
Mike Kolupaev b4d7209428 Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of the merging iterator's heap. For short range scans, most files never reach the top of the heap, so this change can substantially reduce read amplification for such workloads.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by a log ID and a sequence number within the log. The RocksDB key is the concatenation of the log ID and the sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
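
The key layout described above can be sketched as follows. This is a minimal illustration, not code from the PR: the helper names (AppendBigEndian64, MakeRecordKey) and the choice of 64-bit fields are assumptions. Big-endian encoding matters because a bytewise key comparison then orders records by log ID first and by sequence number second.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends `v` to `out` as 8 bytes, most significant byte first.
inline void AppendBigEndian64(std::string* out, uint64_t v) {
  for (int shift = 56; shift >= 0; shift -= 8) {
    out->push_back(static_cast<char>((v >> shift) & 0xff));
  }
}

// Hypothetical helper: builds a 16-byte key from a log ID and a sequence
// number. The field widths are an assumption made for this sketch.
inline std::string MakeRecordKey(uint64_t log_id, uint64_t seq_no) {
  std::string key;
  key.reserve(16);
  AppendBigEndian64(&key, log_id);
  AppendBigEndian64(&key, seq_no);
  return key;
}
```

With this encoding, all records of one log are contiguous in key order, so a range scan within a log is a scan over one contiguous key range.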

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s moves through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in a small number of sst files; in this case, the cost of the initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains the first key from each block, we don't have to read the block until this block actually makes it to the top of the merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
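
The idea can be illustrated with a toy model. This is a sketch under stated assumptions, not RocksDB's implementation: Block, LazyFileIter and ReadSmallest are hypothetical names, each "file" holds a single block, and a counter stands in for real IO. Each iterator exposes the block's first key, as if taken from the index, and only the iterator at the top of the heap pays for a block read.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// One data block: sorted (key, value) pairs. Must be non-empty.
using Block = std::vector<std::pair<std::string, std::string>>;

// Toy stand-in for one sst file's iterator after Seek().
struct LazyFileIter {
  explicit LazyFileIter(const Block* block)
      : block_on_disk(block), first_key(block->front().first) {}

  // Served from the copy of the first key kept in the index: no block read.
  const std::string& key() const { return first_key; }

  // Reading the value forces the deferred block read (counted once).
  const std::string& value(int* block_reads) {
    if (!loaded) {
      loaded = true;
      ++*block_reads;
    }
    return block_on_disk->front().second;
  }

  const Block* block_on_disk;  // pretend this lives on disk
  std::string first_key;       // what the index would store for the block
  bool loaded = false;
};

// Builds a min-heap over all iterators using keys only, then reads the value
// from the top one. `iters` must be non-empty.
inline std::string ReadSmallest(std::vector<LazyFileIter>& iters,
                                int* block_reads) {
  auto greater = [](const LazyFileIter* a, const LazyFileIter* b) {
    return a->key() > b->key();
  };
  std::priority_queue<LazyFileIter*, std::vector<LazyFileIter*>,
                      decltype(greater)>
      heap(greater);
  for (auto& it : iters) heap.push(&it);
  return heap.top()->value(block_reads);
}
```

In this model, ordering three files by key costs no block reads, and fetching the smallest record costs exactly one; without the first key in the index, positioning each iterator would have required reading one block per file.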

This PR does the deferred block loading inside the value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch the deferred value and get a status. I'll do it in a separate PR.
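
One possible shape for such an interface is sketched below. Everything here is hypothetical and only illustrates the design trade-off: ToyStatus is not RocksDB's Status class, and DeferredValueIter and the method name PrepareValue are assumptions made for this example, not the interface of this PR.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Minimal stand-in for a status object (not RocksDB's Status).
struct ToyStatus {
  bool ok;
  std::string message;
};

// Hypothetical iterator whose value must be fetched explicitly.
class DeferredValueIter {
 public:
  DeferredValueIter(std::string key, std::string stored_value,
                    bool simulate_io_error)
      : key_(std::move(key)),
        stored_value_(std::move(stored_value)),
        simulate_io_error_(simulate_io_error) {}

  // Always available without IO.
  const std::string& key() const { return key_; }

  // Performs the deferred read and reports failure to the caller instead of
  // hiding it inside value().
  ToyStatus PrepareValue() {
    if (simulate_io_error_) {
      return {false, "IO error: block read failed"};
    }
    value_ = stored_value_;
    prepared_ = true;
    return {true, ""};
  }

  // Only valid after PrepareValue() returned an ok status.
  const std::string& value() const {
    assert(prepared_);
    return value_;
  }

 private:
  std::string key_;
  std::string stored_value_;  // pretend this lives in an unread block
  bool simulate_io_error_;
  bool prepared_ = false;
  std::string value_;
};
```

A caller would check the returned status before touching value(), which gives the IO error a place to surface that a plain value() accessor does not have.
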
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289

Differential Revision: D15256423

Pulled By: al13n321

fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:54:04 -07:00
..
compaction Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
db_impl Fix ingested file and direcotry not being sync (#5435) 2019-06-21 10:15:38 -07:00
builder.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
builder.h Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
c_test.c add missing rocksdb_flush_cf in c (#5243) 2019-04-25 11:25:43 -07:00
c.cc Unordered Writes (#5218) 2019-05-13 17:47:21 -07:00
column_family_test.cc Make format 2019-05-31 15:24:43 -07:00
column_family.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
column_family.h Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
compact_files_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
compacted_db_impl.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
compacted_db_impl.h Make format 2019-05-31 15:24:43 -07:00
comparator_db_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
convenience.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
corruption_test.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
cuckoo_table_db_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
db_basic_test.cc Add support for timestamp in Get/Put (#5079) 2019-06-05 23:10:47 -07:00
db_blob_index_test.cc fix lite build 2017-10-17 08:57:09 -07:00
db_block_cache_test.cc Move the index readers out of the block cache (#5298) 2019-05-30 11:53:27 -07:00
db_bloom_filter_test.cc Unordered Writes (#5218) 2019-05-13 17:47:21 -07:00
db_compaction_filter_test.cc Apply modernize-use-override (2nd iteration) 2019-02-14 14:41:36 -08:00
db_compaction_test.cc Combine the read-ahead logic for user reads and compaction reads (#5431) 2019-06-19 14:10:46 -07:00
db_dynamic_level_test.cc Fix flaky DBDynamicLevelTest.DynamicLevelMaxBytesBase2 (#4668) 2018-11-12 16:42:16 -08:00
db_encryption_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
db_filesnapshot.cc Add missing check before calling PurgeObsoleteFiles in EnableFileDeletions (#5448) 2019-06-13 14:43:13 -07:00
db_flush_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
db_info_dumper.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
db_info_dumper.h Change RocksDB License 2017-07-15 16:11:23 -07:00
db_inplace_update_test.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
db_io_failure_test.cc Disable DBIOFailureTest.NoSpaceCompactRange in LITE (#4596) 2018-10-29 14:36:31 -07:00
db_iter_stress_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
db_iter_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
db_iter.cc Revert "Reduce iterator key comparison for upper/lower bound check (#5111)" (#5440) 2019-06-11 16:23:41 -07:00
db_iter.h Make format 2019-05-31 15:24:43 -07:00
db_iterator_test.cc Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
db_log_iter_test.cc Apply modernize-use-override (2nd iteration) 2019-02-14 14:41:36 -08:00
db_memtable_test.cc Fix tsan complaint in ConcurrentMergeWrite test (#5308) 2019-05-15 11:21:48 -07:00
db_merge_operator_test.cc WriteUnPrepared: less virtual in iterator callback (#5049) 2019-04-02 14:47:16 -07:00
db_options_test.cc fix rocksdb lite and clang contrun test failures (#5477) 2019-06-17 21:16:29 -07:00
db_properties_test.cc Deprecate ttl option from CompactionOptionsFIFO (#4965) 2019-02-15 09:51:41 -08:00
db_range_del_test.cc Fix merging range tombstone covering put during flush/compaction (#5406) 2019-06-04 10:24:14 -07:00
db_sst_test.cc Increase Trash/DB size ratio in DBSSTTest.RateLimitedWALDelete (#5366) 2019-05-30 11:12:59 -07:00
db_statistics_test.cc Make statistics's stats_level change thread-safe (#5030) 2019-03-01 10:42:09 -08:00
db_table_properties_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
db_tailing_iter_test.cc Remove managed iterator 2018-07-17 14:43:18 -07:00
db_test2.cc Fix flaky DBTest2.PresetCompressionDict test (#5378) 2019-05-30 16:11:27 -07:00
db_test_util.cc Unordered Writes (#5218) 2019-05-13 17:47:21 -07:00
db_test_util.h simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
db_test.cc sanitize and limit block_size under 4GB (#5492) 2019-06-20 11:45:08 -07:00
db_universal_compaction_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
db_wal_test.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
db_write_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
dbformat_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
dbformat.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
dbformat.h Add support for timestamp in Get/Put (#5079) 2019-06-05 23:10:47 -07:00
deletefile_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
error_handler_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
error_handler.cc Make format 2019-05-31 15:24:43 -07:00
error_handler.h Fix typos in comments (#4456) 2018-10-04 20:46:50 -07:00
event_helpers.cc Log file_creation_time table property (#5232) 2019-04-22 15:30:07 -07:00
event_helpers.h Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
experimental.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
external_sst_file_basic_test.cc Fix ingested file and direcotry not being sync (#5435) 2019-06-21 10:15:38 -07:00
external_sst_file_ingestion_job.cc Fix ingested file and direcotry not being sync (#5435) 2019-06-21 10:15:38 -07:00
external_sst_file_ingestion_job.h Fix ingested file and direcotry not being sync (#5435) 2019-06-21 10:15:38 -07:00
external_sst_file_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
fault_injection_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
file_indexer_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
file_indexer.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
file_indexer.h Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
filename_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
flush_job_test.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
flush_job.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
flush_job.h Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
flush_scheduler.cc Remove global locks from FlushScheduler (#5372) 2019-06-10 16:50:26 -07:00
flush_scheduler.h Unordered Writes (#5218) 2019-05-13 17:47:21 -07:00
forward_iterator_bench.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
forward_iterator.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
forward_iterator.h Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
internal_stats.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
internal_stats.h Collect compaction stats by priority and dump to info LOG (#5050) 2019-03-19 17:28:19 -07:00
job_context.h WritePrepared: Fix visible key compacted out by compaction (#4883) 2019-01-15 21:34:38 -08:00
listener_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
log_format.h Fix an inaccurate comment (#4315) 2018-08-24 18:13:20 -07:00
log_reader.cc Support for single-primary, multi-secondary instances (#4899) 2019-03-26 16:45:31 -07:00
log_reader.h secondary instance: add support for WAL tailing on OpenAsSecondary 2019-04-24 12:08:44 -07:00
log_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
log_writer.cc LogWriter to only flush after finish generating whole record (#5328) 2019-05-21 12:33:17 -07:00
log_writer.h Close WAL files before deletion (#5233) 2019-04-25 10:11:41 -07:00
logs_with_prep_tracker.cc Skip deleted WALs during recovery 2018-05-03 15:43:09 -07:00
logs_with_prep_tracker.h Skip deleted WALs during recovery 2018-05-03 15:43:09 -07:00
lookup_key.h Introduce a new MultiGet batching implementation (#5011) 2019-04-11 14:28:26 -07:00
malloc_stats.cc Detect if Jemalloc is linked with the binary (#4844) 2019-01-03 16:30:12 -08:00
malloc_stats.h Change RocksDB License 2017-07-15 16:11:23 -07:00
manual_compaction_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
memtable_list_test.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
memtable_list.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
memtable_list.h Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
memtable.cc Add support for timestamp in Get/Put (#5079) 2019-06-05 23:10:47 -07:00
memtable.h Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
merge_context.h Introduce a new MultiGet batching implementation (#5011) 2019-04-11 14:28:26 -07:00
merge_helper_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
merge_helper.cc Fix merging range tombstone covering put during flush/compaction (#5406) 2019-06-04 10:24:14 -07:00
merge_helper.h Remove v1 RangeDelAggregator (#4778) 2018-12-17 17:33:46 -08:00
merge_operator.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
merge_test.cc Make format 2019-05-31 15:24:43 -07:00
obsolete_files_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
options_file_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
perf_context_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
pinned_iterators_manager.h Change RocksDB License 2017-07-15 16:11:23 -07:00
plain_table_db_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
pre_release_callback.h WritePrepared: reduce prepared_mutex_ overhead (#5420) 2019-06-10 11:53:31 -07:00
prefix_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
range_del_aggregator_bench.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
range_del_aggregator_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
range_del_aggregator.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
range_del_aggregator.h Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
range_tombstone_fragmenter_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
range_tombstone_fragmenter.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
range_tombstone_fragmenter.h Add compaction logic to RangeDelAggregatorV2 (#4758) 2018-12-17 13:20:51 -08:00
read_callback.h WritePrepared: fix race condition in reading batch with duplicate keys (#5147) 2019-04-12 14:40:41 -07:00
repair_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
repair.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
snapshot_checker.h WritePrepared: fix issue with snapshot released during compaction (#4858) 2019-01-16 09:55:32 -08:00
snapshot_impl.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
snapshot_impl.h Refresh snapshot list during long compactions (2nd attempt) (#5278) 2019-05-03 17:30:22 -07:00
table_cache.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
table_cache.h Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
table_properties_collector_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
table_properties_collector.cc Feature for sampling and reporting compressibility (#4842) 2019-03-18 12:15:34 -07:00
table_properties_collector.h Feature for sampling and reporting compressibility (#4842) 2019-03-18 12:15:34 -07:00
transaction_log_impl.cc Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474) 2019-06-19 08:10:08 -07:00
transaction_log_impl.h Move some file related files outside util/ (#5375) 2019-05-29 20:47:06 -07:00
version_builder_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
version_builder.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
version_builder.h Support for single-primary, multi-secondary instances (#4899) 2019-03-26 16:45:31 -07:00
version_edit_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
version_edit.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
version_edit.h Make RocksDB secondary instance respect atomic groups in version edits. (#5411) 2019-06-04 10:56:19 -07:00
version_set_test.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
version_set.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
version_set.h Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
wal_manager_test.cc Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474) 2019-06-19 08:10:08 -07:00
wal_manager.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
wal_manager.h improve comment for WalManager (#5350) 2019-05-24 10:40:30 -07:00
write_batch_base.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
write_batch_internal.h WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069) 2018-06-28 18:58:29 -07:00
write_batch_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
write_batch.cc Make format 2019-05-31 15:24:43 -07:00
write_callback_test.cc WritePrepared: reduce prepared_mutex_ overhead (#5420) 2019-06-10 11:53:31 -07:00
write_callback.h Change RocksDB License 2017-07-15 16:11:23 -07:00
write_controller_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
write_controller.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
write_controller.h Change RocksDB License 2017-07-15 16:11:23 -07:00
write_thread.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
write_thread.h Fix skip WAL for whole write_group when leader's callback fail (#4838) 2019-01-03 12:40:42 -08:00