rocksdb/table
Andrew Kryczka 62f70f6d14 Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.

So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:

- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952

Differential Revision: D13967980

Pulled By: ajkr

fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-11 19:47:32 -08:00
..
adaptive_table_factory.cc Update all unique/shared_ptr instances to be qualified with namespace std (#4638) 2018-11-09 11:19:58 -08:00
adaptive_table_factory.h Update all unique/shared_ptr instances to be qualified with namespace std (#4638) 2018-11-09 11:19:58 -08:00
block_based_filter_block_test.cc Remove two variables from BlockContents class and don't use class Block for compressed block (#4650) 2018-11-13 17:02:55 -08:00
block_based_filter_block.cc PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067) 2018-06-27 20:42:43 -07:00
block_based_filter_block.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
block_based_table_builder.cc Reduce scope of compression dictionary to single SST (#4952) 2019-02-11 19:47:32 -08:00
block_based_table_builder.h Reduce scope of compression dictionary to single SST (#4952) 2019-02-11 19:47:32 -08:00
block_based_table_factory.cc Reduce scope of compression dictionary to single SST (#4952) 2019-02-11 19:47:32 -08:00
block_based_table_factory.h Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697) 2018-11-21 11:29:57 -08:00
block_based_table_reader.cc Checksum properties block for block-based table (#4956) 2019-02-11 11:50:01 -08:00
block_based_table_reader.h Cache dictionary used for decompressing data blocks (#4881) 2019-01-23 18:15:47 -08:00
block_builder.cc DataBlockHashIndex: Remove the division from EstimateSize() (#4293) 2018-08-20 23:13:50 -07:00
block_builder.h Add path to WritableFileWriter. (#4039) 2018-08-23 10:12:58 -07:00
block_fetcher.cc Cache dictionary used for decompressing data blocks (#4881) 2019-01-23 18:15:47 -08:00
block_fetcher.h Cache dictionary used for decompressing data blocks (#4881) 2019-01-23 18:15:47 -08:00
block_prefix_index.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
block_prefix_index.h Change RocksDB License 2017-07-15 16:11:23 -07:00
block_test.cc Remove two variables from BlockContents class and don't use class Block for compressed block (#4650) 2018-11-13 17:02:55 -08:00
block.cc Checksum properties block for block-based table (#4956) 2019-02-11 11:50:01 -08:00
block.h Checksum properties block for block-based table (#4956) 2019-02-11 11:50:01 -08:00
bloom_block.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
bloom_block.h Disallow customized hash function in DynamicBloom (#4915) 2019-01-24 10:34:30 -08:00
cleanable_test.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
cuckoo_table_builder_test.cc Update all unique/shared_ptr instances to be qualified with namespace std (#4638) 2018-11-09 11:19:58 -08:00
cuckoo_table_builder.cc Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594) 2018-10-30 15:34:27 -07:00
cuckoo_table_builder.h Change RocksDB License 2017-07-15 16:11:23 -07:00
cuckoo_table_factory.cc Update all unique/shared_ptr instances to be qualified with namespace std (#4638) 2018-11-09 11:19:58 -08:00
cuckoo_table_factory.h Update all unique/shared_ptr instances to be qualified with namespace std (#4638) 2018-11-09 11:19:58 -08:00
cuckoo_table_reader_test.cc Update all unique/shared_ptr instances to be qualified with namespace std (#4638) 2018-11-09 11:19:58 -08:00
cuckoo_table_reader.cc Support pragma once in all header files and cleanup some warnings (#4339) 2018-09-05 18:13:31 -07:00
cuckoo_table_reader.h Index value delta encoding (#3983) 2018-08-09 16:58:40 -07:00
data_block_footer.cc Add db_bench options of data block hash index (#4281) 2018-08-16 18:42:46 -07:00
data_block_footer.h Add db_bench options of data block hash index (#4281) 2018-08-16 18:42:46 -07:00
data_block_hash_index_test.cc Reduce scope of compression dictionary to single SST (#4952) 2019-02-11 19:47:32 -08:00
data_block_hash_index.cc DataBlockHashIndex: Remove the division from EstimateSize() (#4293) 2018-08-20 23:13:50 -07:00
data_block_hash_index.h DataBlockHashIndex: Remove the division from EstimateSize() (#4293) 2018-08-20 23:13:50 -07:00
filter_block.h use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899) 2018-06-26 15:57:26 -07:00
flush_block_policy.cc Align SST file data blocks to avoid spanning multiple pages 2018-03-26 20:26:10 -07:00
format.cc Digest ZSTD compression dictionary once when writing SST file (#4849) 2019-01-18 19:12:57 -08:00
format.h Digest ZSTD compression dictionary once when writing SST file (#4849) 2019-01-18 19:12:57 -08:00
full_filter_bits_builder.h Skip duplicate bloom keys when whole_key and prefix are mixed 2018-04-24 10:58:16 -07:00
full_filter_block_test.cc Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
full_filter_block.cc Charging block cache more accurately (#4073) 2018-06-29 08:57:20 -07:00
full_filter_block.h Charging block cache more accurately (#4073) 2018-06-29 08:57:20 -07:00
get_context.cc PlainTable should avoid copying Get() results from immortal source. (#4924) 2019-01-25 17:12:19 -08:00
get_context.h Cache dictionary used for decompressing data blocks (#4881) 2019-01-23 18:15:47 -08:00
index_builder.cc Add path to WritableFileWriter. (#4039) 2018-08-23 10:12:58 -07:00
index_builder.h Add path to WritableFileWriter. (#4039) 2018-08-23 10:12:58 -07:00
internal_iterator.h Index value delta encoding (#3983) 2018-08-09 16:58:40 -07:00
iter_heap.h Make InternalKeyComparator final and directly use it in merging iterator 2017-09-11 12:04:21 -07:00
iterator_wrapper.h Index value delta encoding (#3983) 2018-08-09 16:58:40 -07:00
iterator.cc fix typo in error message, twice (#4457) 2018-10-09 17:07:27 -07:00
merger_test.cc Make InternalKeyComparator final and directly use it in merging iterator 2017-09-11 12:04:21 -07:00
merging_iterator.cc Index value delta encoding (#3983) 2018-08-09 16:58:40 -07:00
merging_iterator.h Index value delta encoding (#3983) 2018-08-09 16:58:40 -07:00
meta_blocks.cc Checksum properties block for block-based table (#4956) 2019-02-11 11:50:01 -08:00
meta_blocks.h Checksum properties block for block-based table (#4956) 2019-02-11 11:50:01 -08:00
mock_table.cc Update all unique/shared_ptr instances to be qualified with namespace std (#4638) 2018-11-09 11:19:58 -08:00
mock_table.h Update all unique/shared_ptr instances to be qualified with namespace std (#4638) 2018-11-09 11:19:58 -08:00
partitioned_filter_block_test.cc Remove two variables from BlockContents class and don't use class Block for compressed block (#4650) 2018-11-13 17:02:55 -08:00
partitioned_filter_block.cc Fix bug in partition filters with format_version=4 (#4381) 2018-09-17 17:28:15 -07:00
partitioned_filter_block.h Fix bug in partition filters with format_version=4 (#4381) 2018-09-17 17:28:15 -07:00
persistent_cache_helper.cc Remove two variables from BlockContents class and don't use class Block for compressed block (#4650) 2018-11-13 17:02:55 -08:00
persistent_cache_helper.h Change RocksDB License 2017-07-15 16:11:23 -07:00
persistent_cache_options.h Change RocksDB License 2017-07-15 16:11:23 -07:00
plain_table_builder.cc Remove PlainTable's feature store_index_in_file (#4914) 2019-01-28 12:50:22 -08:00
plain_table_builder.h Remove PlainTable's feature store_index_in_file (#4914) 2019-01-28 12:50:22 -08:00
plain_table_factory.cc Remove cuckoo hash memtable (#4953) 2019-02-07 16:15:27 -08:00
plain_table_factory.h Remove PlainTable's feature store_index_in_file (#4914) 2019-01-28 12:50:22 -08:00
plain_table_index.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
plain_table_index.h Move prefix_extractor to MutableCFOptions 2018-05-21 14:43:11 -07:00
plain_table_key_coding.cc Comment out unused variables 2018-03-05 13:13:41 -08:00
plain_table_key_coding.h Update all unique/shared_ptr instances to be qualified with namespace std (#4638) 2018-11-09 11:19:58 -08:00
plain_table_reader.cc Remove PlainTable's feature store_index_in_file (#4914) 2019-01-28 12:50:22 -08:00
plain_table_reader.h PlainTable should avoid copying Get() results from immortal source. (#4924) 2019-01-25 17:12:19 -08:00
scoped_arena_iterator.h Change RocksDB License 2017-07-15 16:11:23 -07:00
sst_file_reader_test.cc Get CompactionJobInfo from CompactFiles 2018-12-13 14:21:24 -08:00
sst_file_reader.cc Get CompactionJobInfo from CompactFiles 2018-12-13 14:21:24 -08:00
sst_file_writer_collectors.h Comment out unused variables 2018-03-05 13:13:41 -08:00
sst_file_writer.cc Reduce scope of compression dictionary to single SST (#4952) 2019-02-11 19:47:32 -08:00
table_builder.h Reduce scope of compression dictionary to single SST (#4952) 2019-02-11 19:47:32 -08:00
table_properties_internal.h Index value delta encoding (#3983) 2018-08-09 16:58:40 -07:00
table_properties.cc Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594) 2018-10-30 15:34:27 -07:00
table_reader_bench.cc Reduce scope of compression dictionary to single SST (#4952) 2019-02-11 19:47:32 -08:00
table_reader.h Clean up FragmentedRangeTombstoneList (#4692) 2018-11-28 15:29:02 -08:00
table_test.cc Reduce scope of compression dictionary to single SST (#4952) 2019-02-11 19:47:32 -08:00
two_level_iterator.cc Index value delta encoding (#3983) 2018-08-09 16:58:40 -07:00
two_level_iterator.h Index value delta encoding (#3983) 2018-08-09 16:58:40 -07:00