62f70f6d14
Summary: Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.

So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:

- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952

Differential Revision: D13967980

Pulled By: ajkr

fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
||
---|---|---|
advisor | ||
dump | ||
rdb | ||
auto_sanity_test.sh | ||
benchmark_leveldb.sh | ||
benchmark.sh | ||
blob_dump.cc | ||
check_format_compatible.sh | ||
CMakeLists.txt | ||
db_bench_tool_test.cc | ||
db_bench_tool.cc | ||
db_bench.cc | ||
db_crashtest.py | ||
db_repl_stress.cc | ||
db_sanity_test.cc | ||
db_stress.cc | ||
dbench_monitor | ||
Dockerfile | ||
generate_random_db.sh | ||
ingest_external_sst.sh | ||
ldb_cmd_impl.h | ||
ldb_cmd_test.cc | ||
ldb_cmd.cc | ||
ldb_test.py | ||
ldb_tool.cc | ||
ldb.cc | ||
pflag | ||
reduce_levels_test.cc | ||
regression_test.sh | ||
report_lite_binary_size.sh | ||
rocksdb_dump_test.sh | ||
run_flash_bench.sh | ||
run_leveldb.sh | ||
sample-dump.dmp | ||
sst_dump_test.cc | ||
sst_dump_tool_imp.h | ||
sst_dump_tool.cc | ||
sst_dump.cc | ||
trace_analyzer_test.cc | ||
trace_analyzer_tool.cc | ||
trace_analyzer_tool.h | ||
trace_analyzer.cc | ||
verify_random_db.sh | ||
write_external_sst.sh | ||
write_stress_runner.py | ||
write_stress.cc |
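
The buffering behavior described in the commit summary can be illustrated with a small model. The sketch below is not RocksDB code: `ToyTableBuilder`, its constructor parameters, and its accessors are hypothetical names chosen for illustration. Only the state names `kBuffered` and `kUnbuffered` and the method names `Add`, `Finish`, and `EnterUnbuffered` come from the summary. It assumes that "training" can be stood in for by concatenating samples and that "compressing/writing" can be stood in for by counting blocks; it also takes samples in insertion order, which may differ from how the real builder samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model of a two-state table builder.
// It is NOT RocksDB's BlockBasedTableBuilder.
class ToyTableBuilder {
 public:
  enum class State { kBuffered, kUnbuffered, kClosed };

  ToyTableBuilder(std::size_t block_size, std::size_t target_file_size,
                  std::size_t max_train_bytes)
      : block_size_(block_size),
        target_file_size_(target_file_size),
        max_train_bytes_(max_train_bytes) {}

  void Add(const std::string& key, const std::string& value) {
    assert(state_ != State::kClosed);
    current_block_ += key;
    current_block_ += value;
    if (current_block_.size() < block_size_) return;  // block not full yet
    CutBlock();
    // Leave the buffered state once enough uncompressed data has piled up.
    if (state_ == State::kBuffered && buffered_bytes_ >= target_file_size_) {
      EnterUnbuffered();
    }
  }

  void Finish() {
    assert(state_ != State::kClosed);
    if (!current_block_.empty()) CutBlock();
    if (state_ == State::kBuffered) EnterUnbuffered();
    state_ = State::kClosed;
  }

  State state() const { return state_; }
  const std::string& dictionary() const { return dictionary_; }
  std::size_t blocks_written() const { return blocks_written_; }

 private:
  // Ends the current block: buffer it, or write it out immediately.
  void CutBlock() {
    if (state_ == State::kBuffered) {
      buffered_bytes_ += current_block_.size();
      buffered_blocks_.push_back(std::move(current_block_));
    } else {
      WriteBlock(current_block_);
    }
    current_block_.clear();
  }

  // Sample buffered blocks, "train" a dictionary, then flush everything.
  void EnterUnbuffered() {
    std::size_t budget = max_train_bytes_;
    for (const std::string& block : buffered_blocks_) {
      if (budget == 0) break;
      // Whole blocks are sampled; only the last sample may be partial
      // so that the total stays within the training budget.
      const std::size_t n = std::min(budget, block.size());
      dictionary_.append(block, 0, n);  // stand-in for real training
      budget -= n;
    }
    state_ = State::kUnbuffered;
    for (const std::string& block : buffered_blocks_) WriteBlock(block);
    buffered_blocks_.clear();
    buffered_bytes_ = 0;
  }

  // Stand-in for "compress with the dictionary and write to the file".
  void WriteBlock(const std::string& /*block*/) { ++blocks_written_; }

  const std::size_t block_size_;
  const std::size_t target_file_size_;
  const std::size_t max_train_bytes_;
  State state_ = State::kBuffered;
  std::string current_block_;
  std::vector<std::string> buffered_blocks_;
  std::size_t buffered_bytes_ = 0;
  std::string dictionary_;
  std::size_t blocks_written_ = 0;
};
```

In this model nothing is written while the builder is in `kBuffered`; the first write happens inside `EnterUnbuffered()`, after the dictionary exists, which mirrors the memory-for-ratio tradeoff the summary describes.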