rocksdb/util
Andrew Kryczka d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
..
aligned_buffer.h Fix wrong comments about function TruncateToPageBoundary. (#6975) 2020-10-07 12:34:34 -07:00
autovector_test.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
autovector.h Change autovector to have a reserved size in LITE mode (#6868) 2020-05-21 14:48:10 -07:00
bloom_impl.h Ribbon: InterleavedSolutionStorage (#7598) 2020-11-03 12:46:36 -08:00
bloom_test.cc Support optimize_filters_for_memory for Ribbon filter (#7774) 2020-12-18 14:31:03 -08:00
build_version.cc.in Make builds reproducible (#7866) 2021-01-28 17:42:16 -08:00
cast_util.h Replace reinterpret_cast with static_cast_with_check (#7067) 2020-07-02 19:25:41 -07:00
channel.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
coding_lean.h Add Encode/DecodeFixedGeneric, coding_lean.h (#7587) 2020-10-23 14:11:15 -07:00
coding_test.cc Fix potential overflow of unsigned type in for loop (#6902) 2020-06-02 15:05:07 -07:00
coding.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
coding.h Add Encode/DecodeFixedGeneric, coding_lean.h (#7587) 2020-10-23 14:11:15 -07:00
compaction_job_stats_impl.cc Add is_full_compaction to CompactionJobStats, cleanup (#7451) 2020-10-01 12:52:58 -07:00
comparator.cc Remove unused includes (#7604) 2020-10-28 23:22:27 -07:00
compression_context_cache.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
compression_context_cache.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
compression.h Limit buffering for collecting samples for compression dictionary (#7970) 2021-02-19 14:09:54 -08:00
concurrent_task_limiter_impl.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
concurrent_task_limiter_impl.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
core_local.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
crc32c_arm64.cc Adding ARM AT_HWCAP support for FreeBSD (#7750) 2020-12-08 13:33:21 -08:00
crc32c_arm64.h Fix compilation on Apple Silicon (#7714) 2020-12-04 15:22:33 -08:00
crc32c_ppc_asm.S Fix Compilation on ppc64le using Clang 11 (#7713) 2020-12-01 11:21:44 -08:00
crc32c_ppc_constants.h Remove PATENTS text from a few straggler files (#5326) 2019-05-21 16:22:35 -07:00
crc32c_ppc.c Fix Compilation on ppc64le using Clang 11 (#7713) 2020-12-01 11:21:44 -08:00
crc32c_ppc.h Fix Compilation on ppc64le using Clang 11 (#7713) 2020-12-01 11:21:44 -08:00
crc32c_test.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
crc32c.cc Fix build on FreeBSD/powerpc64(le) (#7732) 2020-12-08 15:31:56 -08:00
crc32c.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
defer_test.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
defer.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
duplicate_detector.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
dynamic_bloom_test.cc Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
dynamic_bloom.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
dynamic_bloom.h Genericize and clean up FastRange (#7436) 2020-09-28 11:35:00 -07:00
fastrange.h Genericize and clean up FastRange (#7436) 2020-09-28 11:35:00 -07:00
file_checksum_helper.cc Refactor with VersionEditHandler (#6581) 2020-11-11 08:00:14 -08:00
file_checksum_helper.h Real fix for race in backup custom checksum checking (#7309) 2020-08-26 10:39:20 -07:00
file_reader_writer_test.cc Remove Legacy and Custom FileWrapper classes from header files (#7851) 2021-01-28 22:10:32 -08:00
filelock_test.cc Fix MSVC-related build issues (#7439) 2020-10-01 09:23:04 -07:00
filter_bench.cc Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
gflags_compat.h Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566) 2020-10-27 10:33:09 -07:00
hash_map.h Change HashMap::Insert()'s value to a const reference (#6567) 2020-03-20 14:59:54 -07:00
hash_test.cc Use NPHash64 in more places (#7632) 2020-11-10 23:42:13 -08:00
hash.cc Integrity protection for live updates to WriteBatch (#7748) 2021-01-29 12:18:58 -08:00
hash.h Integrity protection for live updates to WriteBatch (#7748) 2021-01-29 12:18:58 -08:00
heap_test.cc Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923) 2020-06-03 15:55:03 -07:00
heap.h Avoid self-move-assign in pop operation of binary heap. (#7942) 2021-02-19 13:47:25 -08:00
kv_map.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
log_write_bench.cc Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
math128.h Ribbon: InterleavedSolutionStorage (#7598) 2020-11-03 12:46:36 -08:00
math.h Fix MSVC-related build issues (#7439) 2020-10-01 09:23:04 -07:00
murmurhash.cc C++20 compatibility (#6697) 2020-04-20 13:24:25 -07:00
murmurhash.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
mutexlock.h Prevents Table Cache to open same files more times (#6707) 2020-04-21 13:16:31 -07:00
ppc-opcode.h Remove PATENTS text from a few straggler files (#5326) 2019-05-21 16:22:35 -07:00
random_test.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
random.cc More Makefile Cleanup (#7097) 2020-07-09 14:35:17 -07:00
random.h More Makefile Cleanup (#7097) 2020-07-09 14:35:17 -07:00
rate_limiter_test.cc Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
rate_limiter.cc Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
rate_limiter.h Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
repeatable_thread_test.cc Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
repeatable_thread.h Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
ribbon_alg.h Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889) 2021-02-10 21:04:56 -08:00
ribbon_impl.h Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889) 2021-02-10 21:04:56 -08:00
ribbon_test.cc Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
set_comparator.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
slice_test.cc Handoff checksum Implementation (#7523) 2021-02-10 22:20:32 -08:00
slice_transform_test.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
slice.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
status.cc add string separation while composing error message (#7919) 2021-02-18 12:25:35 -08:00
stderr_logger.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
stop_watch.h Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
string_util.cc Remove unused includes (#7604) 2020-10-28 23:22:27 -07:00
string_util.h Add Struct Type to OptionsTypeInfo (#6425) 2020-05-21 10:58:39 -07:00
thread_list_test.cc fix thread status synchronization in thread_list_test (#7825) 2021-01-04 10:46:24 -08:00
thread_local_test.cc Fix ThreadLocalTest.SequentialReadWriteTest failure when running individually (#6929) 2020-06-04 11:44:09 -07:00
thread_local.cc Fix typo in ThreadData comment (#7131) 2020-07-15 09:23:23 -07:00
thread_local.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
thread_operation.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
threadpool_imp.cc Make it able to lower cpu priority to specific level in threadpool (#6969) 2020-06-13 13:25:20 -07:00
threadpool_imp.h Make it able to lower cpu priority to specific level in threadpool (#6969) 2020-06-13 13:25:20 -07:00
timer_queue_test.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
timer_queue.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
timer_test.cc Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
timer.h Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
user_comparator_wrapper.h Separate internal and user key comparators in BlockIter (#6944) 2020-07-07 17:26:16 -07:00
vector_iterator.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
work_queue_test.cc Add pipelined & parallel compression optimization (#6262) 2020-04-01 16:40:18 -07:00
work_queue.h Revamp cache_bench to resemble a real workload (#6629) 2020-04-03 10:26:49 -07:00
xxh3p.h Fix MSVC-related build issues (#7439) 2020-10-01 09:23:04 -07:00
xxhash.cc Remove unused includes (#7604) 2020-10-28 23:22:27 -07:00
xxhash.h Misc hashing updates / upgrades (#5909) 2019-10-24 17:16:46 -07:00