// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#pragma once
#include <stdint.h>

#include <array>
#include <limits>
#include <string>
#include <utility>
#include <vector>

#include "db/version_edit.h"
#include "rocksdb/flush_block_policy.h"
#include "rocksdb/listener.h"
#include "rocksdb/options.h"
#include "rocksdb/status.h"
#include "rocksdb/table.h"
#include "table/meta_blocks.h"
#include "table/table_builder.h"
#include "util/compression.h"

namespace ROCKSDB_NAMESPACE {

class BlockBuilder;
class BlockHandle;
class WritableFile;
struct BlockBasedTableOptions;

extern const uint64_t kBlockBasedTableMagicNumber;
extern const uint64_t kLegacyBlockBasedTableMagicNumber;

class BlockBasedTableBuilder : public TableBuilder {
 public:
  // Create a builder that will store the contents of the table it is
  // building in *file. Does not close the file. It is up to the
  // caller to close the file after calling Finish().
  BlockBasedTableBuilder(const BlockBasedTableOptions& table_options,
                         const TableBuilderOptions& table_builder_options,
                         WritableFileWriter* file);

  // No copying allowed
  BlockBasedTableBuilder(const BlockBasedTableBuilder&) = delete;
  BlockBasedTableBuilder& operator=(const BlockBasedTableBuilder&) = delete;

  // REQUIRES: Either Finish() or Abandon() has been called.
  ~BlockBasedTableBuilder();

  // Add key,value to the table being constructed.
  // REQUIRES: key is after any previously added key according to comparator.
  // REQUIRES: Finish(), Abandon() have not been called
  void Add(const Slice& key, const Slice& value) override;

  // Return non-ok iff some error has been detected.
  Status status() const override;

  // Return non-ok iff some error happens during IO.
  IOStatus io_status() const override;

  // Finish building the table. Stops using the file passed to the
  // constructor after this function returns.
  // REQUIRES: Finish(), Abandon() have not been called
  Status Finish() override;

  // Indicate that the contents of this builder should be abandoned. Stops
  // using the file passed to the constructor after this function returns.
  // If the caller is not going to call Finish(), it must call Abandon()
  // before destroying this builder.
  // REQUIRES: Finish(), Abandon() have not been called
  void Abandon() override;

  // Number of calls to Add() so far.
  uint64_t NumEntries() const override;

  bool IsEmpty() const override;

  // Size of the file generated so far. If invoked after a successful
  // Finish() call, returns the size of the final generated file.
  uint64_t FileSize() const override;

  // Estimated size of the file generated so far. This is used when
  // FileSize() cannot estimate final SST size, e.g. parallel compression
  // is enabled.
  uint64_t EstimatedFileSize() const override;

  bool NeedCompact() const override;

  // Get table properties
  TableProperties GetTableProperties() const override;

  // Get file checksum
  std::string GetFileChecksum() const override;

  // Get file checksum function name
  const char* GetFileChecksumFuncName() const override;

 private:
  bool ok() const { return status().ok(); }

  // Transition state from buffered to unbuffered. See `Rep::State` API comment
  // for details of the states.
  // REQUIRES: `rep_->state == kBuffered`
  void EnterUnbuffered();

  // Call block's Finish() method and then
  // - in buffered mode, buffer the uncompressed block contents.
  // - in unbuffered mode, write the compressed block contents to file.
  void WriteBlock(BlockBuilder* block, BlockHandle* handle,
                  BlockType blocktype);

  // Compress and write block content to the file.
  void WriteBlock(const Slice& block_contents, BlockHandle* handle,
                  BlockType block_type);
  // Directly write data to the file.
  void WriteRawBlock(const Slice& data, CompressionType, BlockHandle* handle,
                     BlockType block_type, const Slice* raw_data = nullptr);

  void SetupCacheKeyPrefix(const TableBuilderOptions& tbo);

  template <typename TBlocklike>
  Status InsertBlockInCache(const Slice& block_contents,
                            const BlockHandle* handle, BlockType block_type);

  Status InsertBlockInCacheHelper(const Slice& block_contents,
                                  const BlockHandle* handle,
                                  BlockType block_type);

  Status InsertBlockInCompressedCache(const Slice& block_contents,
                                      const CompressionType type,
                                      const BlockHandle* handle);

  void WriteFilterBlock(MetaIndexBuilder* meta_index_builder);
  void WriteIndexBlock(MetaIndexBuilder* meta_index_builder,
                       BlockHandle* index_block_handle);
  void WritePropertiesBlock(MetaIndexBuilder* meta_index_builder);
  void WriteCompressionDictBlock(MetaIndexBuilder* meta_index_builder);
  void WriteRangeDelBlock(MetaIndexBuilder* meta_index_builder);
  void WriteFooter(BlockHandle& metaindex_block_handle,
                   BlockHandle& index_block_handle);

  struct Rep;
  class BlockBasedTablePropertiesCollectorFactory;
  class BlockBasedTablePropertiesCollector;
  Rep* rep_;

  struct ParallelCompressionRep;

  // Advanced operation: flush any buffered key/value pairs to file.
  // Can be used to ensure that two adjacent entries never live in
  // the same data block. Most clients should not need to use this method.
  // REQUIRES: Finish(), Abandon() have not been called
  void Flush();

  // Some compression libraries fail when the raw size is bigger than int. If
  // uncompressed size is bigger than kCompressionSizeLimit, don't compress it
  const uint64_t kCompressionSizeLimit = std::numeric_limits<int>::max();

  // Get blocks from mem-table walking thread, compress them and
  // pass them to the write thread. Used in parallel compression mode only
  void BGWorkCompression(const CompressionContext& compression_ctx,
                         UncompressionContext* verify_ctx);

  // Given raw block content, try to compress it and return result and
  // compression type
  void CompressAndVerifyBlock(const Slice& raw_block_contents,
                              bool is_data_block,
                              const CompressionContext& compression_ctx,
                              UncompressionContext* verify_ctx,
                              std::string* compressed_output,
                              Slice* result_block_contents,
                              CompressionType* result_compression_type,
                              Status* out_status);

  // Get compressed blocks from BGWorkCompression and write them into SST
  void BGWorkWriteRawBlock();

  // Initialize parallel compression context and
  // start BGWorkCompression and BGWorkWriteRawBlock threads
  void StartParallelCompression();

  // Stop BGWorkCompression and BGWorkWriteRawBlock threads
  void StopParallelCompression();
};

Slice CompressBlock(const Slice& raw, const CompressionInfo& info,
                    CompressionType* type, uint32_t format_version,
                    bool do_sample, std::string* compressed_output,
                    std::string* sampled_output_fast,
                    std::string* sampled_output_slow);

}  // namespace ROCKSDB_NAMESPACE
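
// Illustrative sketch only, not part of RocksDB. The comments on
// BlockBasedTableBuilder describe a call-order contract: keys are added in
// increasing order, and exactly one of Finish() or Abandon() must be called
// before the builder is destroyed. The toy class below mirrors that contract
// so the rules can be seen in isolation. ToyTableBuilder,
// DemoEntriesAccepted and DemoSecondCloseRejected are hypothetical names, the
// bool return values are a simplification, and plain std::string ordering is
// assumed in place of the configured comparator.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in used only to illustrate the documented call order.
// It writes no file and shares no code with BlockBasedTableBuilder.
class ToyTableBuilder {
 public:
  // Mirrors "REQUIRES: Either Finish() or Abandon() has been called."
  ~ToyTableBuilder() { assert(closed_); }

  // Accepts the entry only if the builder is still open and `key` sorts
  // strictly after the previously added key.
  bool Add(const std::string& key, const std::string& value) {
    if (closed_) return false;
    if (num_entries_ > 0 && !(last_key_ < key)) return false;
    last_key_ = key;
    bytes_ += key.size() + value.size();
    ++num_entries_;
    return true;
  }

  // Each returns false if the builder was already finished or abandoned.
  bool Finish() { return Close(); }
  bool Abandon() { return Close(); }

  uint64_t NumEntries() const { return num_entries_; }
  uint64_t FileSize() const { return bytes_; }  // toy size: key + value bytes

 private:
  bool Close() {
    if (closed_) return false;
    closed_ = true;
    return true;
  }

  std::string last_key_;
  uint64_t num_entries_ = 0;
  uint64_t bytes_ = 0;
  bool closed_ = false;
};

// Adds three keys, one of them out of order, then finishes the table.
// Returns the number of entries that were accepted.
inline uint64_t DemoEntriesAccepted() {
  ToyTableBuilder builder;
  builder.Add("apple", "1");
  builder.Add("cherry", "2");
  builder.Add("banana", "3");  // rejected: sorts before "cherry"
  builder.Finish();            // required before the destructor runs
  return builder.NumEntries();
}

// Shows that a finished builder refuses both a second close and further adds.
inline bool DemoSecondCloseRejected() {
  ToyTableBuilder builder;
  builder.Finish();
  return !builder.Abandon() && !builder.Add("a", "b");
}
```

// In the declarations above, Add() takes Slice arguments and returns void,
// and failures are reported through status() and io_status(); the toy's bool
// results only make the contract checks visible without those types.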