rocksdb/table/block_based/block_based_table_builder.h
Peter Dillinger a7d4bea43a Implement XXH3 block checksum type (#9069)
Summary:
XXH3 is a recent hash function that is extremely fast on large
data, easily faster than crc32c on most x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API, which would cost
extra data movement and active code size because of the hash function's
complexity. This approach got a thumbs-up from Yann Collet.
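
The idea of covering the compression type byte without a streaming hash can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this change: `StandInHash32` is an FNV-1a stand-in used only so the sketch compiles without the xxHash library, and `kIllustrativeOddMultiplier` and `ChecksumWithLastByte` are hypothetical names. The block contents are hashed in one shot, and the trailing byte is then folded into the result with a multiply and XOR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in 32-bit hash (FNV-1a). This is NOT XXH3; it only keeps the sketch
// free of external dependencies.
inline uint32_t StandInHash32(const char* data, size_t len) {
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < len; ++i) {
    h ^= static_cast<uint8_t>(data[i]);
    h *= 16777619u;
  }
  return h;
}

// Hypothetical constant. Any odd multiplier is invertible modulo 2^32, so the
// 256 possible byte values map to 256 distinct 32-bit words.
constexpr uint32_t kIllustrativeOddMultiplier = 0x9E3779B1u;

// Hash the contents in a single call, then fold in one extra byte (here, the
// compression type) without copying the data or using a streaming API.
inline uint32_t ChecksumWithLastByte(const char* data, size_t len,
                                     char last_byte) {
  assert(data != nullptr || len == 0);
  const uint32_t h = StandInHash32(data, len);
  return h ^ (static_cast<uint8_t>(last_byte) * kIllustrativeOddMultiplier);
}
```

With this construction, two blocks with identical contents but different trailing bytes always get different checksums, and the cost over a plain one-shot hash is a single multiply and XOR.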

Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument
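
A rough sketch of what rejecting an unrecognized checksum type could look like. The enum values 0 through 4 follow the labels used in the macrobenchmark results in this message; `SketchChecksumType`, `SketchStatus`, and `ValidateChecksumType` are hypothetical names for illustration and are not the RocksDB API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Values mirror the -checksum_type numbers listed in this commit message.
enum class SketchChecksumType : uint8_t {
  kNoChecksum = 0,
  kCRC32c = 1,
  kxxHash = 2,
  kxxHash64 = 3,
  kXXH3 = 4,
};

// Hypothetical, simplified stand-in for a status object.
struct SketchStatus {
  bool ok;
  std::string message;
};

// Accept only the known enum values; anything else is an invalid argument.
inline SketchStatus ValidateChecksumType(uint8_t raw_value) {
  if (raw_value > static_cast<uint8_t>(SketchChecksumType::kXXH3)) {
    return {false, "Invalid argument: unrecognized ChecksumType " +
                       std::to_string(raw_value)};
  }
  return {true, ""};
}

// Usage example: known values pass, unknown values are rejected up front.
inline void ValidateChecksumTypeExample() {
  assert(ValidateChecksumType(4).ok);
  assert(!ValidateChecksumType(5).ok);
}
```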

This change was split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksums are
likely to be handled through different configuration than ChecksumType.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069

Test Plan:
Tests updated and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in the file footer.

DBTestBase::ChangeOptions (etc.) now has one configuration, down from
two, that changes ChecksumType from the default CRC32c. The point of this
test code is to detect possible interactions among features, and the
likelihood of some bad interaction being detected only by including
configurations other than XXH3 and CRC32c--and then not being detected by
the stress/crash test--is extremely low.

Stress/crash test also updated (run manually long enough to see that it
accepts the new checksum type). db_bench also updated for
microbenchmarking checksums.

 ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)

As you can see, especially once warmed up, xxh3 is fastest: about 24,000 MB/s
in the later rounds versus about 19,000 MB/s for crc32c.
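
The MB/s column above is consistent with ops/sec times bytes per op divided by 2^20, which can be checked against the printed numbers. The short sketch below redoes that arithmetic and computes the xxh3-to-crc32c ratio for the last round; the function names are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Throughput in MB/s as the table appears to report it:
// ops/sec * bytes/op / 2^20.
inline double ThroughputMBps(double ops_per_sec, uint64_t bytes_per_op) {
  return ops_per_sec * static_cast<double>(bytes_per_op) / (1024.0 * 1024.0);
}

// Ratio of two ops/sec figures measured with the same bytes per op.
inline double Speedup(double ops_per_sec_a, double ops_per_sec_b) {
  return ops_per_sec_a / ops_per_sec_b;
}

// Checks the arithmetic against figures copied from the table above.
inline void ThroughputExample() {
  // First crc32c line: 5005220 ops/sec at 4096 bytes per op.
  assert(std::fabs(ThroughputMBps(5005220, 4096) - 19551.6) < 0.1);
  // Last xxh3 line: 6182678 ops/sec at 4096 bytes per op.
  assert(std::fabs(ThroughputMBps(6182678, 4096) - 24151.1) < 0.1);
  // Last round, xxh3 versus crc32c: roughly 1.25x.
  const double s = Speedup(6182678, 4924686);
  assert(s > 1.25 && s < 1.26);
}
```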

 ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

Test

    for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done

Results (ops/sec)

    for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done

results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3

All five configurations land within about 0.2% of one another, so the
checksum type has no measurable effect on fillseq throughput in this test.

Reviewed By: mrambacher

Differential Revision: D31905249

Pulled By: pdillinger

fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
2021-10-28 22:15:17 -07:00


// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#pragma once
#include <stdint.h>
#include <array>
#include <limits>
#include <string>
#include <utility>
#include <vector>
#include "db/version_edit.h"
#include "rocksdb/flush_block_policy.h"
#include "rocksdb/listener.h"
#include "rocksdb/options.h"
#include "rocksdb/status.h"
#include "rocksdb/table.h"
#include "table/meta_blocks.h"
#include "table/table_builder.h"
#include "util/compression.h"
namespace ROCKSDB_NAMESPACE {
class BlockBuilder;
class BlockHandle;
class WritableFile;
struct BlockBasedTableOptions;
extern const uint64_t kBlockBasedTableMagicNumber;
extern const uint64_t kLegacyBlockBasedTableMagicNumber;
class BlockBasedTableBuilder : public TableBuilder {
 public:
  // Create a builder that will store the contents of the table it is
  // building in *file. Does not close the file. It is up to the
  // caller to close the file after calling Finish().
  BlockBasedTableBuilder(const BlockBasedTableOptions& table_options,
                         const TableBuilderOptions& table_builder_options,
                         WritableFileWriter* file);

  // No copying allowed
  BlockBasedTableBuilder(const BlockBasedTableBuilder&) = delete;
  BlockBasedTableBuilder& operator=(const BlockBasedTableBuilder&) = delete;

  // REQUIRES: Either Finish() or Abandon() has been called.
  ~BlockBasedTableBuilder();

  // Add key,value to the table being constructed.
  // REQUIRES: key is after any previously added key according to comparator.
  // REQUIRES: Finish(), Abandon() have not been called
  void Add(const Slice& key, const Slice& value) override;

  // Return non-ok iff some error has been detected.
  Status status() const override;

  // Return non-ok iff some error happens during IO.
  IOStatus io_status() const override;

  // Finish building the table. Stops using the file passed to the
  // constructor after this function returns.
  // REQUIRES: Finish(), Abandon() have not been called
  Status Finish() override;

  // Indicate that the contents of this builder should be abandoned. Stops
  // using the file passed to the constructor after this function returns.
  // If the caller is not going to call Finish(), it must call Abandon()
  // before destroying this builder.
  // REQUIRES: Finish(), Abandon() have not been called
  void Abandon() override;

  // Number of calls to Add() so far.
  uint64_t NumEntries() const override;

  bool IsEmpty() const override;

  // Size of the file generated so far. If invoked after a successful
  // Finish() call, returns the size of the final generated file.
  uint64_t FileSize() const override;

  // Estimated size of the file generated so far. This is used when
  // FileSize() cannot estimate final SST size, e.g. parallel compression
  // is enabled.
  uint64_t EstimatedFileSize() const override;

  bool NeedCompact() const override;

  // Get table properties
  TableProperties GetTableProperties() const override;

  // Get file checksum
  std::string GetFileChecksum() const override;

  // Get file checksum function name
  const char* GetFileChecksumFuncName() const override;

  // Computes and populates block trailer for a block
  static void ComputeBlockTrailer(const Slice& block_contents,
                                  CompressionType compression_type,
                                  ChecksumType checksum_type,
                                  std::array<char, kBlockTrailerSize>* trailer);

 private:
  bool ok() const { return status().ok(); }

  // Transition state from buffered to unbuffered. See `Rep::State` API comment
  // for details of the states.
  // REQUIRES: `rep_->state == kBuffered`
  void EnterUnbuffered();

  // Call block's Finish() method and then
  // - in buffered mode, buffer the uncompressed block contents.
  // - in unbuffered mode, write the compressed block contents to file.
  void WriteBlock(BlockBuilder* block, BlockHandle* handle,
                  BlockType blocktype);

  // Compress and write block content to the file.
  void WriteBlock(const Slice& block_contents, BlockHandle* handle,
                  BlockType block_type);

  // Directly write data to the file.
  void WriteRawBlock(const Slice& data, CompressionType, BlockHandle* handle,
                     BlockType block_type, const Slice* raw_data = nullptr);

  void SetupCacheKeyPrefix(const TableBuilderOptions& tbo);

  template <typename TBlocklike>
  Status InsertBlockInCache(const Slice& block_contents,
                            const BlockHandle* handle, BlockType block_type);

  Status InsertBlockInCacheHelper(const Slice& block_contents,
                                  const BlockHandle* handle,
                                  BlockType block_type);

  Status InsertBlockInCompressedCache(const Slice& block_contents,
                                      const CompressionType type,
                                      const BlockHandle* handle);

  void WriteFilterBlock(MetaIndexBuilder* meta_index_builder);
  void WriteIndexBlock(MetaIndexBuilder* meta_index_builder,
                       BlockHandle* index_block_handle);
  void WritePropertiesBlock(MetaIndexBuilder* meta_index_builder);
  void WriteCompressionDictBlock(MetaIndexBuilder* meta_index_builder);
  void WriteRangeDelBlock(MetaIndexBuilder* meta_index_builder);
  void WriteFooter(BlockHandle& metaindex_block_handle,
                   BlockHandle& index_block_handle);

  struct Rep;
  class BlockBasedTablePropertiesCollectorFactory;
  class BlockBasedTablePropertiesCollector;
  Rep* rep_;

  struct ParallelCompressionRep;

  // Advanced operation: flush any buffered key/value pairs to file.
  // Can be used to ensure that two adjacent entries never live in
  // the same data block. Most clients should not need to use this method.
  // REQUIRES: Finish(), Abandon() have not been called
  void Flush();

  // Some compression libraries fail when the raw size is bigger than int. If
  // uncompressed size is bigger than kCompressionSizeLimit, don't compress it
  const uint64_t kCompressionSizeLimit = std::numeric_limits<int>::max();

  // Get blocks from mem-table walking thread, compress them and
  // pass them to the write thread. Used in parallel compression mode only
  void BGWorkCompression(const CompressionContext& compression_ctx,
                         UncompressionContext* verify_ctx);

  // Given raw block content, try to compress it and return result and
  // compression type
  void CompressAndVerifyBlock(const Slice& raw_block_contents,
                              bool is_data_block,
                              const CompressionContext& compression_ctx,
                              UncompressionContext* verify_ctx,
                              std::string* compressed_output,
                              Slice* result_block_contents,
                              CompressionType* result_compression_type,
                              Status* out_status);

  // Get compressed blocks from BGWorkCompression and write them into SST
  void BGWorkWriteRawBlock();

  // Initialize parallel compression context and
  // start BGWorkCompression and BGWorkWriteRawBlock threads
  void StartParallelCompression();

  // Stop BGWorkCompression and BGWorkWriteRawBlock threads
  void StopParallelCompression();
};

Slice CompressBlock(const Slice& raw, const CompressionInfo& info,
                    CompressionType* type, uint32_t format_version,
                    bool do_sample, std::string* compressed_output,
                    std::string* sampled_output_fast,
                    std::string* sampled_output_slow);

} // namespace ROCKSDB_NAMESPACE
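
The call-order contract documented in the comments above (keys added in increasing order, exactly one of Finish() or Abandon() before destruction) can be illustrated with a toy class. This is only a sketch of the contract: `ToyTableBuilder` and `BuildToyTable` are hypothetical, plain `std::string` ordering stands in for the comparator, and nothing here is the real BlockBasedTableBuilder, which also needs table options and a file writer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in that enforces the documented preconditions with asserts.
class ToyTableBuilder {
 public:
  // REQUIRES: Either Finish() or Abandon() has been called.
  ~ToyTableBuilder() { assert(closed_); }

  // REQUIRES: key sorts after every previously added key.
  // REQUIRES: Finish(), Abandon() have not been called.
  void Add(const std::string& key, const std::string& value) {
    assert(!closed_);
    assert(entries_.empty() || entries_.back().first < key);
    entries_.emplace_back(key, value);
  }

  // Seals the table; returns true to mimic an OK status.
  bool Finish() {
    assert(!closed_);
    closed_ = true;
    return true;
  }

  // Discards the table; required if Finish() will not be called.
  void Abandon() {
    assert(!closed_);
    closed_ = true;
  }

  // Number of calls to Add() so far.
  uint64_t NumEntries() const { return entries_.size(); }

 private:
  std::vector<std::pair<std::string, std::string>> entries_;
  bool closed_ = false;
};

// Usage example: add sorted keys, finish, and report the entry count.
inline uint64_t BuildToyTable() {
  ToyTableBuilder builder;
  builder.Add("apple", "1");
  builder.Add("banana", "2");
  const bool ok = builder.Finish();
  assert(ok);
  return builder.NumEntries();
}
```

On an error path, a caller following this contract would call Abandon() instead of Finish() before the builder goes out of scope.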