2016-02-10 00:12:00 +01:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2017-07-16 01:03:42 +02:00
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
2013-10-16 23:59:46 +02:00
|
|
|
//
|
2011-03-18 23:37:00 +01:00
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
|
2019-05-30 23:47:29 +02:00
|
|
|
#include "table/block_based/block_based_table_builder.h"
|
2011-03-18 23:37:00 +01:00
|
|
|
|
|
|
|
#include <assert.h>
|
2013-11-20 01:29:42 +01:00
|
|
|
#include <stdio.h>
|
2013-10-10 20:43:24 +02:00
|
|
|
|
2017-02-07 01:29:29 +01:00
|
|
|
#include <list>
|
2014-03-01 03:19:07 +01:00
|
|
|
#include <map>
|
|
|
|
#include <memory>
|
2014-05-15 23:09:03 +02:00
|
|
|
#include <string>
|
|
|
|
#include <unordered_map>
|
2014-08-16 00:05:09 +02:00
|
|
|
#include <utility>
|
2014-03-01 03:19:07 +01:00
|
|
|
|
|
|
|
#include "db/dbformat.h"
|
2019-05-30 23:47:29 +02:00
|
|
|
#include "index_builder.h"
|
2014-03-01 03:19:07 +01:00
|
|
|
|
2013-09-02 08:23:40 +02:00
|
|
|
#include "rocksdb/cache.h"
|
2013-08-23 17:38:13 +02:00
|
|
|
#include "rocksdb/comparator.h"
|
|
|
|
#include "rocksdb/env.h"
|
2014-03-01 03:19:07 +01:00
|
|
|
#include "rocksdb/flush_block_policy.h"
|
2016-04-21 19:16:28 +02:00
|
|
|
#include "rocksdb/merge_operator.h"
|
2014-03-01 03:19:07 +01:00
|
|
|
#include "rocksdb/table.h"
|
|
|
|
|
2019-05-30 23:47:29 +02:00
|
|
|
#include "table/block_based/block.h"
|
2019-05-31 02:39:43 +02:00
|
|
|
#include "table/block_based/block_based_filter_block.h"
|
2019-05-30 23:47:29 +02:00
|
|
|
#include "table/block_based/block_based_table_factory.h"
|
|
|
|
#include "table/block_based/block_based_table_reader.h"
|
|
|
|
#include "table/block_based/block_builder.h"
|
|
|
|
#include "table/block_based/filter_block.h"
|
New Bloom filter implementation for full and partitioned filters (#6007)
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.
Speed
The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
Single filter net ns/op: 26.2825
Random filter net ns/op: 150.459
Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
Single filter net ns/op: 63.2978
Random filter net ns/op: 188.038
Average FP rate %: 1.13823
Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.
The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
Accuracy
The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.
Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120
Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
Compatibility
Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007
Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).
Differential Revision: D18294749
Pulled By: pdillinger
fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
2019-11-14 01:31:26 +01:00
|
|
|
#include "table/block_based/filter_policy_internal.h"
|
2019-05-30 23:47:29 +02:00
|
|
|
#include "table/block_based/full_filter_block.h"
|
|
|
|
#include "table/block_based/partitioned_filter_block.h"
|
2011-03-18 23:37:00 +01:00
|
|
|
#include "table/format.h"
|
2014-03-01 03:19:07 +01:00
|
|
|
#include "table/table_builder.h"
|
|
|
|
|
2019-05-31 02:39:43 +02:00
|
|
|
#include "memory/memory_allocator.h"
|
2011-03-18 23:37:00 +01:00
|
|
|
#include "util/coding.h"
|
2015-01-09 21:57:11 +01:00
|
|
|
#include "util/compression.h"
|
2011-03-18 23:37:00 +01:00
|
|
|
#include "util/crc32c.h"
|
2013-06-17 19:11:10 +02:00
|
|
|
#include "util/stop_watch.h"
|
2018-07-20 23:34:07 +02:00
|
|
|
#include "util/string_util.h"
|
2014-05-01 20:09:32 +02:00
|
|
|
#include "util/xxhash.h"
|
2011-03-18 23:37:00 +01:00
|
|
|
|
2013-10-04 06:49:15 +02:00
|
|
|
namespace rocksdb {
|
2011-03-18 23:37:00 +01:00
|
|
|
|
2014-05-15 23:09:03 +02:00
|
|
|
extern const std::string kHashIndexPrefixesBlock;
|
|
|
|
extern const std::string kHashIndexPrefixesMetadataBlock;
|
2013-10-10 20:43:24 +02:00
|
|
|
|
2014-03-01 03:19:07 +01:00
|
|
|
typedef BlockBasedTableOptions::IndexType IndexType;
|
2014-04-10 23:19:43 +02:00
|
|
|
|
2014-11-13 20:39:30 +01:00
|
|
|
// Without anonymous namespace here, we fail the warning -Wmissing-prototypes
|
|
|
|
namespace {
|
|
|
|
|
2017-03-07 22:48:02 +01:00
|
|
|
// Create a filter block builder based on its type.
|
|
|
|
FilterBlockBuilder* CreateFilterBlockBuilder(
|
2018-05-21 23:33:55 +02:00
|
|
|
const ImmutableCFOptions& /*opt*/, const MutableCFOptions& mopt,
|
|
|
|
const BlockBasedTableOptions& table_opt,
|
2018-08-10 01:49:45 +02:00
|
|
|
const bool use_delta_encoding_for_index_values,
|
2017-03-07 22:48:02 +01:00
|
|
|
PartitionedIndexBuilder* const p_index_builder) {
|
2014-09-08 19:37:05 +02:00
|
|
|
if (table_opt.filter_policy == nullptr) return nullptr;
|
|
|
|
|
|
|
|
FilterBitsBuilder* filter_bits_builder =
|
2019-11-14 01:31:26 +01:00
|
|
|
FilterBuildingContext(table_opt).GetBuilder();
|
2014-09-08 19:37:05 +02:00
|
|
|
if (filter_bits_builder == nullptr) {
|
2018-05-21 23:33:55 +02:00
|
|
|
return new BlockBasedFilterBlockBuilder(mopt.prefix_extractor.get(),
|
|
|
|
table_opt);
|
2014-09-08 19:37:05 +02:00
|
|
|
} else {
|
2017-03-07 22:48:02 +01:00
|
|
|
if (table_opt.partition_filters) {
|
|
|
|
assert(p_index_builder != nullptr);
|
2017-07-02 19:36:10 +02:00
|
|
|
// Since it takes time after the filter builder requests a partition cut
|
|
|
|
// until the index builder actually cuts the partition, we take the lower bound
|
|
|
|
// as partition size.
|
|
|
|
assert(table_opt.block_size_deviation <= 100);
|
2019-03-28 00:13:08 +01:00
|
|
|
auto partition_size =
|
|
|
|
static_cast<uint32_t>(((table_opt.metadata_block_size *
|
|
|
|
(100 - table_opt.block_size_deviation)) +
|
|
|
|
99) /
|
|
|
|
100);
|
2017-07-12 18:27:12 +02:00
|
|
|
partition_size = std::max(partition_size, static_cast<uint32_t>(1));
|
2017-03-07 22:48:02 +01:00
|
|
|
return new PartitionedFilterBlockBuilder(
|
2018-05-21 23:33:55 +02:00
|
|
|
mopt.prefix_extractor.get(), table_opt.whole_key_filtering,
|
2017-03-07 22:48:02 +01:00
|
|
|
filter_bits_builder, table_opt.index_block_restart_interval,
|
2018-08-10 01:49:45 +02:00
|
|
|
use_delta_encoding_for_index_values, p_index_builder, partition_size);
|
2017-03-07 22:48:02 +01:00
|
|
|
} else {
|
2018-05-21 23:33:55 +02:00
|
|
|
return new FullFilterBlockBuilder(mopt.prefix_extractor.get(),
|
2017-03-07 22:48:02 +01:00
|
|
|
table_opt.whole_key_filtering,
|
|
|
|
filter_bits_builder);
|
|
|
|
}
|
2014-09-08 19:37:05 +02:00
|
|
|
}
|
|
|
|
}
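The partition-size arithmetic in the partitioned-filter branch above can be read in isolation as a rounded-up percentage with a floor of 1. The following is a minimal sketch of that calculation only; `LowerBoundPartitionSize` is a hypothetical name that does not exist in this file, and the parameter types are assumptions chosen to match how the options are used above.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical standalone helper mirroring the arithmetic above.
// Assumes block_size_deviation is in [0, 100], as the assert requires.
uint32_t LowerBoundPartitionSize(uint64_t metadata_block_size,
                                 int block_size_deviation) {
  // Scale by the percentage that remains after subtracting the deviation.
  uint64_t scaled = metadata_block_size *
                    static_cast<uint64_t>(100 - block_size_deviation);
  // Adding 99 before dividing by 100 rounds the result up.
  uint32_t size = static_cast<uint32_t>((scaled + 99) / 100);
  // A deviation of 100 would yield 0, so clamp to at least 1.
  return std::max(size, static_cast<uint32_t>(1));
}
```

With a 4096-byte metadata block and a deviation of 10, the exact value 3686.4 rounds up to 3687; a deviation of 100 is clamped to 1.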
|
|
|
|
|
2014-03-01 03:19:07 +01:00
|
|
|
bool GoodCompressionRatio(size_t compressed_size, size_t raw_size) {
|
2013-10-10 20:43:24 +02:00
|
|
|
// Check whether compression saved more than 12.5% of the raw size
|
|
|
|
return compressed_size < raw_size - (raw_size / 8u);
|
|
|
|
}
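To make the 12.5% threshold concrete, here is a small sketch that repeats the same comparison under a different, hypothetical name (`GoodCompressionRatioSketch`) so it can be compiled on its own. The expression is copied from the function above; only the name is new.

```cpp
#include <cstddef>

// Same comparison as GoodCompressionRatio above, renamed so the sketch
// can stand alone. raw_size / 8 is one eighth of the raw size (12.5%).
bool GoodCompressionRatioSketch(size_t compressed_size, size_t raw_size) {
  return compressed_size < raw_size - (raw_size / 8u);
}
```

For a 4096-byte block the cutoff is 4096 - 512 = 3584, so 3583 bytes passes and 3584 does not. Because the division is integral, a block smaller than 8 bytes has a cutoff equal to its raw size, and any reduction at all counts as good.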
|
|
|
|
|
2019-03-18 20:07:35 +01:00
|
|
|
bool CompressBlockInternal(const Slice& raw,
|
|
|
|
const CompressionInfo& compression_info,
|
|
|
|
uint32_t format_version,
|
|
|
|
std::string* compressed_output) {
|
2014-03-01 03:19:07 +01:00
|
|
|
// Will return compressed block contents if (1) the compression method is
|
|
|
|
// supported in this platform and (2) the compression rate is "good enough".
|
2019-01-19 04:10:17 +01:00
|
|
|
switch (compression_info.type()) {
|
2014-03-01 03:19:07 +01:00
|
|
|
case kSnappyCompression:
|
2019-03-18 20:07:35 +01:00
|
|
|
return Snappy_Compress(compression_info, raw.data(), raw.size(),
|
|
|
|
compressed_output);
|
2014-03-01 03:19:07 +01:00
|
|
|
case kZlibCompression:
|
2019-03-18 20:07:35 +01:00
|
|
|
return Zlib_Compress(
|
|
|
|
compression_info,
|
|
|
|
GetCompressFormatForVersion(kZlibCompression, format_version),
|
|
|
|
raw.data(), raw.size(), compressed_output);
|
2014-03-01 03:19:07 +01:00
|
|
|
case kBZip2Compression:
|
2019-03-18 20:07:35 +01:00
|
|
|
return BZip2_Compress(
|
|
|
|
compression_info,
|
|
|
|
GetCompressFormatForVersion(kBZip2Compression, format_version),
|
|
|
|
raw.data(), raw.size(), compressed_output);
|
2014-03-01 03:19:07 +01:00
|
|
|
case kLZ4Compression:
|
2019-03-18 20:07:35 +01:00
|
|
|
return LZ4_Compress(
|
|
|
|
compression_info,
|
|
|
|
GetCompressFormatForVersion(kLZ4Compression, format_version),
|
|
|
|
raw.data(), raw.size(), compressed_output);
|
2014-03-01 03:19:07 +01:00
|
|
|
case kLZ4HCCompression:
|
2019-03-18 20:07:35 +01:00
|
|
|
return LZ4HC_Compress(
|
|
|
|
compression_info,
|
|
|
|
GetCompressFormatForVersion(kLZ4HCCompression, format_version),
|
|
|
|
raw.data(), raw.size(), compressed_output);
|
2016-04-20 07:54:24 +02:00
|
|
|
case kXpressCompression:
|
2019-03-18 20:07:35 +01:00
|
|
|
return XPRESS_Compress(raw.data(), raw.size(), compressed_output);
|
2016-09-02 00:28:40 +02:00
|
|
|
case kZSTD:
|
2015-08-28 00:40:42 +02:00
|
|
|
case kZSTDNotFinalCompression:
|
2019-03-18 20:07:35 +01:00
|
|
|
return ZSTD_Compress(compression_info, raw.data(), raw.size(),
|
|
|
|
compressed_output);
|
|
|
|
default:
|
|
|
|
// Do not recognize this compression type
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
} // namespace
|
|
|
|
|
|
|
|
// format_version is the block format as defined in include/rocksdb/table.h
|
|
|
|
Slice CompressBlock(const Slice& raw, const CompressionInfo& info,
|
|
|
|
CompressionType* type, uint32_t format_version,
|
|
|
|
bool do_sample, std::string* compressed_output,
|
|
|
|
std::string* sampled_output_fast,
|
|
|
|
std::string* sampled_output_slow) {
|
|
|
|
*type = info.type();
|
|
|
|
|
|
|
|
if (info.type() == kNoCompression && !info.SampleForCompression()) {
|
|
|
|
return raw;
|
2014-03-01 03:19:07 +01:00
|
|
|
}
|
|
|
|
|
2019-03-18 20:07:35 +01:00
|
|
|
// If requested, we sample one in every N blocks with a
|
|
|
|
// fast and slow compression algorithm and report the stats.
|
|
|
|
// The users can use these stats to decide if it is worthwhile
|
|
|
|
// enabling compression and they also get a hint about which
|
|
|
|
// compression algorithm will be beneficial.
|
|
|
|
if (do_sample && info.SampleForCompression() &&
|
|
|
|
Random::GetTLSInstance()->OneIn((int)info.SampleForCompression()) &&
|
|
|
|
sampled_output_fast && sampled_output_slow) {
|
|
|
|
// Sampling with a fast compression algorithm
|
|
|
|
if (LZ4_Supported() || Snappy_Supported()) {
|
|
|
|
CompressionType c =
|
|
|
|
LZ4_Supported() ? kLZ4Compression : kSnappyCompression;
|
|
|
|
CompressionContext context(c);
|
|
|
|
CompressionOptions options;
|
|
|
|
CompressionInfo info_tmp(options, context,
|
|
|
|
CompressionDict::GetEmptyDict(), c,
|
|
|
|
info.SampleForCompression());
|
|
|
|
|
|
|
|
CompressBlockInternal(raw, info_tmp, format_version, sampled_output_fast);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Sampling with a slow but high-compression algorithm
|
|
|
|
if (ZSTD_Supported() || Zlib_Supported()) {
|
|
|
|
CompressionType c = ZSTD_Supported() ? kZSTD : kZlibCompression;
|
|
|
|
CompressionContext context(c);
|
|
|
|
CompressionOptions options;
|
|
|
|
CompressionInfo info_tmp(options, context,
|
|
|
|
CompressionDict::GetEmptyDict(), c,
|
|
|
|
info.SampleForCompression());
|
|
|
|
CompressBlockInternal(raw, info_tmp, format_version, sampled_output_slow);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Actually compress the data
|
|
|
|
if (*type != kNoCompression) {
|
|
|
|
if (CompressBlockInternal(raw, info, format_version, compressed_output) &&
|
|
|
|
GoodCompressionRatio(compressed_output->size(), raw.size())) {
|
|
|
|
return *compressed_output;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Compression method is not supported, or not good
|
|
|
|
// compression ratio, so just fall back to uncompressed form.
|
2014-03-01 03:19:07 +01:00
|
|
|
*type = kNoCompression;
|
|
|
|
return raw;
|
|
|
|
}
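The tail of CompressBlock follows a "compress, then keep the result only if it is worth it" pattern. The sketch below illustrates that control flow with a toy run-length encoder standing in for a real codec. `ToyRleCompress` and `CompressOrFallBack` are hypothetical names invented for this illustration and are not part of RocksDB; the ratio test reuses the expression from GoodCompressionRatio.

```cpp
#include <cstddef>
#include <string>

// Toy run-length encoder standing in for a real codec (hypothetical).
// Each run becomes a (count, byte) pair; counts are capped at 255.
std::string ToyRleCompress(const std::string& raw) {
  std::string out;
  for (size_t i = 0; i < raw.size();) {
    size_t run = 1;
    while (i + run < raw.size() && raw[i + run] == raw[i] && run < 255) {
      ++run;
    }
    out.push_back(static_cast<char>(run));
    out.push_back(raw[i]);
    i += run;
  }
  return out;
}

// Returns the bytes to store and sets *compressed to true only when the
// candidate is more than 12.5% smaller than the raw input.
std::string CompressOrFallBack(const std::string& raw, bool* compressed) {
  std::string candidate = ToyRleCompress(raw);
  *compressed = candidate.size() < raw.size() - (raw.size() / 8u);
  return *compressed ? candidate : raw;
}
```

A run of 100 identical bytes shrinks to 2 bytes and is kept, while "abcdefgh" grows to 16 bytes under this encoder and the raw input is returned instead.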
|
|
|
|
|
2013-12-05 01:35:48 +01:00
|
|
|
// kBlockBasedTableMagicNumber was picked by running
|
2014-05-01 20:09:32 +02:00
|
|
|
// echo rocksdb.table.block_based | sha1sum
|
2013-12-05 00:09:41 +01:00
|
|
|
// and taking the leading 64 bits.
|
2015-07-13 21:11:05 +02:00
|
|
|
// Please note that kBlockBasedTableMagicNumber may also be accessed by other
|
|
|
|
// .cc files
|
|
|
|
// for that reason we declare it extern in the header but to get the space
|
|
|
|
// allocated
|
2015-07-02 01:13:49 +02:00
|
|
|
// it must be defined without extern in exactly one place.
|
|
|
|
const uint64_t kBlockBasedTableMagicNumber = 0x88e241b785f4cff7ull;
|
2014-05-01 20:09:32 +02:00
|
|
|
// We also support reading and writing legacy block based table format (for
|
|
|
|
// backwards compatibility)
|
2015-07-02 01:13:49 +02:00
|
|
|
const uint64_t kLegacyBlockBasedTableMagicNumber = 0xdb4775248b80fb57ull;
|
2013-12-05 00:09:41 +01:00
|
|
|
|
2014-03-01 03:19:07 +01:00
|
|
|
// A collector that collects properties of interest to block-based table.
|
|
|
|
// For now this class looks heavy-weight since we only write one additional
|
|
|
|
// property.
|
2015-04-25 11:14:27 +02:00
|
|
|
// But in the foreseeable future, we will add more and more properties that are
|
2014-03-01 03:19:07 +01:00
|
|
|
// specific to block-based table.
|
|
|
|
class BlockBasedTableBuilder::BlockBasedTablePropertiesCollector
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently, users cannot tell from the TablePropertiesCollector callback whether a key is an add, delete or merge. Add a new function that reports it.
Also refactor the code so that
(1) the table property collector and the internal table property collector are two separate data structures, with the latter now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 19:04:30 +02:00
|
|
|
: public IntTblPropCollector {
|
2014-03-01 03:19:07 +01:00
|
|
|
public:
|
2014-05-15 23:09:03 +02:00
|
|
|
explicit BlockBasedTablePropertiesCollector(
|
2015-02-05 02:03:57 +01:00
|
|
|
BlockBasedTableOptions::IndexType index_type, bool whole_key_filtering,
|
|
|
|
bool prefix_filtering)
|
|
|
|
: index_type_(index_type),
|
|
|
|
whole_key_filtering_(whole_key_filtering),
|
|
|
|
prefix_filtering_(prefix_filtering) {}
|
2014-03-01 03:19:07 +01:00
|
|
|
|
2019-02-14 22:52:47 +01:00
|
|
|
Status InternalAdd(const Slice& /*key*/, const Slice& /*value*/,
|
|
|
|
uint64_t /*file_size*/) override {
|
2014-03-01 03:19:07 +01:00
|
|
|
// Intentionally left blank. Have no interest in collecting stats for
|
|
|
|
// individual key/value pairs.
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
2019-03-18 20:07:35 +01:00
|
|
|
virtual void BlockAdd(uint64_t /* blockRawBytes */,
|
|
|
|
uint64_t /* blockCompressedBytesFast */,
|
|
|
|
uint64_t /* blockCompressedBytesSlow */) override {
|
|
|
|
// Intentionally left blank. No interest in collecting stats for
|
|
|
|
// blocks.
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2019-02-14 22:52:47 +01:00
|
|
|
Status Finish(UserCollectedProperties* properties) override {
|
2014-03-01 03:19:07 +01:00
|
|
|
std::string val;
|
|
|
|
PutFixed32(&val, static_cast<uint32_t>(index_type_));
|
|
|
|
properties->insert({BlockBasedTablePropertyNames::kIndexType, val});
|
2015-02-05 02:03:57 +01:00
|
|
|
properties->insert({BlockBasedTablePropertyNames::kWholeKeyFiltering,
|
|
|
|
whole_key_filtering_ ? kPropTrue : kPropFalse});
|
|
|
|
properties->insert({BlockBasedTablePropertyNames::kPrefixFiltering,
|
|
|
|
prefix_filtering_ ? kPropTrue : kPropFalse});
|
2014-03-01 03:19:07 +01:00
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
// The name of the properties collector can be used for debugging purposes.
|
2019-02-14 22:52:47 +01:00
|
|
|
const char* Name() const override {
|
2014-03-01 03:19:07 +01:00
|
|
|
return "BlockBasedTablePropertiesCollector";
|
|
|
|
}
|
|
|
|
|
2019-02-14 22:52:47 +01:00
|
|
|
UserCollectedProperties GetReadableProperties() const override {
|
2014-03-01 03:19:07 +01:00
|
|
|
// Intentionally left blank.
|
|
|
|
return UserCollectedProperties();
|
|
|
|
}
|
|
|
|
|
|
|
|
private:
|
|
|
|
BlockBasedTableOptions::IndexType index_type_;
|
2015-02-05 02:03:57 +01:00
|
|
|
bool whole_key_filtering_;
|
|
|
|
bool prefix_filtering_;
|
2014-03-01 03:19:07 +01:00
|
|
|
};
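Finish() above stores the index type as a fixed-width 32-bit value appended to a string. The sketch below shows one way such an encoding can work. It assumes little-endian byte order, and `AppendFixed32LE` and `ReadFixed32LE` are hypothetical helpers written for this illustration; the real PutFixed32 lives in util/coding.h and may differ in detail.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for a fixed-width 32-bit encoder.
// Appends exactly four bytes, least significant byte first (assumed order).
void AppendFixed32LE(std::string* dst, uint32_t value) {
  for (int shift = 0; shift < 32; shift += 8) {
    dst->push_back(static_cast<char>((value >> shift) & 0xff));
  }
}

// Decodes the first four bytes of src; src must hold at least four bytes.
uint32_t ReadFixed32LE(const std::string& src) {
  uint32_t value = 0;
  for (int i = 0; i < 4; ++i) {
    value |= static_cast<uint32_t>(static_cast<unsigned char>(src[i]))
             << (8 * i);
  }
  return value;
}
```

Under this scheme an index type with numeric value 2 becomes the four bytes 02 00 00 00, so a reader can recover the property without a length prefix.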
|
|
|
|
|
2013-10-29 01:54:09 +01:00
|
|
|
struct BlockBasedTableBuilder::Rep {
|
2014-09-05 01:18:36 +02:00
|
|
|
const ImmutableCFOptions ioptions;
|
2018-05-21 23:33:55 +02:00
|
|
|
const MutableCFOptions moptions;
|
2014-08-25 23:22:05 +02:00
|
|
|
const BlockBasedTableOptions table_options;
|
2014-01-27 22:53:22 +01:00
|
|
|
const InternalKeyComparator& internal_comparator;
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a thin layer for better portability. Code that is less platform dependent should be moved out of Env. In this patch, I create a wrapper for file readers and writers, and move rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. This will make it easier to maintain multiple Envs in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
|
|
|
WritableFileWriter* file;
|
2013-10-10 20:43:24 +02:00
|
|
|
uint64_t offset = 0;
|
2011-03-18 23:37:00 +01:00
|
|
|
Status status;
|
2018-03-27 05:14:24 +02:00
|
|
|
size_t alignment;
|
2011-03-18 23:37:00 +01:00
|
|
|
BlockBuilder data_block;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-12 04:42:25 +01:00
|
|
|
// Buffers uncompressed data blocks and keys to replay later. Needed when
|
|
|
|
// compression dictionary is enabled so we can finalize the dictionary before
|
|
|
|
// compressing any data blocks.
|
|
|
|
// TODO(ajkr): ideally we don't buffer all keys and all uncompressed data
|
|
|
|
// blocks as it's redundant, but it's easier to implement for now.
|
|
|
|
std::vector<std::pair<std::string, std::vector<std::string>>>
|
|
|
|
data_block_and_keys_buffers;
|
2016-08-20 00:10:31 +02:00
|
|
|
BlockBuilder range_del_block;
|
2014-05-15 23:09:03 +02:00
|
|
|
|
|
|
|
InternalKeySliceTransform internal_prefix_transform;
|
2014-03-01 03:19:07 +01:00
|
|
|
std::unique_ptr<IndexBuilder> index_builder;
|
2017-06-13 19:59:22 +02:00
|
|
|
PartitionedIndexBuilder* p_index_builder_ = nullptr;
|
2014-03-01 03:19:07 +01:00
|
|
|
|
2011-03-18 23:37:00 +01:00
|
|
|
std::string last_key;
|
2019-01-19 04:10:17 +01:00
|
|
|
CompressionType compression_type;
|
2019-03-18 20:07:35 +01:00
|
|
|
uint64_t sample_for_compression;
|
2019-01-19 04:10:17 +01:00
|
|
|
CompressionOptions compression_opts;
|
2019-02-12 04:42:25 +01:00
|
|
|
std::unique_ptr<CompressionDict> compression_dict;
|
2018-06-05 21:51:05 +02:00
|
|
|
CompressionContext compression_ctx;
|
2018-06-04 21:04:52 +02:00
|
|
|
std::unique_ptr<UncompressionContext> verify_ctx;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
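The sampling rule described in this commit message (whole uncompressed blocks, with only the final sample allowed to be partial so the byte budget is respected) can be sketched roughly as below. This is a minimal illustration, not the RocksDB implementation: `Samples`, `CollectSamples`, and `max_sample_bytes` are hypothetical names, and the sketch simply walks the blocks in order, which may differ from how the real builder chooses them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container, not a RocksDB type: the concatenated sample bytes
// plus the length of each individual sample.
struct Samples {
  std::string data;
  std::vector<size_t> lengths;
};

// Hypothetical helper: appends whole blocks to the sample buffer until
// `max_sample_bytes` is reached. Only the last sample may be truncated,
// so the total never exceeds the budget.
inline Samples CollectSamples(const std::vector<std::string>& blocks,
                              size_t max_sample_bytes) {
  Samples out;
  for (const std::string& block : blocks) {
    if (out.data.size() >= max_sample_bytes) {
      break;  // budget exhausted
    }
    const size_t remaining = max_sample_bytes - out.data.size();
    const size_t take = std::min(remaining, block.size());
    if (take == 0) {
      continue;  // skip empty blocks
    }
    out.data.append(block, 0, take);
    out.lengths.push_back(take);
  }
  return out;
}
```

With three 4-byte blocks and a 10-byte budget, the first two blocks are taken whole and the third contributes only 2 bytes.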
2019-02-12 04:42:25 +01:00
|
|
|
std::unique_ptr<UncompressionDict> verify_dict;
|
|
|
|
|
|
|
|
size_t data_begin_offset = 0;
|
|
|
|
|
2013-11-20 01:29:42 +01:00
|
|
|
TableProperties props;
|
2013-10-10 20:43:24 +02:00
|
|
|
|
2019-02-12 04:42:25 +01:00
|
|
|
// States of the builder.
|
|
|
|
//
|
|
|
|
// - `kBuffered`: This is the initial state where zero or more data blocks are
|
|
|
|
// accumulated uncompressed in-memory. From this state, call
|
|
|
|
// `EnterUnbuffered()` to finalize the compression dictionary if enabled,
|
|
|
|
// compress/write out any buffered blocks, and proceed to the `kUnbuffered`
|
|
|
|
// state.
|
|
|
|
//
|
|
|
|
// - `kUnbuffered`: This is the state when compression dictionary is finalized
|
|
|
|
// either because it wasn't enabled in the first place or it's been created
|
|
|
|
// from sampling previously buffered data. In this state, blocks are simply
|
|
|
|
// compressed/written out as they fill up. From this state, call `Finish()`
|
|
|
|
// to complete the file (write meta-blocks, etc.), or `Abandon()` to delete
|
|
|
|
// the partially created file.
|
|
|
|
//
|
|
|
|
// - `kClosed`: This indicates either `Finish()` or `Abandon()` has been
|
|
|
|
// called, so the table builder is no longer usable. We must be in this
|
|
|
|
// state by the time the destructor runs.
|
|
|
|
enum class State {
|
|
|
|
kBuffered,
|
|
|
|
kUnbuffered,
|
|
|
|
kClosed,
|
|
|
|
};
|
|
|
|
State state;
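The state transitions documented in the comment above can be modeled with a small toy class. This is only a sketch under stated assumptions: `ToyBuilder`, `AddBlock`, `buffer_limit`, and `num_written` are hypothetical names, dictionary training is reduced to a comment, and "writing" a block just moves it into a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model of the three builder states; not the real BlockBasedTableBuilder.
class ToyBuilder {
 public:
  enum class State { kBuffered, kUnbuffered, kClosed };

  // Start buffered only when a dictionary is wanted (assumed flag).
  ToyBuilder(bool use_dictionary, size_t buffer_limit)
      : state_(use_dictionary ? State::kBuffered : State::kUnbuffered),
        buffer_limit_(buffer_limit) {}

  void AddBlock(const std::string& block) {
    assert(state_ != State::kClosed);
    if (state_ == State::kBuffered) {
      buffered_.push_back(block);
      buffered_bytes_ += block.size();
      if (buffered_bytes_ >= buffer_limit_) {
        EnterUnbuffered();
      }
    } else {
      written_.push_back(block);  // written out as soon as it is complete
    }
  }

  void Finish() {
    assert(state_ != State::kClosed);
    if (state_ == State::kBuffered) {
      EnterUnbuffered();  // flush whatever is still buffered
    }
    state_ = State::kClosed;
  }

  State state() const { return state_; }
  size_t num_written() const { return written_.size(); }

 private:
  void EnterUnbuffered() {
    // A real builder would sample the buffer and train a dictionary here.
    for (std::string& block : buffered_) {
      written_.push_back(std::move(block));
    }
    buffered_.clear();
    state_ = State::kUnbuffered;
  }

  State state_;
  size_t buffer_limit_;
  size_t buffered_bytes_ = 0;
  std::vector<std::string> buffered_;
  std::vector<std::string> written_;
};
```

A builder constructed with `use_dictionary == false` skips `kBuffered` entirely, which mirrors the comment's note that the dictionary counts as finalized when it was never enabled.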
|
|
|
|
|
2018-08-10 01:49:45 +02:00
|
|
|
const bool use_delta_encoding_for_index_values;
|
2017-03-07 22:48:02 +01:00
|
|
|
std::unique_ptr<FilterBlockBuilder> filter_builder;
|
2013-09-02 08:23:40 +02:00
|
|
|
char compressed_cache_key_prefix[BlockBasedTable::kMaxCacheKeyPrefixSize];
|
|
|
|
size_t compressed_cache_key_prefix_size;
|
2011-03-18 23:37:00 +01:00
|
|
|
|
|
|
|
BlockHandle pending_handle; // Handle to add to index block
|
|
|
|
|
|
|
|
std::string compressed_output;
|
2013-11-08 06:27:21 +01:00
|
|
|
std::unique_ptr<FlushBlockPolicy> flush_block_policy;
|
2016-04-07 08:10:32 +02:00
|
|
|
uint32_t column_family_id;
|
|
|
|
const std::string& column_family_name;
|
2017-06-28 02:02:20 +02:00
|
|
|
uint64_t creation_time = 0;
|
2017-10-27 23:49:40 +02:00
|
|
|
uint64_t oldest_key_time = 0;
|
2019-02-12 04:42:25 +01:00
|
|
|
const uint64_t target_file_size;
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. Because every file is regularly re-compacted, it can help catch corruptions that creep into the DB before they are otherwise noticed. It also helps clean up data older than a certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
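The age check implied by `file_creation_time` and `periodic_compaction_time` can be sketched as a single predicate. This is an illustration only: `IsDueForPeriodicCompaction` is a hypothetical function, and treating a value of 0 as "disabled" or "unknown" is an assumption of the sketch rather than something stated in the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate, not RocksDB code: true when a file has lived longer
// than `periodic_compaction_time` seconds without being rewritten.
inline bool IsDueForPeriodicCompaction(uint64_t file_creation_time,
                                       uint64_t now,
                                       uint64_t periodic_compaction_time) {
  // Assumption: 0 means the feature is off or the creation time is unknown.
  if (periodic_compaction_time == 0 || file_creation_time == 0) {
    return false;
  }
  // Guard against clock skew so the subtraction cannot underflow.
  if (now < file_creation_time) {
    return false;
  }
  return now - file_creation_time > periodic_compaction_time;
}
```

With a 7-day threshold, a file created 8 days ago is due and a file created 6 days ago is not.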
2019-04-11 04:24:25 +02:00
|
|
|
uint64_t file_creation_time = 0;
|
2011-03-18 23:37:00 +01:00
|
|
|
|
A new callback in TablePropertiesCollector to let users know whether an entry is an add, delete, or merge
Summary:
Currently, users cannot tell from the TablePropertiesCollector callback whether a key is an add, delete, or merge. Add a new function that exposes this.
Also refactor the code so that:
(1) the table property collector and the internal table property collector are two separate data structures, with the latter now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 19:04:30 +02:00
|
|
|
std::vector<std::unique_ptr<IntTblPropCollector>> table_properties_collectors;
|
TablePropertiesCollectorFactory
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.
Here's description of task #4296714:
I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.
For example, this table has 3155 entries and 1391828 deleted keys.
The problem with the current approach is that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do so from multiple threads at the same time, and TablePropertiesCollector has no way of knowing which table we're calling it for.
Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.
In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.
Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests
Reviewers: sdong, dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18579
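The factory-per-table idea described in this commit message can be shown with a toy example. This is a sketch, not the RocksDB API: `ToyCollector`, `ToyCollectorFactory`, `AddEntry`, and `CreateCollector` are hypothetical names chosen for illustration. The point is only that each table gets its own collector, so counts no longer accumulate across tables.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical per-table collector: counts entries and deletions for exactly
// one table, so it does not need to be thread-safe.
class ToyCollector {
 public:
  void AddEntry(bool is_delete) {
    ++entries_;
    if (is_delete) {
      ++deleted_;
    }
  }
  uint64_t entries() const { return entries_; }
  uint64_t deleted() const { return deleted_; }

 private:
  uint64_t entries_ = 0;
  uint64_t deleted_ = 0;
};

// Hypothetical factory: shared across tables, but hands out a fresh
// collector for every table that is built.
class ToyCollectorFactory {
 public:
  std::unique_ptr<ToyCollector> CreateCollector() const {
    return std::unique_ptr<ToyCollector>(new ToyCollector());
  }
};
```

Building two tables from the same factory yields two independent sets of counters, which is the property the test plan in the commit message checks for `kDeletedKeys`.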
2014-05-13 21:30:55 +02:00
|
|
|
|
2018-05-21 23:33:55 +02:00
|
|
|
Rep(const ImmutableCFOptions& _ioptions, const MutableCFOptions& _moptions,
|
2014-09-05 01:18:36 +02:00
|
|
|
const BlockBasedTableOptions& table_opt,
|
2015-04-06 19:04:30 +02:00
|
|
|
const InternalKeyComparator& icomparator,
|
|
|
|
const std::vector<std::unique_ptr<IntTblPropCollectorFactory>>*
|
|
|
|
int_tbl_prop_collector_factories,
|
2016-04-07 08:10:32 +02:00
|
|
|
uint32_t _column_family_id, WritableFileWriter* f,
|
2015-10-09 01:57:35 +02:00
|
|
|
const CompressionType _compression_type,
|
2019-03-18 20:07:35 +01:00
|
|
|
const uint64_t _sample_for_compression,
|
2019-02-12 04:42:25 +01:00
|
|
|
const CompressionOptions& _compression_opts, const bool skip_filters,
|
2017-10-24 00:22:05 +02:00
|
|
|
const std::string& _column_family_name, const uint64_t _creation_time,
|
2019-04-11 04:24:25 +02:00
|
|
|
const uint64_t _oldest_key_time, const uint64_t _target_file_size,
|
|
|
|
const uint64_t _file_creation_time)
|
2014-10-31 19:59:54 +01:00
|
|
|
: ioptions(_ioptions),
|
2018-05-21 23:33:55 +02:00
|
|
|
moptions(_moptions),
|
2014-08-25 23:22:05 +02:00
|
|
|
table_options(table_opt),
|
2014-01-27 22:53:22 +01:00
|
|
|
internal_comparator(icomparator),
|
2011-03-18 23:37:00 +01:00
|
|
|
file(f),
|
2018-03-27 05:14:24 +02:00
|
|
|
alignment(table_options.block_align
|
|
|
|
? std::min(table_options.block_size, kDefaultPageSize)
|
|
|
|
: 0),
|
2015-12-16 21:08:30 +01:00
|
|
|
data_block(table_options.block_restart_interval,
|
2018-08-15 23:27:47 +02:00
|
|
|
table_options.use_delta_encoding,
|
|
|
|
false /* use_value_delta_encoding */,
|
|
|
|
icomparator.user_comparator()
|
|
|
|
->CanKeysWithDifferentByteContentsBeEqual()
|
|
|
|
? BlockBasedTableOptions::kDataBlockBinarySearch
|
|
|
|
: table_options.data_block_index_type,
|
|
|
|
table_options.data_block_hash_table_util_ratio),
|
2018-05-21 18:42:49 +02:00
|
|
|
range_del_block(1 /* block_restart_interval */),
|
2018-05-21 23:33:55 +02:00
|
|
|
internal_prefix_transform(_moptions.prefix_extractor.get()),
|
2019-01-19 04:10:17 +01:00
|
|
|
compression_type(_compression_type),
|
2019-03-18 20:07:35 +01:00
|
|
|
sample_for_compression(_sample_for_compression),
|
2019-01-19 04:10:17 +01:00
|
|
|
compression_opts(_compression_opts),
|
2019-02-12 04:42:25 +01:00
|
|
|
compression_dict(),
|
2019-01-19 04:10:17 +01:00
|
|
|
compression_ctx(_compression_type),
|
2019-02-12 04:42:25 +01:00
|
|
|
verify_dict(),
|
2019-02-19 21:12:25 +01:00
|
|
|
// Buffer data blocks only when a compression dictionary is requested.
state((_compression_opts.max_dict_bytes > 0) ? State::kBuffered
|
|
|
|
: State::kUnbuffered),
|
2018-08-10 01:49:45 +02:00
|
|
|
use_delta_encoding_for_index_values(table_opt.format_version >= 4 &&
|
|
|
|
!table_opt.block_align),
|
2017-12-07 20:50:49 +01:00
|
|
|
compressed_cache_key_prefix_size(0),
|
2014-08-25 23:22:05 +02:00
|
|
|
flush_block_policy(
|
|
|
|
table_options.flush_block_policy_factory->NewFlushBlockPolicy(
|
2016-04-07 08:10:32 +02:00
|
|
|
table_options, data_block)),
|
|
|
|
column_family_id(_column_family_id),
|
2017-06-28 02:02:20 +02:00
|
|
|
column_family_name(_column_family_name),
|
2017-10-24 00:22:05 +02:00
|
|
|
creation_time(_creation_time),
|
2019-02-12 04:42:25 +01:00
|
|
|
oldest_key_time(_oldest_key_time),
|
2019-04-11 04:24:25 +02:00
|
|
|
target_file_size(_target_file_size),
|
|
|
|
file_creation_time(_file_creation_time) {
|
2017-03-07 22:48:02 +01:00
|
|
|
if (table_options.index_type ==
|
|
|
|
BlockBasedTableOptions::kTwoLevelIndexSearch) {
|
2017-06-13 19:59:22 +02:00
|
|
|
p_index_builder_ = PartitionedIndexBuilder::CreateIndexBuilder(
|
2018-08-10 01:49:45 +02:00
|
|
|
&internal_comparator, use_delta_encoding_for_index_values,
|
|
|
|
table_options);
|
2017-06-13 19:59:22 +02:00
|
|
|
index_builder.reset(p_index_builder_);
|
2017-03-07 22:48:02 +01:00
|
|
|
} else {
|
|
|
|
index_builder.reset(IndexBuilder::CreateIndexBuilder(
|
|
|
|
table_options.index_type, &internal_comparator,
|
2018-08-10 01:49:45 +02:00
|
|
|
&this->internal_prefix_transform, use_delta_encoding_for_index_values,
|
|
|
|
table_options));
|
2017-03-07 22:48:02 +01:00
|
|
|
}
|
|
|
|
if (skip_filters) {
|
|
|
|
filter_builder = nullptr;
|
|
|
|
} else {
|
2018-05-21 23:33:55 +02:00
|
|
|
filter_builder.reset(CreateFilterBlockBuilder(
|
2018-08-10 01:49:45 +02:00
|
|
|
_ioptions, _moptions, table_options,
|
|
|
|
use_delta_encoding_for_index_values, p_index_builder_));
|
2017-03-07 22:48:02 +01:00
|
|
|
}
|
|
|
|
|
2015-04-06 19:04:30 +02:00
|
|
|
for (auto& collector_factories : *int_tbl_prop_collector_factories) {
|
2014-05-13 21:30:55 +02:00
|
|
|
table_properties_collectors.emplace_back(
|
2015-10-09 01:57:35 +02:00
|
|
|
collector_factories->CreateIntTblPropCollector(column_family_id));
|
2014-05-13 21:30:55 +02:00
|
|
|
}
|
|
|
|
table_properties_collectors.emplace_back(
|
2015-02-05 02:03:57 +01:00
|
|
|
new BlockBasedTablePropertiesCollector(
|
|
|
|
table_options.index_type, table_options.whole_key_filtering,
|
2018-05-21 23:33:55 +02:00
|
|
|
_moptions.prefix_extractor != nullptr));
|
2018-06-04 21:04:52 +02:00
|
|
|
if (table_options.verify_compression) {
|
|
|
|
verify_ctx.reset(new UncompressionContext(UncompressionContext::NoCache(),
|
2019-01-19 04:10:17 +01:00
|
|
|
compression_type));
|
2018-06-04 21:04:52 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
Rep(const Rep&) = delete;
|
|
|
|
Rep& operator=(const Rep&) = delete;
|
|
|
|
|
2018-06-05 21:51:05 +02:00
|
|
|
~Rep() {}
|
2011-03-18 23:37:00 +01:00
|
|
|
};
|
|
|
|
|
2013-11-20 07:00:48 +01:00
|
|
|
BlockBasedTableBuilder::BlockBasedTableBuilder(
|
2018-05-21 23:33:55 +02:00
|
|
|
const ImmutableCFOptions& ioptions, const MutableCFOptions& moptions,
|
2014-09-05 01:18:36 +02:00
|
|
|
const BlockBasedTableOptions& table_options,
|
A new callback in TablePropertiesCollector to let users know whether an entry is an add, delete, or merge
Summary:
Currently, users cannot tell from the TablePropertiesCollector callback whether a key is an add, delete, or merge. Add a new function that provides this information.
Also refactor the code so that
(1) the table property collector and the internal table property collector are two separate data structures, with the latter now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both the old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 19:04:30 +02:00
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
const std::vector<std::unique_ptr<IntTblPropCollectorFactory>>*
|
|
|
|
int_tbl_prop_collector_factories,
|
2015-10-09 01:57:35 +02:00
|
|
|
uint32_t column_family_id, WritableFileWriter* file,
|
|
|
|
const CompressionType compression_type,
|
2019-03-18 20:07:35 +01:00
|
|
|
const uint64_t sample_for_compression,
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-12 04:42:25 +01:00
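The buffered/unbuffered life cycle described in the commit message above can be sketched as a small state machine. This is an illustrative toy under stated assumptions: DemoBufferedBuilder, its byte limit, and its in-memory "written" list are hypothetical and stand in for the real builder, file writer, and dictionary trainer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy builder; names and thresholds are illustrative only.
class DemoBufferedBuilder {
 public:
  enum class State { kBuffered, kUnbuffered, kClosed };

  explicit DemoBufferedBuilder(size_t buffer_limit_bytes)
      : buffer_limit_(buffer_limit_bytes) {}

  void AddBlock(const std::string& block) {
    assert(state_ != State::kClosed);
    if (state_ == State::kBuffered) {
      // Hold uncompressed blocks in memory until the limit is exceeded.
      buffered_bytes_ += block.size();
      buffered_.push_back(block);
      if (buffered_bytes_ > buffer_limit_) EnterUnbuffered();
    } else {
      // Unbuffered: blocks are written out as soon as they arrive.
      written_.push_back(block);
    }
  }

  void Finish() {
    if (state_ == State::kBuffered) EnterUnbuffered();
    state_ = State::kClosed;
  }

  State state() const { return state_; }
  const std::vector<std::string>& written() const { return written_; }

 private:
  void EnterUnbuffered() {
    // A real builder would sample buffered_ and train a dictionary here;
    // this sketch only replays the buffered blocks in order.
    for (std::string& b : buffered_) written_.push_back(std::move(b));
    buffered_.clear();
    state_ = State::kUnbuffered;
  }

  size_t buffer_limit_;
  size_t buffered_bytes_ = 0;
  State state_ = State::kBuffered;
  std::vector<std::string> buffered_;
  std::vector<std::string> written_;
};
```

The transition happens at most once per builder, either when the buffered bytes exceed the limit or when Finish() is called while still buffering.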
|
|
|
const CompressionOptions& compression_opts, const bool skip_filters,
|
2017-10-24 00:22:05 +02:00
|
|
|
const std::string& column_family_name, const uint64_t creation_time,
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help proactively catch any corruption that creeps into the DB, as every file is constantly getting re-compacted. It also helps clean up data older than a certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on levels 0 through the second-to-last, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time`, all files in levels up to the second-to-last keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
2019-04-11 04:24:25 +02:00
|
|
|
const uint64_t oldest_key_time, const uint64_t target_file_size,
|
|
|
|
const uint64_t file_creation_time) {
|
2015-01-13 23:33:04 +01:00
|
|
|
BlockBasedTableOptions sanitized_table_options(table_options);
|
|
|
|
if (sanitized_table_options.format_version == 0 &&
|
|
|
|
sanitized_table_options.checksum != kCRC32c) {
|
2017-03-16 03:22:52 +01:00
|
|
|
ROCKS_LOG_WARN(
|
|
|
|
ioptions.info_log,
|
2015-01-13 23:33:04 +01:00
|
|
|
"Silently converting format_version to 1 because checksum is "
|
|
|
|
"non-default");
|
|
|
|
// silently convert format_version to 1 to keep consistent with current
|
|
|
|
// behavior
|
|
|
|
sanitized_table_options.format_version = 1;
|
|
|
|
}
|
|
|
|
|
2019-04-11 04:24:25 +02:00
|
|
|
rep_ =
|
|
|
|
new Rep(ioptions, moptions, sanitized_table_options, internal_comparator,
|
|
|
|
int_tbl_prop_collector_factories, column_family_id, file,
|
|
|
|
compression_type, sample_for_compression, compression_opts,
|
|
|
|
skip_filters, column_family_name, creation_time, oldest_key_time,
|
|
|
|
target_file_size, file_creation_time);
|
2015-02-17 17:03:45 +01:00
|
|
|
|
2017-03-07 22:48:02 +01:00
|
|
|
if (rep_->filter_builder != nullptr) {
|
|
|
|
rep_->filter_builder->StartBlock(0);
|
2012-04-17 17:36:46 +02:00
|
|
|
}
|
2014-08-25 23:22:05 +02:00
|
|
|
if (table_options.block_cache_compressed.get() != nullptr) {
|
2013-12-03 20:17:58 +01:00
|
|
|
BlockBasedTable::GenerateCachePrefix(
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a thin layer for better portability. Less platform-dependent code should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and move rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. This will make it easier to maintain multiple Envs in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
|
|
|
table_options.block_cache_compressed.get(), file->writable_file(),
|
2013-12-03 20:17:58 +01:00
|
|
|
&rep_->compressed_cache_key_prefix[0],
|
|
|
|
&rep_->compressed_cache_key_prefix_size);
|
2013-09-02 08:23:40 +02:00
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
|
|
|
|
2013-10-29 01:54:09 +01:00
|
|
|
BlockBasedTableBuilder::~BlockBasedTableBuilder() {
|
2019-02-12 04:42:25 +01:00
|
|
|
// Catch errors where caller forgot to call Finish()
|
|
|
|
assert(rep_->state == Rep::State::kClosed);
|
2011-03-18 23:37:00 +01:00
|
|
|
delete rep_;
|
|
|
|
}
|
|
|
|
|
2013-10-29 01:54:09 +01:00
|
|
|
void BlockBasedTableBuilder::Add(const Slice& key, const Slice& value) {
|
2011-03-18 23:37:00 +01:00
|
|
|
Rep* r = rep_;
|
2019-02-12 04:42:25 +01:00
|
|
|
assert(rep_->state != Rep::State::kClosed);
|
2011-03-18 23:37:00 +01:00
|
|
|
if (!ok()) return;
|
2016-08-20 00:10:31 +02:00
|
|
|
ValueType value_type = ExtractValueType(key);
|
|
|
|
if (IsValueType(value_type)) {
|
2019-01-03 00:05:41 +01:00
|
|
|
#ifndef NDEBUG
|
|
|
|
if (r->props.num_entries > r->props.num_range_deletions) {
|
2016-08-20 00:10:31 +02:00
|
|
|
assert(r->internal_comparator.Compare(key, Slice(r->last_key)) > 0);
|
2013-11-08 06:27:21 +01:00
|
|
|
}
|
2019-01-03 00:05:41 +01:00
|
|
|
#endif // NDEBUG
|
2011-03-18 23:37:00 +01:00
|
|
|
|
2016-08-20 00:10:31 +02:00
|
|
|
auto should_flush = r->flush_block_policy->Update(key, value);
|
|
|
|
if (should_flush) {
|
|
|
|
assert(!r->data_block.empty());
|
|
|
|
Flush();
|
|
|
|
|
2019-02-12 04:42:25 +01:00
|
|
|
if (r->state == Rep::State::kBuffered &&
|
|
|
|
r->data_begin_offset > r->target_file_size) {
|
|
|
|
EnterUnbuffered();
|
|
|
|
}
|
|
|
|
|
2016-08-20 00:10:31 +02:00
|
|
|
// Add item to index block.
|
|
|
|
// We do not emit the index entry for a block until we have seen the
|
|
|
|
// first key for the next data block. This allows us to use shorter
|
|
|
|
// keys in the index block. For example, consider a block boundary
|
|
|
|
// between the keys "the quick brown fox" and "the who". We can use
|
|
|
|
// "the r" as the key for the index block entry since it is >= all
|
|
|
|
// entries in the first block and < all entries in subsequent
|
|
|
|
// blocks.
|
2019-02-12 04:42:25 +01:00
|
|
|
if (ok() && r->state == Rep::State::kUnbuffered) {
|
2016-08-20 00:10:31 +02:00
|
|
|
r->index_builder->AddIndexEntry(&r->last_key, &key, r->pending_handle);
|
|
|
|
}
|
|
|
|
}
|
2012-04-17 17:36:46 +02:00
|
|
|
|
2017-03-07 22:48:02 +01:00
|
|
|
// Note: PartitionedFilterBlockBuilder requires key being added to filter
|
|
|
|
// builder after being added to index builder.
|
2019-02-12 04:42:25 +01:00
|
|
|
if (r->state == Rep::State::kUnbuffered && r->filter_builder != nullptr) {
|
2019-06-06 08:07:28 +02:00
|
|
|
size_t ts_sz = r->internal_comparator.user_comparator()->timestamp_size();
|
|
|
|
r->filter_builder->Add(ExtractUserKeyAndStripTimestamp(key, ts_sz));
|
2016-08-20 00:10:31 +02:00
|
|
|
}
|
2013-10-16 20:50:50 +02:00
|
|
|
|
2016-08-20 00:10:31 +02:00
|
|
|
r->last_key.assign(key.data(), key.size());
|
|
|
|
r->data_block.Add(key, value);
|
2019-02-12 04:42:25 +01:00
|
|
|
if (r->state == Rep::State::kBuffered) {
|
|
|
|
// Buffer keys to be replayed during `Finish()` once compression
|
|
|
|
// dictionary has been finalized.
|
|
|
|
if (r->data_block_and_keys_buffers.empty() || should_flush) {
|
|
|
|
r->data_block_and_keys_buffers.emplace_back();
|
|
|
|
}
|
|
|
|
r->data_block_and_keys_buffers.back().second.emplace_back(key.ToString());
|
|
|
|
} else {
|
|
|
|
r->index_builder->OnKeyAdded(key);
|
|
|
|
}
|
2016-08-20 00:10:31 +02:00
|
|
|
NotifyCollectTableCollectorsOnAdd(key, value, r->offset,
|
|
|
|
r->table_properties_collectors,
|
|
|
|
r->ioptions.info_log);
|
|
|
|
|
|
|
|
} else if (value_type == kTypeRangeDeletion) {
|
|
|
|
r->range_del_block.Add(key, value);
|
2016-09-12 23:14:40 +02:00
|
|
|
NotifyCollectTableCollectorsOnAdd(key, value, r->offset,
|
|
|
|
r->table_properties_collectors,
|
|
|
|
r->ioptions.info_log);
|
2016-08-20 00:10:31 +02:00
|
|
|
} else {
|
|
|
|
assert(false);
|
|
|
|
}
|
2019-01-03 00:05:41 +01:00
|
|
|
|
|
|
|
r->props.num_entries++;
|
|
|
|
r->props.raw_key_size += key.size();
|
|
|
|
r->props.raw_value_size += value.size();
|
|
|
|
if (value_type == kTypeDeletion || value_type == kTypeSingleDeletion) {
|
|
|
|
r->props.num_deletions++;
|
|
|
|
} else if (value_type == kTypeRangeDeletion) {
|
|
|
|
r->props.num_deletions++;
|
|
|
|
r->props.num_range_deletions++;
|
|
|
|
} else if (value_type == kTypeMerge) {
|
|
|
|
r->props.num_merge_operands++;
|
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
}
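The index-entry comment inside Add() above says that a short key such as "the r" can separate a block ending in "the quick brown fox" from one starting with "the who". A minimal sketch of that shortening step, assuming plain bytewise ordering on user keys, is shown below. DemoShortSeparator is a hypothetical helper written for this illustration; the real builder delegates this work to the index builder and comparator and operates on internal keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Returns a key k with start <= k < limit under bytewise ordering that is
// no longer than start. Falls back to start when no shorter key is found.
// Illustrative sketch only; assumes start < limit.
inline std::string DemoShortSeparator(const std::string& start,
                                      const std::string& limit) {
  const size_t min_len = std::min(start.size(), limit.size());
  size_t diff = 0;
  // Skip the common prefix.
  while (diff < min_len && start[diff] == limit[diff]) ++diff;
  if (diff >= min_len) return start;  // one key is a prefix of the other
  const uint8_t s = static_cast<uint8_t>(start[diff]);
  const uint8_t l = static_cast<uint8_t>(limit[diff]);
  if (s < 0xff && s + 1 < l) {
    // Bump the first differing byte and drop everything after it.
    std::string out = start.substr(0, diff + 1);
    out[diff] = static_cast<char>(s + 1);
    return out;
  }
  return start;
}
```

With the keys from the comment, the common prefix is "the ", the first differing bytes are 'q' and 'w', and incrementing 'q' gives the separator "the r".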
|
|
|
|
|
2013-10-29 01:54:09 +01:00
|
|
|
void BlockBasedTableBuilder::Flush() {
|
2011-03-18 23:37:00 +01:00
|
|
|
Rep* r = rep_;
|
2019-02-12 04:42:25 +01:00
|
|
|
assert(rep_->state != Rep::State::kClosed);
|
2011-03-18 23:37:00 +01:00
|
|
|
if (!ok()) return;
|
|
|
|
if (r->data_block.empty()) return;
|
Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
2016-04-28 02:36:03 +02:00
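The sampling scheme described in the commit message above (plan sample intervals as if the file had its maximum length, then skip any that fall past the real end of file) can be sketched as follows. Both helpers, their parameters, and the fixed seed are assumptions made for this illustration and do not reproduce the actual RocksDB code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: picks num_samples start offsets uniformly at random,
// assuming the output file will reach max_file_size bytes.
inline std::vector<uint64_t> PlanSampleOffsets(uint64_t max_file_size,
                                               uint64_t sample_len,
                                               size_t num_samples,
                                               uint32_t seed) {
  std::vector<uint64_t> offsets;
  if (sample_len == 0 || max_file_size < sample_len) return offsets;
  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<uint64_t> dist(0, max_file_size - sample_len);
  for (size_t i = 0; i < num_samples; ++i) offsets.push_back(dist(rng));
  return offsets;
}

// Hypothetical helper: keeps only the planned samples that lie entirely
// within the file that was actually written.
inline std::vector<uint64_t> KeepSamplesWithinFile(
    const std::vector<uint64_t>& planned, uint64_t sample_len,
    uint64_t actual_file_size) {
  std::vector<uint64_t> kept;
  for (uint64_t off : planned) {
    if (off + sample_len <= actual_file_size) kept.push_back(off);
  }
  return kept;
}
```

When the file turns out smaller than the planned maximum, fewer samples survive the filter, which is why the commit message notes that the resulting dictionary can be a bit smaller.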
|
|
|
WriteBlock(&r->data_block, &r->pending_handle, true /* is_data_block */);
|
2012-06-28 08:41:33 +02:00
|
|
|
}
|
|
|
|
|
2013-10-29 01:54:09 +01:00
|
|
|
void BlockBasedTableBuilder::WriteBlock(BlockBuilder* block,
|
2016-04-28 02:36:03 +02:00
|
|
|
BlockHandle* handle,
|
|
|
|
bool is_data_block) {
|
|
|
|
WriteBlock(block->Finish(), handle, is_data_block);
|
2014-03-01 03:19:07 +01:00
|
|
|
block->Reset();
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteBlock(const Slice& raw_block_contents,
|
2016-04-28 02:36:03 +02:00
|
|
|
BlockHandle* handle,
|
|
|
|
bool is_data_block) {
|
2011-03-18 23:37:00 +01:00
|
|
|
// File format contains a sequence of blocks where each block has:
|
|
|
|
// block_data: uint8[n]
|
|
|
|
// type: uint8
|
|
|
|
// crc: uint32
|
|
|
|
assert(ok());
|
|
|
|
Rep* r = rep_;
|
|
|
|
|
2019-01-19 04:10:17 +01:00
|
|
|
auto type = r->compression_type;
|
2019-03-18 20:07:35 +01:00
|
|
|
uint64_t sample_for_compression = r->sample_for_compression;
|
2014-06-09 21:26:09 +02:00
|
|
|
Slice block_contents;
|
2016-06-11 03:20:54 +02:00
|
|
|
bool abort_compression = false;
|
2016-08-20 00:10:31 +02:00
|
|
|
|
2019-03-28 00:13:08 +01:00
|
|
|
StopWatchNano timer(
|
|
|
|
r->ioptions.env,
|
|
|
|
ShouldReportDetailedTime(r->ioptions.env, r->ioptions.statistics));
|
2016-07-19 18:44:03 +02:00
|
|
|
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-12 04:42:25 +01:00
|
|
|
if (r->state == Rep::State::kBuffered) {
|
|
|
|
assert(is_data_block);
|
|
|
|
assert(!r->data_block_and_keys_buffers.empty());
|
|
|
|
r->data_block_and_keys_buffers.back().first = raw_block_contents.ToString();
|
|
|
|
r->data_begin_offset += r->data_block_and_keys_buffers.back().first.size();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-06-09 21:26:09 +02:00
|
|
|
if (raw_block_contents.size() < kCompressionSizeLimit) {
|
2019-02-12 04:42:25 +01:00
|
|
|
const CompressionDict* compression_dict;
|
|
|
|
if (!is_data_block || r->compression_dict == nullptr) {
|
|
|
|
compression_dict = &CompressionDict::GetEmptyDict();
|
|
|
|
} else {
|
|
|
|
compression_dict = r->compression_dict.get();
|
|
|
|
}
|
|
|
|
assert(compression_dict != nullptr);
|
|
|
|
CompressionInfo compression_info(r->compression_opts, r->compression_ctx,
|
2019-03-18 20:07:35 +01:00
|
|
|
*compression_dict, type,
|
|
|
|
sample_for_compression);
|
|
|
|
|
|
|
|
std::string sampled_output_fast;
|
|
|
|
std::string sampled_output_slow;
|
|
|
|
block_contents = CompressBlock(
|
|
|
|
raw_block_contents, compression_info, &type,
|
|
|
|
r->table_options.format_version, is_data_block /* do_sample */,
|
|
|
|
&r->compressed_output, &sampled_output_fast, &sampled_output_slow);
|
|
|
|
|
|
|
|
// notify collectors on block add
|
|
|
|
NotifyCollectTableCollectorsOnBlockAdd(
|
|
|
|
r->table_properties_collectors, raw_block_contents.size(),
|
|
|
|
sampled_output_fast.size(), sampled_output_slow.size());
|
2016-06-11 03:20:54 +02:00
|
|
|
|
|
|
|
// Some of the compression algorithms are known to be unreliable. If
|
|
|
|
// the verify_compression flag is set then try to de-compress the
|
|
|
|
// compressed data and compare to the input.
|
|
|
|
if (type != kNoCompression && r->table_options.verify_compression) {
|
|
|
|
// Retrieve the uncompressed contents into a new buffer
|
2019-02-12 04:42:25 +01:00
|
|
|
const UncompressionDict* verify_dict;
|
|
|
|
if (!is_data_block || r->verify_dict == nullptr) {
|
|
|
|
verify_dict = &UncompressionDict::GetEmptyDict();
|
|
|
|
} else {
|
|
|
|
verify_dict = r->verify_dict.get();
|
|
|
|
}
|
|
|
|
assert(verify_dict != nullptr);
|
2016-06-11 03:20:54 +02:00
|
|
|
BlockContents contents;
|
2019-02-12 04:42:25 +01:00
|
|
|
UncompressionInfo uncompression_info(*r->verify_ctx, *verify_dict,
|
|
|
|
r->compression_type);
|
2018-06-05 21:51:05 +02:00
|
|
|
Status stat = UncompressBlockContentsForCompressionType(
|
2019-01-19 04:10:17 +01:00
|
|
|
uncompression_info, block_contents.data(), block_contents.size(),
|
2018-06-05 21:51:05 +02:00
|
|
|
&contents, r->table_options.format_version, r->ioptions);
|
2016-06-11 03:20:54 +02:00
|
|
|
|
|
|
|
if (stat.ok()) {
|
|
|
|
bool compressed_ok = contents.data.compare(raw_block_contents) == 0;
|
|
|
|
if (!compressed_ok) {
|
|
|
|
// The result of the compression was invalid. abort.
|
|
|
|
abort_compression = true;
|
2017-03-16 03:22:52 +01:00
|
|
|
ROCKS_LOG_ERROR(r->ioptions.info_log,
|
|
|
|
"Decompressed block did not match raw block");
|
2016-06-11 03:20:54 +02:00
|
|
|
r->status =
|
|
|
|
Status::Corruption("Decompressed block did not match raw block");
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// Decompression reported an error. abort.
|
|
|
|
r->status = Status::Corruption("Could not decompress");
|
|
|
|
abort_compression = true;
|
|
|
|
}
|
|
|
|
}
|
2014-06-09 21:26:09 +02:00
|
|
|
} else {
|
2016-06-11 03:20:54 +02:00
|
|
|
// Block is too big to be compressed.
|
|
|
|
abort_compression = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Abort compression if the block is too big, or did not pass
|
|
|
|
// verification.
|
|
|
|
if (abort_compression) {
|
2014-09-05 01:18:36 +02:00
|
|
|
RecordTick(r->ioptions.statistics, NUMBER_BLOCK_NOT_COMPRESSED);
|
2014-06-09 21:26:09 +02:00
|
|
|
type = kNoCompression;
|
|
|
|
block_contents = raw_block_contents;
|
2017-12-14 19:17:22 +01:00
|
|
|
} else if (type != kNoCompression) {
|
|
|
|
if (ShouldReportDetailedTime(r->ioptions.env, r->ioptions.statistics)) {
|
2019-02-28 19:14:19 +01:00
|
|
|
RecordTimeToHistogram(r->ioptions.statistics, COMPRESSION_TIMES_NANOS,
|
|
|
|
timer.ElapsedNanos());
|
2017-12-14 19:17:22 +01:00
|
|
|
}
|
2019-02-28 19:14:19 +01:00
|
|
|
RecordInHistogram(r->ioptions.statistics, BYTES_COMPRESSED,
|
|
|
|
raw_block_contents.size());
|
2016-07-19 18:44:03 +02:00
|
|
|
RecordTick(r->ioptions.statistics, NUMBER_BLOCK_COMPRESSED);
|
2019-02-12 02:46:49 +01:00
|
|
|
} else if (type != r->compression_type) {
|
|
|
|
RecordTick(r->ioptions.statistics, NUMBER_BLOCK_NOT_COMPRESSED);
|
2016-07-19 18:44:03 +02:00
|
|
|
}
|
2016-06-11 03:20:54 +02:00
|
|
|
|
2018-03-27 05:14:24 +02:00
|
|
|
WriteRawBlock(block_contents, type, handle, is_data_block);
|
2012-04-17 17:36:46 +02:00
|
|
|
r->compressed_output.clear();
|
2019-02-12 04:42:25 +01:00
|
|
|
if (is_data_block) {
|
|
|
|
if (r->filter_builder != nullptr) {
|
|
|
|
r->filter_builder->StartBlock(r->offset);
|
|
|
|
}
|
|
|
|
r->props.data_size = r->offset;
|
|
|
|
++r->props.num_data_blocks;
|
|
|
|
}
|
2012-04-17 17:36:46 +02:00
|
|
|
}
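The `kBuffered` branch of `WriteBlock` above holds raw data blocks in memory until a dictionary can be trained. The following is only a rough sketch of that two-state flow, not the actual builder: `MiniBuilder`, `buffer_limit`, and the "training" step (concatenating buffered blocks up to `max_dict_bytes`) are hypothetical simplifications, nothing is compressed, and the real builder leaves `kBuffered` from other entry points than this sketch does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the builder's two states.
enum class State { kBuffered, kUnbuffered };

struct MiniBuilder {
  State state = State::kBuffered;
  std::size_t buffer_limit;    // assumed stand-in for the target file size
  std::size_t max_dict_bytes;  // cap on the dictionary size
  std::size_t buffered_bytes = 0;
  std::vector<std::string> buffered_blocks;  // raw blocks held in memory
  std::vector<std::string> written_blocks;   // blocks "written" to the file
  std::string dict;

  MiniBuilder(std::size_t limit, std::size_t dict_bytes)
      : buffer_limit(limit), max_dict_bytes(dict_bytes) {}

  void WriteBlock(const std::string& raw) {
    if (state == State::kBuffered) {
      // Buffered: keep the uncompressed block and return early.
      buffered_blocks.push_back(raw);
      buffered_bytes += raw.size();
      if (buffered_bytes >= buffer_limit) EnterUnbuffered();
      return;
    }
    // Unbuffered: blocks go out as soon as they are handed over.
    written_blocks.push_back(raw);
  }

  void EnterUnbuffered() {
    // Samples are whole blocks; the last one may be cut short so the
    // dictionary never exceeds max_dict_bytes.
    for (const std::string& block : buffered_blocks) {
      if (dict.size() >= max_dict_bytes) break;
      dict.append(block, 0, max_dict_bytes - dict.size());
    }
    // Write out everything that was buffered, then switch state.
    for (std::string& block : buffered_blocks) {
      written_blocks.push_back(std::move(block));
    }
    buffered_blocks.clear();
    state = State::kUnbuffered;
  }

  void Finish() {
    if (state == State::kBuffered) EnterUnbuffered();
  }
};
```

With `MiniBuilder b(10, 6)`, writing `"abcd"` and then `"efghij"` crosses the assumed limit, so the second call triggers the transition and both blocks are written out in order.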
|
|
|
|
|
2013-10-29 01:54:09 +01:00
|
|
|
void BlockBasedTableBuilder::WriteRawBlock(const Slice& block_contents,
|
|
|
|
CompressionType type,
|
2018-03-27 05:14:24 +02:00
|
|
|
BlockHandle* handle,
|
|
|
|
bool is_data_block) {
|
2012-04-17 17:36:46 +02:00
|
|
|
Rep* r = rep_;
|
2014-09-05 01:18:36 +02:00
|
|
|
StopWatch sw(r->ioptions.env, r->ioptions.statistics, WRITE_RAW_BLOCK_MICROS);
|
2011-03-18 23:37:00 +01:00
|
|
|
handle->set_offset(r->offset);
|
|
|
|
handle->set_size(block_contents.size());
|
2017-06-26 22:15:55 +02:00
|
|
|
assert(r->status.ok());
|
2011-03-18 23:37:00 +01:00
|
|
|
r->status = r->file->Append(block_contents);
|
|
|
|
if (r->status.ok()) {
|
|
|
|
char trailer[kBlockTrailerSize];
|
|
|
|
trailer[0] = type;
|
2014-05-01 20:09:32 +02:00
|
|
|
char* trailer_without_type = trailer + 1;
|
2014-08-25 23:22:05 +02:00
|
|
|
switch (r->table_options.checksum) {
|
2014-05-01 20:09:32 +02:00
|
|
|
case kNoChecksum:
|
2017-08-24 04:31:40 +02:00
|
|
|
EncodeFixed32(trailer_without_type, 0);
|
|
|
|
break;
|
2014-05-01 20:09:32 +02:00
|
|
|
case kCRC32c: {
|
|
|
|
auto crc = crc32c::Value(block_contents.data(), block_contents.size());
|
|
|
|
crc = crc32c::Extend(crc, trailer, 1); // Extend to cover block type
|
|
|
|
EncodeFixed32(trailer_without_type, crc32c::Mask(crc));
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case kxxHash: {
|
2019-10-25 02:14:27 +02:00
|
|
|
XXH32_state_t* const state = XXH32_createState();
|
|
|
|
XXH32_reset(state, 0);
|
|
|
|
XXH32_update(state, block_contents.data(),
|
2014-11-11 22:47:22 +01:00
|
|
|
static_cast<uint32_t>(block_contents.size()));
|
2019-10-25 02:14:27 +02:00
|
|
|
XXH32_update(state, trailer, 1); // Extend to cover block type
|
|
|
|
EncodeFixed32(trailer_without_type, XXH32_digest(state));
|
|
|
|
XXH32_freeState(state);
|
2014-05-01 20:09:32 +02:00
|
|
|
break;
|
|
|
|
}
|
2018-11-01 23:39:40 +01:00
|
|
|
case kxxHash64: {
|
|
|
|
XXH64_state_t* const state = XXH64_createState();
|
|
|
|
XXH64_reset(state, 0);
|
|
|
|
XXH64_update(state, block_contents.data(),
|
2019-03-28 00:13:08 +01:00
|
|
|
static_cast<uint32_t>(block_contents.size()));
|
2018-11-01 23:39:40 +01:00
|
|
|
XXH64_update(state, trailer, 1); // Extend to cover block type
|
2019-03-28 00:13:08 +01:00
|
|
|
EncodeFixed32(
|
|
|
|
trailer_without_type,
|
|
|
|
static_cast<uint32_t>(XXH64_digest(state) & // lower 32 bits
|
|
|
|
uint64_t{0xffffffff}));
|
2018-11-01 23:39:40 +01:00
|
|
|
XXH64_freeState(state);
|
|
|
|
break;
|
|
|
|
}
|
2014-05-01 20:09:32 +02:00
|
|
|
}
|
|
|
|
|
2017-06-26 22:15:55 +02:00
|
|
|
assert(r->status.ok());
|
2019-01-30 01:16:53 +01:00
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"BlockBasedTableBuilder::WriteRawBlock:TamperWithChecksum",
|
|
|
|
static_cast<char*>(trailer));
|
2011-03-18 23:37:00 +01:00
|
|
|
r->status = r->file->Append(Slice(trailer, kBlockTrailerSize));
|
2013-09-02 08:23:40 +02:00
|
|
|
if (r->status.ok()) {
|
|
|
|
r->status = InsertBlockInCache(block_contents, type, handle);
|
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
if (r->status.ok()) {
|
|
|
|
r->offset += block_contents.size() + kBlockTrailerSize;
|
2018-03-27 05:14:24 +02:00
|
|
|
if (r->table_options.block_align && is_data_block) {
|
|
|
|
size_t pad_bytes =
|
|
|
|
(r->alignment - ((block_contents.size() + kBlockTrailerSize) &
|
|
|
|
(r->alignment - 1))) &
|
|
|
|
(r->alignment - 1);
|
|
|
|
r->status = r->file->Pad(pad_bytes);
|
|
|
|
if (r->status.ok()) {
|
|
|
|
r->offset += pad_bytes;
|
|
|
|
}
|
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
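As a rough illustration of the layout `WriteRawBlock` produces, the sketch below builds the 5-byte trailer (one type byte followed by a 32-bit checksum) and reproduces the padding arithmetic used when `block_align` is set. `MakeTrailer` and `PadBytes` are hypothetical helper names, the checksum is taken as an already-computed value, the checksum bytes are assumed to be stored little-endian, and `alignment` is assumed to be a power of two.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 1-byte compression type + 4-byte checksum, as in the format comment.
constexpr std::size_t kTrailerSize = 5;

// Hypothetical helper: lays out the trailer bytes. The checksum is
// assumed to be stored least-significant byte first.
std::array<unsigned char, kTrailerSize> MakeTrailer(unsigned char type,
                                                    std::uint32_t checksum) {
  std::array<unsigned char, kTrailerSize> trailer{};
  trailer[0] = type;
  for (int i = 0; i < 4; ++i) {
    trailer[1 + i] = static_cast<unsigned char>((checksum >> (8 * i)) & 0xff);
  }
  return trailer;
}

// Hypothetical helper: bytes of padding needed so that the next block
// starts on an alignment boundary. Requires a power-of-two alignment.
std::size_t PadBytes(std::size_t block_size, std::size_t alignment) {
  const std::size_t mask = alignment - 1;
  return (alignment - ((block_size + kTrailerSize) & mask)) & mask;
}
```

The outer `& mask` is what makes an already aligned block need zero padding rather than a full `alignment` bytes: a 4091-byte block plus its trailer fills a 4096-byte unit exactly, so `PadBytes(4091, 4096)` is 0.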
|
|
|
|
|
2019-03-28 00:13:08 +01:00
|
|
|
Status BlockBasedTableBuilder::status() const { return rep_->status; }
|
2011-03-18 23:37:00 +01:00
|
|
|
|
2018-11-14 02:00:49 +01:00
|
|
|
static void DeleteCachedBlockContents(const Slice& /*key*/, void* value) {
|
|
|
|
BlockContents* bc = reinterpret_cast<BlockContents*>(value);
|
|
|
|
delete bc;
|
2013-09-02 08:23:40 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
//
|
|
|
|
// Make a copy of the block contents and insert into compressed block cache
|
|
|
|
//
|
|
|
|
Status BlockBasedTableBuilder::InsertBlockInCache(const Slice& block_contents,
|
2014-07-16 15:45:49 +02:00
|
|
|
const CompressionType type,
|
|
|
|
const BlockHandle* handle) {
|
2013-09-02 08:23:40 +02:00
|
|
|
Rep* r = rep_;
|
2014-08-25 23:22:05 +02:00
|
|
|
Cache* block_cache_compressed = r->table_options.block_cache_compressed.get();
|
2013-09-02 08:23:40 +02:00
|
|
|
|
|
|
|
if (type != kNoCompression && block_cache_compressed != nullptr) {
|
|
|
|
size_t size = block_contents.size();
|
|
|
|
|
2018-10-03 02:21:54 +02:00
|
|
|
auto ubuf =
|
2018-11-21 20:28:02 +01:00
|
|
|
AllocateBlock(size + 1, block_cache_compressed->memory_allocator());
|
2014-08-16 00:05:09 +02:00
|
|
|
memcpy(ubuf.get(), block_contents.data(), size);
|
2014-07-16 15:45:49 +02:00
|
|
|
ubuf[size] = type;
|
2013-09-02 08:23:40 +02:00
|
|
|
|
2018-11-14 02:00:49 +01:00
|
|
|
BlockContents* block_contents_to_cache =
|
|
|
|
new BlockContents(std::move(ubuf), size);
|
|
|
|
#ifndef NDEBUG
|
|
|
|
block_contents_to_cache->is_raw_block = true;
|
|
|
|
#endif // NDEBUG
|
2013-09-02 08:23:40 +02:00
|
|
|
|
|
|
|
// make cache key by appending the file offset to the cache prefix id
|
|
|
|
char* end = EncodeVarint64(
|
2019-03-28 00:13:08 +01:00
|
|
|
r->compressed_cache_key_prefix + r->compressed_cache_key_prefix_size,
|
|
|
|
handle->offset());
|
|
|
|
Slice key(r->compressed_cache_key_prefix,
|
|
|
|
static_cast<size_t>(end - r->compressed_cache_key_prefix));
|
2013-09-02 08:23:40 +02:00
|
|
|
|
|
|
|
// Insert into compressed block cache.
|
2018-11-14 02:00:49 +01:00
|
|
|
block_cache_compressed->Insert(
|
|
|
|
key, block_contents_to_cache,
|
|
|
|
block_contents_to_cache->ApproximateMemoryUsage(),
|
|
|
|
&DeleteCachedBlockContents);
|
2013-09-02 08:23:40 +02:00
|
|
|
|
|
|
|
// Invalidate OS cache.
|
2014-11-13 20:39:30 +01:00
|
|
|
r->file->InvalidateCache(static_cast<size_t>(r->offset), size);
|
2013-09-02 08:23:40 +02:00
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
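The cache key in `InsertBlockInCache` is the cache prefix followed by the block's file offset encoded as a varint. A minimal sketch of that construction, assuming the usual base-128 varint scheme (7 payload bits per byte, high bit set on every byte except the last); `AppendVarint64` and `MakeCacheKey` are hypothetical names and work on a `std::string` rather than the fixed buffer used above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: appends v to *dst as a base-128 varint,
// least-significant group first.
void AppendVarint64(std::string* dst, std::uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Hypothetical helper: cache key = prefix bytes + varint(offset).
std::string MakeCacheKey(const std::string& prefix, std::uint64_t offset) {
  std::string key = prefix;
  AppendVarint64(&key, offset);
  return key;
}
```

Because the offset is unique per block within a file, two blocks that share a prefix still get distinct keys; small offsets cost a single extra byte.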
|
|
|
|
|
2018-07-20 18:00:33 +02:00
|
|
|
void BlockBasedTableBuilder::WriteFilterBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder) {
|
|
|
|
BlockHandle filter_block_handle;
|
|
|
|
bool empty_filter_block = (rep_->filter_builder == nullptr ||
|
|
|
|
rep_->filter_builder->NumAdded() == 0);
|
2018-03-22 06:56:48 +01:00
|
|
|
if (ok() && !empty_filter_block) {
|
2017-03-07 22:48:02 +01:00
|
|
|
Status s = Status::Incomplete();
|
2018-07-20 18:00:33 +02:00
|
|
|
while (ok() && s.IsIncomplete()) {
|
2018-07-20 23:34:07 +02:00
|
|
|
Slice filter_content =
|
|
|
|
rep_->filter_builder->Finish(filter_block_handle, &s);
|
2017-03-07 22:48:02 +01:00
|
|
|
assert(s.ok() || s.IsIncomplete());
|
2018-07-20 18:00:33 +02:00
|
|
|
rep_->props.filter_size += filter_content.size();
|
2017-03-07 22:48:02 +01:00
|
|
|
WriteRawBlock(filter_content, kNoCompression, &filter_block_handle);
|
|
|
|
}
|
|
|
|
}
|
2018-07-20 18:00:33 +02:00
|
|
|
if (ok() && !empty_filter_block) {
|
|
|
|
// Add mapping from "<filter_block_prefix>.Name" to location
|
|
|
|
// of filter data.
|
|
|
|
std::string key;
|
|
|
|
if (rep_->filter_builder->IsBlockBased()) {
|
|
|
|
key = BlockBasedTable::kFilterBlockPrefix;
|
|
|
|
} else {
|
|
|
|
key = rep_->table_options.partition_filters
|
|
|
|
? BlockBasedTable::kPartitionedFilterBlockPrefix
|
|
|
|
: BlockBasedTable::kFullFilterBlockPrefix;
|
|
|
|
}
|
|
|
|
key.append(rep_->table_options.filter_policy->Name());
|
|
|
|
meta_index_builder->Add(key, filter_block_handle);
|
|
|
|
}
|
|
|
|
}
|
2017-03-07 22:48:02 +01:00
|
|
|
|
2018-07-20 18:00:33 +02:00
|
|
|
void BlockBasedTableBuilder::WriteIndexBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder, BlockHandle* index_block_handle) {
|
2014-05-15 23:09:03 +02:00
|
|
|
IndexBuilder::IndexBlocks index_blocks;
|
2018-07-20 18:00:33 +02:00
|
|
|
auto index_builder_status = rep_->index_builder->Finish(&index_blocks);
|
2017-02-07 01:29:29 +01:00
|
|
|
if (index_builder_status.IsIncomplete()) {
|
|
|
|
// If we have more than one index partition then meta_blocks are not
|
|
|
|
// supported for the index. Currently meta_blocks are used only by
|
|
|
|
// HashIndexBuilder which is not multi-partition.
|
|
|
|
assert(index_blocks.meta_blocks.empty());
|
2018-07-20 18:00:33 +02:00
|
|
|
} else if (ok() && !index_builder_status.ok()) {
|
|
|
|
rep_->status = index_builder_status;
|
2013-10-10 20:43:24 +02:00
|
|
|
}
|
2014-05-15 23:09:03 +02:00
|
|
|
if (ok()) {
|
2018-07-20 18:00:33 +02:00
|
|
|
for (const auto& item : index_blocks.meta_blocks) {
|
|
|
|
BlockHandle block_handle;
|
|
|
|
WriteBlock(item.second, &block_handle, false /* is_data_block */);
|
|
|
|
if (!ok()) {
|
|
|
|
break;
|
2014-09-08 19:37:05 +02:00
|
|
|
}
|
2018-07-20 18:00:33 +02:00
|
|
|
meta_index_builder->Add(item.first, block_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (ok()) {
|
|
|
|
if (rep_->table_options.enable_index_compression) {
|
|
|
|
WriteBlock(index_blocks.index_block_contents, index_block_handle, false);
|
|
|
|
} else {
|
|
|
|
WriteRawBlock(index_blocks.index_block_contents, kNoCompression,
|
|
|
|
index_block_handle);
|
2013-10-10 20:43:24 +02:00
|
|
|
}
|
2018-07-20 18:00:33 +02:00
|
|
|
}
|
|
|
|
// If there are more index partitions, finish them and write them out
|
|
|
|
Status s = index_builder_status;
|
|
|
|
while (ok() && s.IsIncomplete()) {
|
|
|
|
s = rep_->index_builder->Finish(&index_blocks, *index_block_handle);
|
|
|
|
if (!s.ok() && !s.IsIncomplete()) {
|
|
|
|
rep_->status = s;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (rep_->table_options.enable_index_compression) {
|
|
|
|
WriteBlock(index_blocks.index_block_contents, index_block_handle, false);
|
|
|
|
} else {
|
|
|
|
WriteRawBlock(index_blocks.index_block_contents, kNoCompression,
|
|
|
|
index_block_handle);
|
|
|
|
}
|
|
|
|
// The last index_block_handle will be for the partition index block
|
|
|
|
}
|
|
|
|
}
|
2013-10-10 20:43:24 +02:00
|
|
|
|
2018-07-20 18:00:33 +02:00
|
|
|
void BlockBasedTableBuilder::WritePropertiesBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder) {
|
|
|
|
BlockHandle properties_block_handle;
|
|
|
|
if (ok()) {
|
|
|
|
PropertyBlockBuilder property_block_builder;
|
|
|
|
rep_->props.column_family_id = rep_->column_family_id;
|
|
|
|
rep_->props.column_family_name = rep_->column_family_name;
|
2018-07-20 23:34:07 +02:00
|
|
|
rep_->props.filter_policy_name =
|
|
|
|
rep_->table_options.filter_policy != nullptr
|
|
|
|
? rep_->table_options.filter_policy->Name()
|
|
|
|
: "";
|
|
|
|
rep_->props.index_size =
|
2018-08-11 00:14:44 +02:00
|
|
|
rep_->index_builder->IndexSize() + kBlockTrailerSize;
|
2018-07-20 18:00:33 +02:00
|
|
|
rep_->props.comparator_name = rep_->ioptions.user_comparator != nullptr
|
2018-07-20 23:34:07 +02:00
|
|
|
? rep_->ioptions.user_comparator->Name()
|
|
|
|
: "nullptr";
|
|
|
|
rep_->props.merge_operator_name =
|
|
|
|
rep_->ioptions.merge_operator != nullptr
|
|
|
|
? rep_->ioptions.merge_operator->Name()
|
|
|
|
: "nullptr";
|
2018-07-20 18:00:33 +02:00
|
|
|
rep_->props.compression_name =
|
2019-01-19 04:10:17 +01:00
|
|
|
CompressionTypeToString(rep_->compression_type);
|
2019-04-02 23:48:52 +02:00
|
|
|
rep_->props.compression_options =
|
|
|
|
CompressionOptionsToString(rep_->compression_opts);
|
2018-07-20 23:34:07 +02:00
|
|
|
rep_->props.prefix_extractor_name =
|
|
|
|
rep_->moptions.prefix_extractor != nullptr
|
|
|
|
? rep_->moptions.prefix_extractor->Name()
|
|
|
|
: "nullptr";
|
2018-07-20 18:00:33 +02:00
|
|
|
|
|
|
|
std::string property_collectors_names = "[";
|
|
|
|
for (size_t i = 0;
|
|
|
|
i < rep_->ioptions.table_properties_collector_factories.size(); ++i) {
|
|
|
|
if (i != 0) {
|
|
|
|
property_collectors_names += ",";
|
2016-04-28 02:36:03 +02:00
|
|
|
}
|
2018-07-20 18:00:33 +02:00
|
|
|
property_collectors_names +=
|
|
|
|
rep_->ioptions.table_properties_collector_factories[i]->Name();
|
|
|
|
}
|
|
|
|
property_collectors_names += "]";
|
|
|
|
rep_->props.property_collectors_names = property_collectors_names;
|
|
|
|
if (rep_->table_options.index_type ==
|
|
|
|
BlockBasedTableOptions::kTwoLevelIndexSearch) {
|
|
|
|
assert(rep_->p_index_builder_ != nullptr);
|
|
|
|
rep_->props.index_partitions = rep_->p_index_builder_->NumPartitions();
|
|
|
|
rep_->props.top_level_index_size =
|
2018-08-11 00:14:44 +02:00
|
|
|
rep_->p_index_builder_->TopLevelIndexSize(rep_->offset);
|
2018-07-20 18:00:33 +02:00
|
|
|
}
|
|
|
|
rep_->props.index_key_is_user_key =
|
|
|
|
!rep_->index_builder->seperator_is_key_plus_seq();
|
2018-08-10 01:49:45 +02:00
|
|
|
rep_->props.index_value_is_delta_encoded =
|
|
|
|
rep_->use_delta_encoding_for_index_values;
|
2018-07-20 18:00:33 +02:00
|
|
|
rep_->props.creation_time = rep_->creation_time;
|
|
|
|
rep_->props.oldest_key_time = rep_->oldest_key_time;
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. Because every file is regularly re-compacted, it can help catch corruptions that creep into the DB early. It also helps clean up data older than a certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
2019-04-11 04:24:25 +02:00
|
|
|
rep_->props.file_creation_time = rep_->file_creation_time;
|
2018-07-20 18:00:33 +02:00
|
|
|
|
|
|
|
// Add basic properties
|
|
|
|
property_block_builder.AddTableProperty(rep_->props);
|
2016-08-20 00:10:31 +02:00
|
|
|
|
2018-07-20 18:00:33 +02:00
|
|
|
// Add user collected properties
|
|
|
|
NotifyCollectTableCollectorsOnFinish(rep_->table_properties_collectors,
|
|
|
|
rep_->ioptions.info_log,
|
|
|
|
&property_block_builder);
|
2012-04-17 17:36:46 +02:00
|
|
|
|
2018-07-20 18:00:33 +02:00
|
|
|
WriteRawBlock(property_block_builder.Finish(), kNoCompression,
|
|
|
|
&properties_block_handle);
|
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
if (ok()) {
|
2019-02-11 20:37:07 +01:00
|
|
|
#ifndef NDEBUG
|
|
|
|
{
|
|
|
|
uint64_t props_block_offset = properties_block_handle.offset();
|
|
|
|
uint64_t props_block_size = properties_block_handle.size();
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"BlockBasedTableBuilder::WritePropertiesBlock:GetPropsBlockOffset",
|
|
|
|
&props_block_offset);
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"BlockBasedTableBuilder::WritePropertiesBlock:GetPropsBlockSize",
|
|
|
|
&props_block_size);
|
|
|
|
}
|
|
|
|
#endif // !NDEBUG
|
2018-07-20 18:00:33 +02:00
|
|
|
meta_index_builder->Add(kPropertiesBlock, properties_block_handle);
|
|
|
|
}
|
|
|
|
}
|
2017-02-07 01:29:29 +01:00
|
|
|
|
2018-07-20 18:00:33 +02:00
|
|
|
void BlockBasedTableBuilder::WriteCompressionDictBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder) {
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-12 04:42:25 +01:00
|
|
|
if (rep_->compression_dict != nullptr &&
|
|
|
|
rep_->compression_dict->GetRawDict().size()) {
|
2018-07-20 18:00:33 +02:00
|
|
|
BlockHandle compression_dict_block_handle;
|
|
|
|
if (ok()) {
|
2019-02-12 04:42:25 +01:00
|
|
|
WriteRawBlock(rep_->compression_dict->GetRawDict(), kNoCompression,
|
2018-07-20 18:00:33 +02:00
|
|
|
&compression_dict_block_handle);
|
2019-02-12 04:42:25 +01:00
|
|
|
#ifndef NDEBUG
|
|
|
|
Slice compression_dict = rep_->compression_dict->GetRawDict();
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"BlockBasedTableBuilder::WriteCompressionDictBlock:RawDict",
|
|
|
|
&compression_dict);
|
|
|
|
#endif // !NDEBUG
|
2018-01-11 00:06:29 +01:00
|
|
|
}
|
2018-07-20 18:00:33 +02:00
|
|
|
if (ok()) {
|
|
|
|
meta_index_builder->Add(kCompressionDictBlock,
|
|
|
|
compression_dict_block_handle);
|
2017-02-07 01:29:29 +01:00
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
2018-07-20 18:00:33 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteRangeDelBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder) {
|
|
|
|
if (ok() && !rep_->range_del_block.empty()) {
|
|
|
|
BlockHandle range_del_block_handle;
|
|
|
|
WriteRawBlock(rep_->range_del_block.Finish(), kNoCompression,
|
|
|
|
&range_del_block_handle);
|
|
|
|
meta_index_builder->Add(kRangeDelBlock, range_del_block_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-12-07 22:15:09 +01:00
|
|
|
void BlockBasedTableBuilder::WriteFooter(BlockHandle& metaindex_block_handle,
|
|
|
|
BlockHandle& index_block_handle) {
|
|
|
|
Rep* r = rep_;
|
|
|
|
// No need to write out new footer if we're using default checksum.
|
|
|
|
// We're writing legacy magic number because we want old versions of RocksDB
|
|
|
|
// to be able to read files generated with a new release (just in case
|
|
|
|
// somebody wants to roll back after an upgrade)
|
|
|
|
// TODO(icanadi) at some point in the future, when we're absolutely sure
|
|
|
|
// nobody will roll back to RocksDB 2.x versions, retire the legacy magic
|
|
|
|
// number and always write new table files with new magic number
|
|
|
|
bool legacy = (r->table_options.format_version == 0);
|
|
|
|
// this is guaranteed by BlockBasedTableBuilder's constructor
|
|
|
|
assert(r->table_options.checksum == kCRC32c ||
|
|
|
|
r->table_options.format_version != 0);
|
|
|
|
Footer footer(
|
|
|
|
legacy ? kLegacyBlockBasedTableMagicNumber : kBlockBasedTableMagicNumber,
|
|
|
|
r->table_options.format_version);
|
|
|
|
footer.set_metaindex_handle(metaindex_block_handle);
|
|
|
|
footer.set_index_handle(index_block_handle);
|
|
|
|
footer.set_checksum(r->table_options.checksum);
|
|
|
|
std::string footer_encoding;
|
|
|
|
footer.EncodeTo(&footer_encoding);
|
|
|
|
assert(r->status.ok());
|
|
|
|
r->status = r->file->Append(footer_encoding);
|
|
|
|
if (r->status.ok()) {
|
|
|
|
r->offset += footer_encoding.size();
|
|
|
|
}
|
|
|
|
}
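The magic-number choice in `WriteFooter` can be sketched in isolation. This is a minimal illustration, not the RocksDB implementation: `ChooseMagic`, `ChecksumSketch`, and both constants are hypothetical names, and the constant values are placeholders rather than the real magic numbers. It only mirrors the rule visible in the source: format version 0 selects the legacy magic number, and the assertion requires the default checksum in that case.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder values for illustration only; NOT the real RocksDB constants.
constexpr uint64_t kLegacyMagicSketch = 0x1111111111111111ull;
constexpr uint64_t kNewMagicSketch = 0x2222222222222222ull;

// Hypothetical stand-in for the checksum type option.
enum class ChecksumSketch { kCRC32c, kOther };

// Returns the magic number to put in the footer. Format version 0 maps to
// the legacy magic number so that older readers can still open the file.
uint64_t ChooseMagic(uint32_t format_version, ChecksumSketch checksum) {
  const bool legacy = (format_version == 0);
  // Mirrors the source's assertion: the legacy format is only valid
  // together with the default checksum.
  assert(!legacy || checksum == ChecksumSketch::kCRC32c);
  (void)checksum;  // Avoids an unused-parameter warning when NDEBUG is set.
  return legacy ? kLegacyMagicSketch : kNewMagicSketch;
}
```

Calling `ChooseMagic(0, ChecksumSketch::kCRC32c)` yields the legacy placeholder, while any non-zero format version yields the new one regardless of checksum.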
|
|
|
|
|
2019-02-12 04:42:25 +01:00
|
|
|
void BlockBasedTableBuilder::EnterUnbuffered() {
|
|
|
|
Rep* r = rep_;
|
|
|
|
assert(r->state == Rep::State::kBuffered);
|
|
|
|
r->state = Rep::State::kUnbuffered;
|
|
|
|
const size_t kSampleBytes = r->compression_opts.zstd_max_train_bytes > 0
|
|
|
|
? r->compression_opts.zstd_max_train_bytes
|
|
|
|
: r->compression_opts.max_dict_bytes;
|
|
|
|
Random64 generator{r->creation_time};
|
|
|
|
std::string compression_dict_samples;
|
|
|
|
std::vector<size_t> compression_dict_sample_lens;
|
|
|
|
if (!r->data_block_and_keys_buffers.empty()) {
|
|
|
|
while (compression_dict_samples.size() < kSampleBytes) {
|
|
|
|
size_t rand_idx =
|
2019-04-23 00:59:16 +02:00
|
|
|
static_cast<size_t>(
|
|
|
|
generator.Uniform(r->data_block_and_keys_buffers.size()));
|
2019-02-12 04:42:25 +01:00
|
|
|
size_t copy_len =
|
|
|
|
std::min(kSampleBytes - compression_dict_samples.size(),
|
|
|
|
r->data_block_and_keys_buffers[rand_idx].first.size());
|
|
|
|
compression_dict_samples.append(
|
|
|
|
r->data_block_and_keys_buffers[rand_idx].first, 0, copy_len);
|
|
|
|
compression_dict_sample_lens.emplace_back(copy_len);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// final data block flushed, now we can generate dictionary from the samples.
|
|
|
|
// OK if compression_dict_samples is empty, we'll just get empty dictionary.
|
|
|
|
std::string dict;
|
|
|
|
if (r->compression_opts.zstd_max_train_bytes > 0) {
|
|
|
|
dict = ZSTD_TrainDictionary(compression_dict_samples,
|
|
|
|
compression_dict_sample_lens,
|
|
|
|
r->compression_opts.max_dict_bytes);
|
|
|
|
} else {
|
|
|
|
dict = std::move(compression_dict_samples);
|
|
|
|
}
|
|
|
|
r->compression_dict.reset(new CompressionDict(dict, r->compression_type,
|
|
|
|
r->compression_opts.level));
|
|
|
|
r->verify_dict.reset(new UncompressionDict(
|
|
|
|
dict, r->compression_type == kZSTD ||
|
|
|
|
r->compression_type == kZSTDNotFinalCompression));
|
|
|
|
|
|
|
|
for (size_t i = 0; ok() && i < r->data_block_and_keys_buffers.size(); ++i) {
|
|
|
|
const auto& data_block = r->data_block_and_keys_buffers[i].first;
|
|
|
|
auto& keys = r->data_block_and_keys_buffers[i].second;
|
|
|
|
assert(!data_block.empty());
|
|
|
|
assert(!keys.empty());
|
|
|
|
|
|
|
|
for (const auto& key : keys) {
|
|
|
|
if (r->filter_builder != nullptr) {
|
|
|
|
r->filter_builder->Add(ExtractUserKey(key));
|
|
|
|
}
|
|
|
|
r->index_builder->OnKeyAdded(key);
|
|
|
|
}
|
|
|
|
WriteBlock(Slice(data_block), &r->pending_handle, true /* is_data_block */);
|
|
|
|
if (ok() && i + 1 < r->data_block_and_keys_buffers.size()) {
|
|
|
|
Slice first_key_in_next_block =
|
|
|
|
r->data_block_and_keys_buffers[i + 1].second.front();
|
|
|
|
Slice* first_key_in_next_block_ptr = &first_key_in_next_block;
|
|
|
|
r->index_builder->AddIndexEntry(&keys.back(), first_key_in_next_block_ptr,
|
|
|
|
r->pending_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
r->data_block_and_keys_buffers.clear();
|
|
|
|
}
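The sampling loop at the start of `EnterUnbuffered()` can be illustrated with a standalone sketch. It assumes a plain vector of block strings and uses `std::mt19937_64` in place of RocksDB's `Random64`; `DictSamples` and `SampleBlocks` are hypothetical names. The idea it shows is the one described in the commit message: whole blocks are drawn at random (with replacement) until the byte budget is reached, and only the final sample may be truncated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Hypothetical result container; not a RocksDB type.
struct DictSamples {
  std::string data;                  // Concatenated sample bytes.
  std::vector<std::size_t> lengths;  // Length of each sample within `data`.
};

// Draws blocks uniformly at random (with replacement) until `budget` bytes
// have been collected. Every sample is a whole block except possibly the
// last one, which is truncated so the total never exceeds `budget`.
DictSamples SampleBlocks(const std::vector<std::string>& blocks,
                         std::size_t budget, uint64_t seed) {
  DictSamples out;
  // Guard added for the sketch: without any non-empty block the loop below
  // could never make progress.
  const bool has_data =
      std::any_of(blocks.begin(), blocks.end(),
                  [](const std::string& b) { return !b.empty(); });
  if (!has_data) return out;

  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, blocks.size() - 1);
  while (out.data.size() < budget) {
    const std::string& block = blocks[pick(rng)];
    if (block.empty()) continue;
    const std::size_t copy_len =
        std::min(budget - out.data.size(), block.size());
    assert(copy_len > 0);
    out.data.append(block, 0, copy_len);
    out.lengths.push_back(copy_len);
  }
  return out;
}
```

The per-sample lengths are kept alongside the concatenated bytes because a dictionary trainer typically needs to know where each sample ends.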
|
|
|
|
|
2018-07-20 18:00:33 +02:00
|
|
|
Status BlockBasedTableBuilder::Finish() {
|
|
|
|
Rep* r = rep_;
|
2019-02-12 04:42:25 +01:00
|
|
|
assert(r->state != Rep::State::kClosed);
|
2018-07-20 18:00:33 +02:00
|
|
|
bool empty_data_block = r->data_block.empty();
|
|
|
|
Flush();
|
2019-02-12 04:42:25 +01:00
|
|
|
if (r->state == Rep::State::kBuffered) {
|
|
|
|
EnterUnbuffered();
|
|
|
|
}
|
2018-07-20 18:00:33 +02:00
|
|
|
// To make sure properties block is able to keep the accurate size of index
|
|
|
|
// block, we will finish writing all index entries first.
|
|
|
|
if (ok() && !empty_data_block) {
|
|
|
|
r->index_builder->AddIndexEntry(
|
|
|
|
&r->last_key, nullptr /* no next data block */, r->pending_handle);
|
|
|
|
}
|
|
|
|
|
2018-12-07 22:15:09 +01:00
|
|
|
// Write meta blocks, metaindex block and footer in the following order.
|
2018-07-20 18:00:33 +02:00
|
|
|
// 1. [meta block: filter]
|
|
|
|
// 2. [meta block: index]
|
|
|
|
// 3. [meta block: compression dictionary]
|
|
|
|
// 4. [meta block: range deletion tombstone]
|
|
|
|
// 5. [meta block: properties]
|
|
|
|
// 6. [metaindex block]
|
2018-12-07 22:15:09 +01:00
|
|
|
// 7. Footer
|
2018-07-20 18:00:33 +02:00
|
|
|
BlockHandle metaindex_block_handle, index_block_handle;
|
|
|
|
MetaIndexBuilder meta_index_builder;
|
|
|
|
WriteFilterBlock(&meta_index_builder);
|
|
|
|
WriteIndexBlock(&meta_index_builder, &index_block_handle);
|
|
|
|
WriteCompressionDictBlock(&meta_index_builder);
|
|
|
|
WriteRangeDelBlock(&meta_index_builder);
|
|
|
|
WritePropertiesBlock(&meta_index_builder);
|
|
|
|
if (ok()) {
|
|
|
|
// flush the meta index block
|
|
|
|
WriteRawBlock(meta_index_builder.Finish(), kNoCompression,
|
|
|
|
&metaindex_block_handle);
|
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
if (ok()) {
|
2018-12-07 22:15:09 +01:00
|
|
|
WriteFooter(metaindex_block_handle, index_block_handle);
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
2019-02-12 04:42:25 +01:00
|
|
|
r->state = Rep::State::kClosed;
|
2011-03-18 23:37:00 +01:00
|
|
|
return r->status;
|
|
|
|
}
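The write order in `Finish()` follows a simple pattern: each meta block is appended to the file, its location is recorded under a name in the metaindex, and the metaindex itself is written afterwards so the footer can point to it. The sketch below models that pattern with an in-memory string as the file. `Handle` and `FileSketch` are hypothetical simplifications and do not reproduce the real `BlockHandle`, `MetaIndexBuilder`, or on-disk encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a block handle: where a block lives in the file.
struct Handle {
  uint64_t offset;
  uint64_t size;
};

// Models an append-only file plus a name -> handle metaindex.
class FileSketch {
 public:
  // Appends raw contents and returns the handle describing their location.
  Handle Append(const std::string& contents) {
    Handle h;
    h.offset = static_cast<uint64_t>(file_.size());
    h.size = static_cast<uint64_t>(contents.size());
    file_ += contents;
    return h;
  }

  // Appends a meta block and registers it by name in the metaindex.
  Handle AppendMeta(const std::string& name, const std::string& contents) {
    assert(metaindex_.count(name) == 0);
    const Handle h = Append(contents);
    metaindex_[name] = h;
    return h;
  }

  const std::map<std::string, Handle>& metaindex() const { return metaindex_; }
  uint64_t size() const { return static_cast<uint64_t>(file_.size()); }

 private:
  std::string file_;
  std::map<std::string, Handle> metaindex_;
};
```

In this model a reader that knows only the metaindex location can find every registered meta block by name, which is why the metaindex is written after all of them.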
|
|
|
|
|
2013-10-29 01:54:09 +01:00
|
|
|
void BlockBasedTableBuilder::Abandon() {
|
2019-02-12 04:42:25 +01:00
|
|
|
assert(rep_->state != Rep::State::kClosed);
|
|
|
|
rep_->state = Rep::State::kClosed;
|
2011-03-18 23:37:00 +01:00
|
|
|
}
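The builder states touched by `EnterUnbuffered()`, `Finish()`, and `Abandon()` form a small state machine. The sketch below is an assumption-laden model of it: `BuilderStateSketch` is a hypothetical class, and the choice of initial state from a `use_dictionary` flag is inferred from the commit message (buffering is used when dictionary compression is in use), not copied from the constructor.

```cpp
#include <cassert>

enum class State { kBuffered, kUnbuffered, kClosed };

// Hypothetical model of the builder's state transitions only; it performs
// no I/O and holds no data.
class BuilderStateSketch {
 public:
  // Assumption: buffering is used only when dictionary compression is on.
  explicit BuilderStateSketch(bool use_dictionary)
      : state_(use_dictionary ? State::kBuffered : State::kUnbuffered) {}

  // kBuffered -> kUnbuffered; in the real builder this is where the
  // dictionary is trained and the buffered blocks are written out.
  void EnterUnbuffered() {
    assert(state_ == State::kBuffered);
    state_ = State::kUnbuffered;
  }

  // Finishing a still-buffered builder first leaves the buffered state.
  void Finish() {
    assert(state_ != State::kClosed);
    if (state_ == State::kBuffered) {
      EnterUnbuffered();
    }
    state_ = State::kClosed;
  }

  // Abandoning goes straight to kClosed from either open state.
  void Abandon() {
    assert(state_ != State::kClosed);
    state_ = State::kClosed;
  }

  State state() const { return state_; }

 private:
  State state_;
};
```

Both `Finish()` and `Abandon()` end in `kClosed`, and the assertions make calling either one twice a programming error, as in the source.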
|
|
|
|
|
2013-10-29 01:54:09 +01:00
|
|
|
uint64_t BlockBasedTableBuilder::NumEntries() const {
|
2013-11-20 01:29:42 +01:00
|
|
|
return rep_->props.num_entries;
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
|
|
|
|
2019-02-12 04:42:25 +01:00
|
|
|
uint64_t BlockBasedTableBuilder::FileSize() const { return rep_->offset; }
|
2011-03-18 23:37:00 +01:00
|
|
|
|
2015-06-04 21:03:40 +02:00
|
|
|
bool BlockBasedTableBuilder::NeedCompact() const {
|
|
|
|
for (const auto& collector : rep_->table_properties_collectors) {
|
|
|
|
if (collector->NeedCompact()) {
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
Add more table properties to EventLogger
Summary:
Example output:
{"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}
Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.
Test Plan: make check + check out the output in the log
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38343
2015-05-13 00:53:55 +02:00
|
|
|
TableProperties BlockBasedTableBuilder::GetTableProperties() const {
|
|
|
|
TableProperties ret = rep_->props;
|
|
|
|
for (const auto& collector : rep_->table_properties_collectors) {
|
|
|
|
for (const auto& prop : collector->GetReadableProperties()) {
|
2015-09-15 18:03:08 +02:00
|
|
|
ret.readable_properties.insert(prop);
|
2015-05-13 00:53:55 +02:00
|
|
|
}
|
2015-09-15 18:03:08 +02:00
|
|
|
collector->Finish(&ret.user_collected_properties);
|
2015-05-13 00:53:55 +02:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
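`GetTableProperties()` merges each collector's readable properties into the result with `insert`. A sketch of that merge, assuming the properties are held in a `std::map<std::string, std::string>`, is shown below; `Props` and `MergeReadable` are hypothetical names. One consequence worth noting is that `std::map::insert` does not overwrite an existing key, so the first value seen for a given name is the one that is kept.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical alias for a name -> value property map.
using Props = std::map<std::string, std::string>;

// Merges every collector's properties into a copy of `base`. Because
// std::map::insert leaves existing keys untouched, earlier entries win
// when two sources report the same property name.
Props MergeReadable(const Props& base, const std::vector<Props>& collectors) {
  Props ret = base;
  for (const Props& collected : collectors) {
    for (const auto& prop : collected) {
      ret.insert(prop);
    }
  }
  return ret;
}
```

If later collectors were meant to override earlier ones, `ret[prop.first] = prop.second` would be the alternative; the sketch keeps `insert` to match the source.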
|
|
|
|
|
2014-05-15 23:09:03 +02:00
|
|
|
const std::string BlockBasedTable::kFilterBlockPrefix = "filter.";
|
2014-09-08 19:37:05 +02:00
|
|
|
const std::string BlockBasedTable::kFullFilterBlockPrefix = "fullfilter.";
|
2017-03-07 22:48:02 +01:00
|
|
|
const std::string BlockBasedTable::kPartitionedFilterBlockPrefix =
|
|
|
|
"partitionedfilter.";
|
2013-10-04 06:49:15 +02:00
|
|
|
} // namespace rocksdb
|