Compare commits
4 Commits
main...release_fi
| Author | SHA1 | Date |
|---|---|---|
| | c464412cf8 | |
| | 4411488585 | |
| | 8cc712c0eb | |
| | 62b71b3094 | |
HISTORY.md (14 changes)
@@ -1,8 +1,5 @@
# Rocksdb Change Log
## 7.1.0 (03/21/2022)
### Public API changes
* Add DB::OpenAndTrimHistory API. This API will open DB and trim data to the timestamp specified by trim_ts (data with a timestamp larger than the specified trim bound will be removed). This API should only be used during recovery of timestamp-enabled column families. If a column family doesn't have timestamps enabled, this API won't trim any data on that column family. This API is not compatible with the avoid_flush_during_recovery option.

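A minimal calling sketch for this API (hedged: the parameter order shown is my best reconstruction of the 7.1-era declaration; check `include/rocksdb/db.h` in your release for the exact signature):

```cpp
#include <string>
#include <vector>

#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

// Open the DB and trim everything newer than trim_ts in the
// timestamp-enabled column families. Per the changelog, this is for
// recovery of timestamp-enabled column families only, and it is
// incompatible with avoid_flush_during_recovery.
Status OpenTrimmedToTimestamp(const DBOptions& db_options,
                              const std::string& dbname,
                              const std::vector<ColumnFamilyDescriptor>& cfs,
                              const std::string& trim_ts,  // encoded timestamp
                              std::vector<ColumnFamilyHandle*>* handles,
                              DB** dbptr) {
  return DB::OpenAndTrimHistory(db_options, dbname, cfs, handles, dbptr,
                                trim_ts);
}
```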
## 7.1.0 (03/23/2022)
### New Features
* Allow WriteBatchWithIndex to index a WriteBatch that includes keys with user-defined timestamps. The index itself does not store timestamps.
* Add support for user-defined timestamps to write-committed transactions without API change. The `TransactionDB` layer APIs do not allow timestamps because we require that all user-defined-timestamps-aware operations go through the `Transaction` APIs.
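A rough sketch of the write-committed flow described above. `SetCommitTimestamp()` is my assumption for how the commit timestamp is supplied in this era of the `Transaction` API, so treat that call as unverified:

```cpp
#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"

using namespace ROCKSDB_NAMESPACE;

// All timestamp-aware operations go through Transaction; the
// TransactionDB layer itself takes no timestamps.
Status PutWithCommitTimestamp(TransactionDB* txn_db, ColumnFamilyHandle* cf,
                              uint64_t commit_ts) {
  Transaction* txn = txn_db->BeginTransaction(WriteOptions());
  Status s = txn->Put(cf, "key", "value");
  if (s.ok()) {
    // Assumed setter: attach the user-defined commit timestamp before
    // the write-committed Commit().
    txn->SetCommitTimestamp(commit_ts);
    s = txn->Commit();
  }
  delete txn;
  return s;
}
```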
@@ -13,10 +10,12 @@
* Experimental support for `async_io` in ReadOptions, which is used by FilePrefetchBuffer to prefetch some of the data asynchronously when reads are sequential and auto readahead is enabled internally by RocksDB.

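Enabling it is a one-line flag on `ReadOptions` (a sketch; the option is experimental and only takes effect when RocksDB's auto readahead kicks in on sequential reads):

```cpp
#include "rocksdb/options.h"

using ROCKSDB_NAMESPACE::ReadOptions;

ReadOptions MakeSequentialScanOptions() {
  ReadOptions ro;
  // Experimental: allow FilePrefetchBuffer to prefetch part of the
  // readahead window asynchronously on sequential reads.
  ro.async_io = true;
  return ro;
}
```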
### Bug Fixes
* Fixed a major performance bug in which Bloom filters generated by pre-7.0 releases are not read by early 7.0.x releases (and vice versa) due to changes to FilterPolicy::Name() in #9590. This can severely impact read performance and read I/O on upgrade or downgrade with an existing DB, but not data correctness.
* Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and threads waiting for recovery to complete (#9496).
* Fixed a bug caused by a race among flush, incoming writes, and taking snapshots. Queries on snapshots created under this race condition can return incorrect results, e.g. resurfacing deleted data.
* Fixed a bug where DB flush uses `options.compression` even when `options.compression_per_level` is set.
* Fixed a bug where DisableManualCompaction may assert when disabling an unscheduled manual compaction.
* Fix a race condition when cancelling manual compaction with `DisableManualCompaction`. Also, DB close can cancel the manual compaction thread.
* Fixed a potential timer crash when opening and closing a DB concurrently.
* Fixed a race condition for `alive_log_files_` in non-two-write-queues mode. The race is between the write_thread_ in WriteToWAL() and another thread executing `FindObsoleteFiles()`. The race condition will be caught if `__glibcxx_requires_nonempty` is enabled.
* Fixed a bug where `Iterator::Refresh()` reads stale keys after DeleteRange() is performed.
@@ -25,12 +24,11 @@
* Fixed a race condition when mmapping a WritableFile on POSIX.

### Public API changes
* Remove BlockBasedTableOptions.hash_index_allow_collision, which already has no effect.
* Added pure virtual FilterPolicy::CompatibilityName(), which is needed for fixing a major performance bug involving FilterPolicy naming in SST metadata without affecting the Customizable aspect of FilterPolicy. This change only affects those with their own custom or wrapper FilterPolicy classes.
* `options.compression_per_level` is dynamically changeable with `SetOptions()` (see the sketches after this list).
* Added `WriteOptions::rate_limiter_priority`. When set to something other than `Env::IO_TOTAL`, the internal rate limiter (`DBOptions::rate_limiter`) will be charged at the specified priority for writes associated with the API to which the `WriteOptions` was provided. Currently the support covers automatic WAL flushes, which happen during live updates (`Put()`, `Write()`, `Delete()`, etc.) when `WriteOptions::disableWAL == false` and `DBOptions::manual_wal_flush == false`.

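Two quick sketches of the options above in use. First, changing `compression_per_level` at runtime through the string-options interface; the colon-separated per-level encoding is an assumption on my part, so verify the format against the options serialization in your release:

```cpp
#include "rocksdb/db.h"

using ROCKSDB_NAMESPACE::DB;
using ROCKSDB_NAMESPACE::Status;

// Retune per-level compression on a live DB, no reopen required.
Status RetunePerLevelCompression(DB* db) {
  return db->SetOptions(
      {{"compression_per_level",
        "kNoCompression:kNoCompression:kLZ4Compression:kZSTD"}});
}
```

Second, charging a write's automatic WAL flush to the internal rate limiter via `WriteOptions::rate_limiter_priority` (`Env::IO_USER` is one of the existing priority levels):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

// The default Env::IO_TOTAL bypasses the limiter; any other priority
// charges DBOptions::rate_limiter for this write's WAL flush.
Status RateLimitedPut(DB* db, const Slice& key, const Slice& value) {
  WriteOptions wo;
  wo.rate_limiter_priority = Env::IO_USER;
  return db->Put(wo, key, value);
}
```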
### Bug Fixes
* Fix a race condition when cancelling manual compaction with `DisableManualCompaction`. Also, DB close can cancel the manual compaction thread.
* Add DB::OpenAndTrimHistory API. This API will open DB and trim data to the timestamp specified by trim_ts (data with a timestamp larger than the specified trim bound will be removed). This API should only be used during recovery of timestamp-enabled column families. If a column family doesn't have timestamps enabled, this API won't trim any data on that column family. This API is not compatible with the avoid_flush_during_recovery option.
* Remove BlockBasedTableOptions.hash_index_allow_collision, which already has no effect.

## 7.0.0 (02/20/2022)
### Bug Fixes
db/c.cc (6 changes)
@@ -3752,6 +3752,9 @@ rocksdb_filterpolicy_t* rocksdb_filterpolicy_create_bloom_format(
    const FilterPolicy* rep_;
    ~Wrapper() override { delete rep_; }
    const char* Name() const override { return rep_->Name(); }
    const char* CompatibilityName() const override {
      return rep_->CompatibilityName();
    }
    // No need to override GetFilterBitsBuilder if this one is overridden
    ROCKSDB_NAMESPACE::FilterBitsBuilder* GetBuilderWithContext(
        const ROCKSDB_NAMESPACE::FilterBuildingContext& context)
@@ -3789,6 +3792,9 @@ rocksdb_filterpolicy_t* rocksdb_filterpolicy_create_ribbon_format(
    const FilterPolicy* rep_;
    ~Wrapper() override { delete rep_; }
    const char* Name() const override { return rep_->Name(); }
    const char* CompatibilityName() const override {
      return rep_->CompatibilityName();
    }
    ROCKSDB_NAMESPACE::FilterBitsBuilder* GetBuilderWithContext(
        const ROCKSDB_NAMESPACE::FilterBuildingContext& context)
        const override {
@@ -1638,9 +1638,15 @@ class LevelAndStyleCustomFilterPolicy : public FilterPolicy {
        policy_l0_other_(NewBloomFilterPolicy(bpk_l0_other)),
        policy_otherwise_(NewBloomFilterPolicy(bpk_otherwise)) {}

  const char* Name() const override {
    return "LevelAndStyleCustomFilterPolicy";
  }

  // OK to use built-in policy name because we are deferring to a
  // built-in builder. We aren't changing the serialized format.
  const char* Name() const override { return policy_fifo_->Name(); }
  const char* CompatibilityName() const override {
    return policy_fifo_->CompatibilityName();
  }

  FilterBitsBuilder* GetBuilderWithContext(
      const FilterBuildingContext& context) const override {
@@ -66,6 +66,17 @@ void FilePrefetchBuffer::CalculateOffsetAndLen(size_t alignment,
    // chunk_len is greater than 0.
    bufs_[index].buffer_.RefitTail(static_cast<size_t>(chunk_offset_in_buffer),
                                   static_cast<size_t>(chunk_len));
  } else if (chunk_len > 0) {
    // For async prefetching, it doesn't call RefitTail with chunk_len > 0.
    // Allocate a new buffer if needed, because AlignedBuffer computes the
    // remaining buffer as capacity_ - cursize_, which might not hold here as
    // we are not refitting.
    // TODO akanksha: Update the condition when asynchronous prefetching is
    // stable.
    bufs_[index].buffer_.Alignment(alignment);
    bufs_[index].buffer_.AllocateNewBuffer(
        static_cast<size_t>(roundup_len), copy_data_to_new_buffer,
        chunk_offset_in_buffer, static_cast<size_t>(chunk_len));
  }
}

@@ -236,34 +247,47 @@ Status FilePrefetchBuffer::PrefetchAsync(const IOOptions& opts,
  // Index of second buffer.
  uint32_t second = curr_ ^ 1;

  // If data is in second buffer, make it curr_. Second buffer can be either
  // partially filled or full.
  if (bufs_[second].buffer_.CurrentSize() > 0 &&
      offset >= bufs_[second].offset_ &&
      offset <= bufs_[second].offset_ + bufs_[second].buffer_.CurrentSize()) {
    // Clear the curr_ as buffers have been swapped and curr_ contains the
    // outdated data.
  // First clear the buffers if they contain outdated data. Data can be
  // outdated because previous sequential reads were served from the cache
  // instead of these buffers.
  {
    if (bufs_[curr_].buffer_.CurrentSize() > 0 &&
        offset >= bufs_[curr_].offset_ + bufs_[curr_].buffer_.CurrentSize()) {
      bufs_[curr_].buffer_.Clear();
      // Switch the buffers.
      curr_ = curr_ ^ 1;
      second = curr_ ^ 1;
    }

    // If second buffer contains outdated data, clear it for async prefetching.
    // Data can be outdated because previous sequential reads were served from
    // the cache instead of this buffer.
    if (bufs_[second].buffer_.CurrentSize() > 0 &&
        offset >= bufs_[second].offset_ + bufs_[second].buffer_.CurrentSize()) {
      bufs_[second].buffer_.Clear();
    }
  }

  // If data is in second buffer, make it curr_. Second buffer can be either
  // partially filled or full.
  if (bufs_[second].buffer_.CurrentSize() > 0 &&
      offset >= bufs_[second].offset_ &&
      offset < bufs_[second].offset_ + bufs_[second].buffer_.CurrentSize()) {
    // Clear curr_, as the buffers have been swapped and curr_ contains the
    // outdated data, and switch the buffers.
    bufs_[curr_].buffer_.Clear();
    curr_ = curr_ ^ 1;
    second = curr_ ^ 1;
  }
  // After the swap, check if all the requested bytes are in curr_; then it
  // will go for async prefetching only.
  if (bufs_[curr_].buffer_.CurrentSize() > 0 &&
      offset + length <=
          bufs_[curr_].offset_ + bufs_[curr_].buffer_.CurrentSize()) {
    offset += length;
    length = 0;
    prefetch_size -= length;
  }
  // Data is overlapping, i.e. some of the data is in curr_ buffer and the
  // rest in second buffer.
  if (bufs_[curr_].buffer_.CurrentSize() > 0 &&
      bufs_[second].buffer_.CurrentSize() > 0 &&
      offset >= bufs_[curr_].offset_ &&
      offset < bufs_[curr_].offset_ + bufs_[curr_].buffer_.CurrentSize() &&
      offset + prefetch_size > bufs_[second].offset_) {
      offset + length > bufs_[second].offset_) {
    // Allocate new buffer to third buffer;
    bufs_[2].buffer_.Clear();
    bufs_[2].buffer_.Alignment(alignment);
@@ -273,12 +297,10 @@ Status FilePrefetchBuffer::PrefetchAsync(const IOOptions& opts,

    // Move data from curr_ buffer to third.
    CopyDataToBuffer(curr_, offset, length);

    if (length == 0) {
      // Requested data has been copied and curr_ still has unconsumed data.
      return s;
    }

    CopyDataToBuffer(second, offset, length);
    // Length == 0: All the requested data has been copied to third buffer. It
    // should go for only async prefetching.
@@ -306,6 +328,7 @@ Status FilePrefetchBuffer::PrefetchAsync(const IOOptions& opts,
  if (length > 0) {
    CalculateOffsetAndLen(alignment, offset, roundup_len1, curr_,
                          false /*refit_tail*/, chunk_len1);
    assert(roundup_len1 >= chunk_len1);
    read_len1 = static_cast<size_t>(roundup_len1 - chunk_len1);
  }
  {
@@ -316,7 +339,7 @@ Status FilePrefetchBuffer::PrefetchAsync(const IOOptions& opts,
        Roundup(rounddown_start2 + readahead_size, alignment);

    // For length == 0, do the asynchronous prefetching in second instead of
    // synchronous prefetching of remaining prefetch_size.
    // synchronous prefetching in curr_.
    if (length == 0) {
      rounddown_start2 =
          bufs_[curr_].offset_ + bufs_[curr_].buffer_.CurrentSize();
@@ -330,8 +353,8 @@ Status FilePrefetchBuffer::PrefetchAsync(const IOOptions& opts,

    // Update the buffer offset.
    bufs_[second].offset_ = rounddown_start2;
    assert(roundup_len2 >= chunk_len2);
    uint64_t read_len2 = static_cast<size_t>(roundup_len2 - chunk_len2);

    ReadAsync(opts, reader, rate_limiter_priority, read_len2, chunk_len2,
              rounddown_start2, second)
        .PermitUncheckedError();
@@ -344,7 +367,6 @@ Status FilePrefetchBuffer::PrefetchAsync(const IOOptions& opts,
      return s;
    }
  }

  // Copy remaining requested bytes to third_buffer.
  if (copy_to_third_buffer && length > 0) {
    CopyDataToBuffer(curr_, offset, length);
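The hunks above implement a two-buffer rotation: `curr_` serves the synchronous read while `second = curr_ ^ 1` is filled asynchronously, and the XOR flip promotes the prefetched buffer once the read offset passes the end of `curr_`. A standalone sketch of just that rotation logic (toy types, not RocksDB code):

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for FilePrefetchBuffer's pair of rotating buffers.
struct Buf {
  uint64_t offset = 0;     // file offset of the first byte held
  std::vector<char> data;  // currently buffered bytes
  uint64_t End() const { return offset + data.size(); }
};

class DoubleBuffer {
 public:
  // Returns the buffer that should serve a read at `offset`, rotating
  // so that the freshly prefetched buffer becomes current when the read
  // has moved past the end of the current one.
  Buf& BufferFor(uint64_t offset) {
    uint32_t second = curr_ ^ 1;
    // Drop stale data, mirroring the Clear() calls in PrefetchAsync().
    if (!bufs_[curr_].data.empty() && offset >= bufs_[curr_].End()) {
      bufs_[curr_].data.clear();
      curr_ ^= 1;  // promote the prefetched buffer
      second = curr_ ^ 1;
    }
    if (!bufs_[second].data.empty() && offset >= bufs_[second].End()) {
      bufs_[second].data.clear();
    }
    // If the requested offset lives in the second buffer, swap it in.
    if (!bufs_[second].data.empty() && offset >= bufs_[second].offset &&
        offset < bufs_[second].End()) {
      bufs_[curr_].data.clear();
      curr_ ^= 1;
    }
    return bufs_[curr_];
  }

 private:
  Buf bufs_[2];
  uint32_t curr_ = 0;
};
```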
@@ -90,6 +90,19 @@ class FilterPolicy : public Customizable {
  virtual ~FilterPolicy();
  static const char* Type() { return "FilterPolicy"; }

  // The name used for identifying whether a filter on disk is readable
  // by this FilterPolicy. If this FilterPolicy is part of a family that
  // can read each other's filters, such as the built-in BloomFilterPolicy
  // and RibbonFilterPolicy, the CompatibilityName is a shared family name,
  // while kinds of filters in the family can have distinct Customizable
  // Names. This function is pure virtual so that wrappers around built-in
  // policies are prompted to defer to the CompatibilityName() of the
  // wrapped policy, which is important for compatibility.
  //
  // For custom filter policies that are not part of a read-compatible
  // family (rare), implementations may return Name().
  virtual const char* CompatibilityName() const = 0;

  // Creates a new FilterPolicy based on the input value string and returns
  // the result. The value might be an ID, an ID with properties, or an
  // old-style policy string.
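Putting the new contract together: a custom policy that wraps a built-in one should forward CompatibilityName() so its filters stay readable by the built-in family, exactly as the C API `Wrapper` above does. A minimal sketch (the `LoggingFilterPolicy` class is hypothetical; `NewBloomFilterPolicy` and the virtuals shown exist in this header):

```cpp
#include <memory>

#include "rocksdb/filter_policy.h"

using namespace ROCKSDB_NAMESPACE;

// Hypothetical wrapper that could observe filter construction but
// leaves the serialized filter format untouched.
class LoggingFilterPolicy : public FilterPolicy {
 public:
  LoggingFilterPolicy() : inner_(NewBloomFilterPolicy(10.0)) {}

  // Distinct Customizable name for configuration and introspection.
  const char* Name() const override { return "LoggingFilterPolicy"; }

  // Defer to the wrapped policy so filters written under this policy
  // remain readable as ordinary built-in Bloom filters (and vice versa).
  const char* CompatibilityName() const override {
    return inner_->CompatibilityName();
  }

  FilterBitsBuilder* GetBuilderWithContext(
      const FilterBuildingContext& context) const override {
    // Could log `context` here; building is fully delegated.
    return inner_->GetBuilderWithContext(context);
  }

  FilterBitsReader* GetFilterBitsReader(const Slice& contents) const override {
    return inner_->GetFilterBitsReader(contents);
  }

 private:
  std::shared_ptr<const FilterPolicy> inner_;
};
```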
@@ -1487,6 +1487,7 @@ class MockFilterPolicy : public FilterPolicy {
 public:
  static const char* kClassName() { return "MockFilterPolicy"; }
  const char* Name() const override { return kClassName(); }
  const char* CompatibilityName() const override { return Name(); }
  FilterBitsBuilder* GetBuilderWithContext(
      const FilterBuildingContext&) const override {
    return nullptr;
@@ -1605,7 +1605,7 @@ void BlockBasedTableBuilder::WriteFilterBlock(
            ? BlockBasedTable::kPartitionedFilterBlockPrefix
            : BlockBasedTable::kFullFilterBlockPrefix;
    }
    key.append(rep_->table_options.filter_policy->Name());
    key.append(rep_->table_options.filter_policy->CompatibilityName());
    meta_index_builder->Add(key, filter_block_handle);
  }
}
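This one-line change is the writer side of the fix: the metaindex key for the filter block is now `prefix + CompatibilityName()` rather than `prefix + Name()`, so every member of the built-in family writes the same key. Schematically (the string literals are illustrative, not the exact prefix constants):

```cpp
#include <string>

// Before: key depended on the concrete policy's Name(), which changed
// in 7.0 and broke filter lookup across versions:
//   "fullfilter." + policy->Name()
// After: key uses the shared family name, stable across releases:
//   "fullfilter." + policy->CompatibilityName()
std::string FilterMetaKey(const std::string& prefix,
                          const std::string& compatibility_name) {
  return prefix + compatibility_name;
}
```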
@@ -12,6 +12,7 @@
#include <array>
#include <limits>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

@@ -50,6 +51,7 @@
#include "table/block_based/block_prefix_index.h"
#include "table/block_based/block_type.h"
#include "table/block_based/filter_block.h"
#include "table/block_based/filter_policy_internal.h"
#include "table/block_based/full_filter_block.h"
#include "table/block_based/hash_index_reader.h"
#include "table/block_based/partitioned_filter_block.h"
@@ -897,29 +899,54 @@ Status BlockBasedTable::PrefetchIndexAndFilterBlocks(
    const BlockBasedTableOptions& table_options, const int level,
    size_t file_size, size_t max_file_size_for_l0_meta_pin,
    BlockCacheLookupContext* lookup_context) {
  Status s;

  // Find filter handle and filter type
  if (rep_->filter_policy) {
    for (auto filter_type :
         {Rep::FilterType::kFullFilter, Rep::FilterType::kPartitionedFilter,
          Rep::FilterType::kBlockFilter}) {
      std::string prefix;
      switch (filter_type) {
        case Rep::FilterType::kFullFilter:
          prefix = kFullFilterBlockPrefix;
    auto name = rep_->filter_policy->CompatibilityName();
    bool builtin_compatible =
        strcmp(name, BuiltinFilterPolicy::kCompatibilityName()) == 0;

    for (const auto& [filter_type, prefix] :
         {std::make_pair(Rep::FilterType::kFullFilter, kFullFilterBlockPrefix),
          std::make_pair(Rep::FilterType::kPartitionedFilter,
                         kPartitionedFilterBlockPrefix),
          std::make_pair(Rep::FilterType::kBlockFilter, kFilterBlockPrefix)}) {
      if (builtin_compatible) {
        // This code is only here to deal with a hiccup in early 7.0.x where
        // there was an unintentional name change in the SST files metadata.
        // It should be OK to remove this in the future (late 2022) and just
        // have the 'else' code.
        // NOTE: the test:: names below are likely not needed but included
        // out of caution
        static const std::unordered_set<std::string> kBuiltinNameAndAliases = {
            BuiltinFilterPolicy::kCompatibilityName(),
            test::LegacyBloomFilterPolicy::kClassName(),
            test::FastLocalBloomFilterPolicy::kClassName(),
            test::Standard128RibbonFilterPolicy::kClassName(),
            DeprecatedBlockBasedBloomFilterPolicy::kClassName(),
            BloomFilterPolicy::kClassName(),
            RibbonFilterPolicy::kClassName(),
        };

        // For efficiency, do a prefix seek and see if the first match is
        // good.
        meta_iter->Seek(prefix);
        if (meta_iter->status().ok() && meta_iter->Valid()) {
          Slice key = meta_iter->key();
          if (key.starts_with(prefix)) {
            key.remove_prefix(prefix.size());
            if (kBuiltinNameAndAliases.find(key.ToString()) !=
                kBuiltinNameAndAliases.end()) {
              Slice v = meta_iter->value();
              Status s = rep_->filter_handle.DecodeFrom(&v);
              if (s.ok()) {
                rep_->filter_type = filter_type;
                break;
        case Rep::FilterType::kPartitionedFilter:
          prefix = kPartitionedFilterBlockPrefix;
          break;
        case Rep::FilterType::kBlockFilter:
          prefix = kFilterBlockPrefix;
          break;
        default:
          assert(0);
      }
      std::string filter_block_key = prefix;
      filter_block_key.append(rep_->filter_policy->Name());
              }
            }
          }
        }
      } else {
        std::string filter_block_key = prefix + name;
        if (FindMetaBlock(meta_iter, filter_block_key, &rep_->filter_handle)
                .ok()) {
          rep_->filter_type = filter_type;
@@ -927,12 +954,13 @@ Status BlockBasedTable::PrefetchIndexAndFilterBlocks(
        }
      }
    }
  }
  // Partition filters cannot be enabled without partition indexes
  assert(rep_->filter_type != Rep::FilterType::kPartitionedFilter ||
         rep_->index_type == BlockBasedTableOptions::kTwoLevelIndexSearch);

  // Find compression dictionary handle
  s = FindOptionalMetaBlock(meta_iter, kCompressionDictBlockName,
  Status s = FindOptionalMetaBlock(meta_iter, kCompressionDictBlockName,
                                   &rep_->compression_dict_handle);
  if (!s.ok()) {
    return s;
@@ -1325,6 +1325,16 @@ bool BuiltinFilterPolicy::IsInstanceOf(const std::string& name) const {
  }
}

static const char* kBuiltinFilterMetadataName = "rocksdb.BuiltinBloomFilter";

const char* BuiltinFilterPolicy::kCompatibilityName() {
  return kBuiltinFilterMetadataName;
}

const char* BuiltinFilterPolicy::CompatibilityName() const {
  return kBuiltinFilterMetadataName;
}

BloomLikeFilterPolicy::BloomLikeFilterPolicy(double bits_per_key)
    : warned_(false), aggregate_rounding_balance_(0) {
  // Sanitize bits_per_key
@@ -1372,7 +1382,7 @@ bool BloomLikeFilterPolicy::IsInstanceOf(const std::string& name) const {
}

const char* ReadOnlyBuiltinFilterPolicy::kClassName() {
  return "rocksdb.BuiltinBloomFilter";
  return kBuiltinFilterMetadataName;
}

const char* DeprecatedBlockBasedBloomFilterPolicy::kClassName() {
@@ -135,6 +135,9 @@ class BuiltinFilterPolicy : public FilterPolicy {
  FilterBitsReader* GetFilterBitsReader(const Slice& contents) const override;
  static const char* kClassName();
  bool IsInstanceOf(const std::string& id) const override;
  // All variants of BuiltinFilterPolicy can read each other's filters.
  const char* CompatibilityName() const override;
  static const char* kCompatibilityName();

 public:  // new
  // An internal function for the implementation of
@@ -84,6 +84,7 @@ class TestFilterBitsReader : public FilterBitsReader {
class TestHashFilter : public FilterPolicy {
 public:
  const char* Name() const override { return "TestHashFilter"; }
  const char* CompatibilityName() const override { return Name(); }

  FilterBitsBuilder* GetBuilderWithContext(
      const FilterBuildingContext&) const override {