Compare commits

...

20 Commits

Author SHA1 Message Date
Yanqin Jin
8ec8d1275b Fix the logic of setting read_amp_bytes_per_bit from OPTIONS file (#7680)
Summary:
Instead of using `EncodeFixed32` which always serialize a integer to
little endian, we should use the local machine's endianness when
populating a native data structure during options parsing.
Without this fix, `read_amp_bytes_per_bit` may be populated incorrectly
on big-endian machines.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7680

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D24999166

Pulled By: riversand963

fbshipit-source-id: dc603cff6e17f8fa32479ce6df93b93082e6b0c4
2020-11-17 10:27:16 -08:00
Yanqin Jin
a1f37aa701 Bump version and update HISTORY.md 2020-11-15 15:13:05 -08:00
Yanqin Jin
98174bc3a3 Hack to load OPTIONS file for read_amp_bytes_per_bit (#7659)
Summary:
A temporary hack to work around a bug in 6.10, 6.11, 6.12, 6.13 and
6.14. The bug will write out 8 bytes to OPTIONS file from the starting
address of BlockBasedTableOptions.read_amp_bytes_per_bit which is
actually a uint32. Consequently, the value of read_amp_bytes_per_bit
written in the OPTIONS file is wrong. From 6.15, RocksDB will
try to parse the read_amp_bytes_per_bit from OPTIONS file as a uint32.
To be able to load OPTIONS file generated by affected releases before
the fix, we need to manually parse read_amp_bytes_per_bit with this hack.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7659

Test Plan:
Generate a db with current 6.14.fb (head at b6db05dbb5). Maybe use db_stress.

Checkout this PR, run
```
 ~/rocksdb/ldb --db=. --try_load_options --ignore_unknown_options idump --count_only
```
Expect success, and should not see
```
Failed: Invalid argument: Error parsing read_amp_bytes_per_bit:17179869184
```

Also
make check

Reviewed By: anand1976

Differential Revision: D24954752

Pulled By: riversand963

fbshipit-source-id: c7b802fc3e52acd050a4fc1cd475016122234394
2020-11-15 15:10:54 -08:00
Huisheng Liu
e82e5e0e63 fix read_amp_bytes_per_bit field size (#7651)
Summary:
The field in BlockBasedTableOptions is 4 bytes:
  // Default: 0 (disabled)
  uint32_t read_amp_bytes_per_bit = 0;

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7651

Reviewed By: ltamasi

Differential Revision: D24844994

Pulled By: riversand963

fbshipit-source-id: e2695e55532256ef8996dd6939cad06987a80293
2020-11-15 15:08:05 -08:00
Andrew Kryczka
f3e33549c1 Update HISTORY.md and version.h for 6.12.7 2020-10-14 10:55:43 -07:00
Amir Sanjar
bfdb0a7651 Fix a build issue with RocksJava on ppc64le that was introduced in https://github.com/facebook/rocksdb/pull/6660
This was fixed in https://github.com/facebook/rocksdb/pull/7359 and this commit is just a partial back-port of the necessary parts of that PR.
2020-10-14 10:53:27 -07:00
Andrew Kryczka
7885d8f9bd Update HISTORY.md and version.h for 6.12.6 2020-10-13 09:30:27 -07:00
Andrew Kryczka
56fee9f1ae Fix bug when DeleteRange() used with paranoid_file_checks == true
Backports part of 7508175558.

The unit test comes from https://github.com/facebook/rocksdb/pull/7521.
2020-10-13 09:28:12 -07:00
Andrew Kryczka
290e990c84 Update HISTORY.md and version.h for 6.12.5 2020-10-12 13:49:18 -07:00
Andrew Kryczka
24bb466f14 Fix bug in pinned partitioned indexes with some reads bypassing block cache
Backports part of 75d3b6fdf0.
2020-10-12 13:45:33 -07:00
Andrew Kryczka
1fdf49c10e Fix bug in pinned partitioned user key indexes
Backports part of 75d3b6fdf0.
2020-10-12 13:43:52 -07:00
Andrew Kryczka
ceb7ae16a4 Add missing release note 2020-10-12 13:43:16 -07:00
Peter Dillinger
1b963a999c Update HISTORY.md and version.h for 6.12.4 2020-09-18 08:20:39 -07:00
Peter Dillinger
8a6c925ca7 Restore file size in backup table file names (and other cleanup)
Summary: Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.

This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.

We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).

We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.

Finally, recognizing that the backup file names have become a de facto
part of the backup meta schema, this change makes them easier to parse
and extend by putting a distinct marker, 's', before DB session ids
embedded in the name. When we extend this to allow custom checksums in
the name, they can get their own marker to ensure safe parsing. For
backward compatibility, file size does not get a marker but is assumed
for _[0-9]+[.]

Test Plan: unit tests included. Sync point callbacks are used to mimic
previous version SST files.
2020-09-17 00:22:30 -07:00
Andrew Kryczka
fad041f210 update HISTORY.md and version.h for 6.12.3 2020-09-16 09:39:59 -07:00
Cheng Chang
972d137366 Fix wrong level args (#7346)
Summary:
The level args should be output level instead of input levels.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7346

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D23506373

Pulled By: cheng-chang

fbshipit-source-id: b2f701d44c13581c5c10c4dbebded4fcd354d641
2020-09-16 09:31:03 -07:00
Levi Tamasi
e6e52a7418 Update version.h and HISTORY.md for 6.12.2 2020-09-14 16:15:46 -07:00
Levi Tamasi
b7cc96d7d1 Expose the start of the expiration range for TTL blob files through LiveFileMetaData (#7365)
Summary:
The patch adds support for exposing the start of the expiration range
for TTL blob files through the `GetLiveFilesMetaData` API. This can be
used for monitoring purposes, i.e. to make sure TTL blob files are
deleted in a timely manner. The patch also fixes a couple of uninitialized
variable issues.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7365

Test Plan: `make check`

Reviewed By: pdillinger

Differential Revision: D23605465

Pulled By: ltamasi

fbshipit-source-id: 97a9612bf5f4b058423debdd3f28f576bb23a70f
2020-09-14 16:11:01 -07:00
Peter Dillinger
7dc1c55490 Update for 6.12.1 patch release 2020-08-20 16:52:20 -07:00
Peter Dillinger
25bddfa632 Work around a backup bug with DB custom checksums
Summary: On a read-write DB configured with
DBOptions::file_checksum_gen_factory, BackupEngine::CreateNewBackup can
fail intermittently, with non-OK status. This is due to a race between
GetLiveFiles and GetLiveFilesChecksumInfo in creating backups.

For patching 6.12 release (as this commit is intended for), we can simply
treat files for which we falsely failed to get checksum info as legacy
files lacking checksum info.

Test Plan: unit test reproducer included
2020-08-20 15:24:11 -07:00
22 changed files with 624 additions and 141 deletions

View File

@ -1,5 +1,38 @@
# Rocksdb Change Log
## Unreleased
## 6.12.8 (2020-11-15)
### Bug Fixes
* Fix a bug of encoding and parsing BlockBasedTableOptions::read_amp_bytes_per_bit as a 64-bit integer.
* Fixed the logic of populating native data structure for `read_amp_bytes_per_bit` during OPTIONS file parsing on big-endian architecture. Without this fix, original code introduced in PR7659, when running on big-endian machine, can mistakenly store read_amp_bytes_per_bit (an uint32) in little endian format. Future access to `read_amp_bytes_per_bit` will give wrong values. Little endian architecture is not affected.
## 6.12.7 (2020-10-14)
### Other
Fix build issue to enable RocksJava release for ppc64le
## 6.12.6 (2020-10-13)
### Bug Fixes
* Fix false positive flush/compaction `Status::Corruption` failure when `paranoid_file_checks == true` and range tombstones were written to the compaction output files.
## 6.12.5 (2020-10-12)
### Bug Fixes
* Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
* Fixed a bug in the following combination of features: indexes with user keys (`format_version >= 3`), indexes are partitioned (`index_type == kTwoLevelIndexSearch`), and some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
* Fixed a bug when indexes are partitioned (`index_type == kTwoLevelIndexSearch`), some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`), and partitions reads could be mixed between block cache and directly from the file (e.g., with `enable_index_compression == 1` and `mmap_read == 1`, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads leading to wrong read results.
## 6.12.4 (2020-09-18)
### Public API Change
* Reworked `BackupableDBOptions::share_files_with_checksum_naming` (new in 6.12) with some minor improvements and to better support those who were extracting files sizes from backup file names.
## 6.12.3 (2020-09-16)
### Bug fixes
* Fixed a bug in size-amp-triggered and periodic-triggered universal compaction, where the compression settings for the first input level were used rather than the compression settings for the output (bottom) level.
## 6.12.2 (2020-09-14)
### Public API Change
* BlobDB now exposes the start of the expiration range of TTL blob files via the `GetLiveFilesMetaData` API.
## 6.12.1 (2020-08-20)
### Bug fixes
* BackupEngine::CreateNewBackup could fail intermittently with non-OK status when backing up a read-write DB configured with a DBOptions::file_checksum_gen_factory. This issue has been worked-around such that CreateNewBackup should succeed, but (until fully fixed) BackupEngine might not see all checksums available in the DB.
## 6.12 (2020-07-28)
### Public API Change
@ -30,7 +63,7 @@
### New Features
* DB identity (`db_id`) and DB session identity (`db_session_id`) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. The session ID for SstFileWriter (resp., Repairer) resets every time `SstFileWriter::Open` (resp., `Repairer::Run`) is called.
* Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
* `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is added, where `BackupTableNameOption` is an `enum` type with two enumerators `kChecksumAndFileSize` and `kOptionalChecksumAndDbSessionId`. By default, `BackupableDBOptions::share_files_with_checksum_naming` is set to `kOptionalChecksumAndDbSessionId`. In the default case, backup table filenames generated by this version of RocksDB are of the form either `<file_number>_<crc32c>_<db_session_id>.sst` or `<file_number>_<db_session_id>.sst` as opposed to `<file_number>_<crc32c>_<file_size>.sst`. Specifically, table filenames are of the form `<file_number>_<crc32c>_<db_session_id>.sst` if `DBOptions::file_checksum_gen_factory` is set to `GetFileChecksumGenCrc32cFactory()`. Futhermore, the checksum value `<crc32c>` appeared in the filenames is hexadecimal-encoded, instead of being decimal-encoded `uint32_t` value. If `DBOptions::file_checksum_gen_factory` is `nullptr`, the table filenames are of the form `<file_number>_<db_session_id>.sst`. The new default behavior fixes the backup file name collision problem, which might be possible at large scale, but the option `kChecksumAndFileSize` is added to allow use of old naming in case it is needed. Moreover, for table files generated prior to this version of RocksDB, using `kOptionalChecksumAndDbSessionId` will fall back on `kChecksumAndFileSize`. In these cases, the checksum value `<crc32c>` in the filenames `<file_number>_<crc32c>_<file_size>.sst` is decimal-encoded `uint32_t` value as before. This default behavior change is not an upgrade issue, because previous versions of RocksDB can read, restore, and delete backups using new names, and it's OK for a backup directory to use a mixture of table file naming schemes. Note that `share_files_with_checksum_naming` comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.
* `BackupableDBOptions::share_files_with_checksum_naming` is added with new default behavior for naming backup files with `share_files_with_checksum`, to address performance and backup integrity issues. See API comments for details.
* Added auto resume function to automatically recover the DB from background Retryable IO Error. When retryable IOError happens during flush and WAL write, the error is mapped to Hard Error and DB will be in read mode. When retryable IO Error happens during compaction, the error will be mapped to Soft Error. DB is still in write/read mode. Autoresume function will create a thread for a DB to call DB->ResumeImpl() to try the recover for Retryable IO Error during flush and WAL write. Compaction will be rescheduled by itself if retryable IO Error happens. Auto resume may also cause other Retryable IO Error during the recovery, so the recovery will fail. Retry the auto resume may solve the issue, so we use max_bgerror_resume_count to decide how many resume cycles will be tried in total. If it is <=0, auto resume retryable IO Error is disabled. Default is INT_MAX, which will lead to a infinit auto resume. bgerror_resume_retry_interval decides the time interval between two auto resumes.
* Option `max_subcompactions` can be set dynamically using DB::SetDBOptions().
* Added experimental ColumnFamilyOptions::sst_partitioner_factory to define determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).
@ -38,7 +71,7 @@
### Performance Improvements
* Eliminate key copies for internal comparisons while accessing ingested block-based tables.
* Reduce key comparisons during random access in all block-based tables.
* BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the `shared_checksum` directory when using `kOptionalChecksumAndDbSessionId`, except on SST files generated before this version of RocksDB, which fall back on using `kChecksumAndFileSize`.
* BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the `shared_checksum` directory when using `share_files_with_checksum_naming = kUseDbSessionId` (new default), except on SST files generated before this version of RocksDB, which fall back on using `kLegacyCrc32cAndFileSize`.
## 6.11 (6/12/2020)
### Bug Fixes

View File

@ -2091,13 +2091,6 @@ rocksdbjavastaticpublishcentral:
jl/%.o: %.cc
$(AM_V_CC)mkdir -p $(@D) && $(CXX) $(CXXFLAGS) -fPIC -c $< -o $@ $(COVERAGEFLAGS)
jl/crc32c_ppc.o: util/crc32c_ppc.c
$(AM_V_CC)$(CC) $(CFLAGS) -c $< -o $@
jl/crc32c_ppc_asm.o: util/crc32c_ppc_asm.S
$(AM_V_CC)$(CC) $(CFLAGS) -c $< -o $@
rocksdbjava: $(LIB_OBJECTS)
$(AM_V_GEN)cd java;$(MAKE) javalib;
$(AM_V_at)rm -f ./java/target/$(ROCKSDBJNILIB)
@ -2159,7 +2152,7 @@ ifeq ($(HAVE_POWER8),1)
$(OBJ_DIR)/util/crc32c_ppc.o: util/crc32c_ppc.c
$(AM_V_CC)$(CC) $(CFLAGS) -c $< -o $@
+$(OBJ_DIR)/util/crc32c_ppc_asm.o: util/crc32c_ppc_asm.S
$(OBJ_DIR)/util/crc32c_ppc_asm.o: util/crc32c_ppc_asm.S
$(AM_V_CC)$(CC) $(CFLAGS) -c $< -o $@
endif
$(OBJ_DIR)/%.o: %.cc
@ -2205,7 +2198,7 @@ $(OBJ_DIR)/%.c.d: %.c
@$(CXX) $(CXXFLAGS) $(PLATFORM_SHARED_CFLAGS) \
-MM -MT'$@' -MT'$(<:.c=.o)' "$<" -o '$@'
+$(OBJ_DIR)/%.S.d: %.S
$(OBJ_DIR)/%.S.d: %.S
@$(CXX) $(CXXFLAGS) $(PLATFORM_SHARED_CFLAGS) \
-MM -MT'$@' -MT'$(<:.S=.o)' "$<" -o '$@'

View File

@ -1288,8 +1288,8 @@ Status CompactionJob::FinishCompactionOutputFile(
auto kv = tombstone.Serialize();
assert(lower_bound == nullptr ||
ucmp->Compare(*lower_bound, kv.second) < 0);
sub_compact->AddToBuilder(kv.first.Encode(), kv.second,
paranoid_file_checks_);
// Range tombstone is not supported by output validator yet.
sub_compact->builder->Add(kv.first.Encode(), kv.second);
InternalKey smallest_candidate = std::move(kv.first);
if (lower_bound != nullptr &&
ucmp->Compare(smallest_candidate.user_key(), *lower_bound) <= 0) {

View File

@ -1019,9 +1019,9 @@ Compaction* UniversalCompactionBuilder::PickCompactionToOldest(
MaxFileSizeForLevel(mutable_cf_options_, output_level,
kCompactionStyleUniversal),
LLONG_MAX, path_id,
GetCompressionType(ioptions_, vstorage_, mutable_cf_options_, start_level,
1, true /* enable_compression */),
GetCompressionOptions(mutable_cf_options_, vstorage_, start_level,
GetCompressionType(ioptions_, vstorage_, mutable_cf_options_,
output_level, 1, true /* enable_compression */),
GetCompressionOptions(mutable_cf_options_, vstorage_, output_level,
true /* enable_compression */),
/* max_subcompactions */ 0, /* grandparents */ {}, /* is manual */ false,
score_, false /* deletion_compaction */, compaction_reason);

View File

@ -104,7 +104,7 @@ class CorruptionTest : public testing::Test {
ASSERT_OK(::ROCKSDB_NAMESPACE::RepairDB(dbname_, options_));
}
void Build(int n, int flush_every = 0) {
void Build(int n, int start, int flush_every) {
std::string key_space, value_space;
WriteBatch batch;
for (int i = 0; i < n; i++) {
@ -113,13 +113,15 @@ class CorruptionTest : public testing::Test {
dbi->TEST_FlushMemTable();
}
//if ((i % 100) == 0) fprintf(stderr, "@ %d of %d\n", i, n);
Slice key = Key(i, &key_space);
Slice key = Key(i + start, &key_space);
batch.Clear();
batch.Put(key, Value(i, &value_space));
ASSERT_OK(db_->Write(WriteOptions(), &batch));
}
}
void Build(int n, int flush_every = 0) { Build(n, 0, flush_every); }
void Check(int min_expected, int max_expected) {
uint64_t next_expected = 0;
uint64_t missed = 0;
@ -614,6 +616,102 @@ TEST_F(CorruptionTest, ParaniodFileChecksOnCompact) {
}
}
TEST_F(CorruptionTest, ParanoidFileChecksWithDeleteRangeFirst) {
Options options;
options.paranoid_file_checks = true;
options.create_if_missing = true;
for (bool do_flush : {true, false}) {
delete db_;
db_ = nullptr;
ASSERT_OK(DestroyDB(dbname_, options));
ASSERT_OK(DB::Open(options, dbname_, &db_));
std::string start, end;
assert(db_ != nullptr);
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(3, &start), Key(7, &end)));
auto snap = db_->GetSnapshot();
ASSERT_NE(snap, nullptr);
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(8, &start), Key(9, &end)));
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(2, &start), Key(5, &end)));
Build(10);
if (do_flush) {
ASSERT_OK(db_->Flush(FlushOptions()));
} else {
DBImpl* dbi = static_cast_with_check<DBImpl>(db_);
ASSERT_OK(dbi->TEST_FlushMemTable());
ASSERT_OK(dbi->TEST_CompactRange(0, nullptr, nullptr, nullptr, true));
}
db_->ReleaseSnapshot(snap);
}
}
TEST_F(CorruptionTest, ParanoidFileChecksWithDeleteRange) {
Options options;
options.paranoid_file_checks = true;
options.create_if_missing = true;
for (bool do_flush : {true, false}) {
delete db_;
db_ = nullptr;
ASSERT_OK(DestroyDB(dbname_, options));
ASSERT_OK(DB::Open(options, dbname_, &db_));
assert(db_ != nullptr);
Build(10, 0, 0);
std::string start, end;
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(5, &start), Key(15, &end)));
auto snap = db_->GetSnapshot();
ASSERT_NE(snap, nullptr);
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(8, &start), Key(9, &end)));
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(12, &start), Key(17, &end)));
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(2, &start), Key(4, &end)));
Build(10, 10, 0);
if (do_flush) {
ASSERT_OK(db_->Flush(FlushOptions()));
} else {
DBImpl* dbi = static_cast_with_check<DBImpl>(db_);
ASSERT_OK(dbi->TEST_FlushMemTable());
ASSERT_OK(dbi->TEST_CompactRange(0, nullptr, nullptr, nullptr, true));
}
db_->ReleaseSnapshot(snap);
}
}
TEST_F(CorruptionTest, ParanoidFileChecksWithDeleteRangeLast) {
Options options;
options.paranoid_file_checks = true;
options.create_if_missing = true;
for (bool do_flush : {true, false}) {
delete db_;
db_ = nullptr;
ASSERT_OK(DestroyDB(dbname_, options));
ASSERT_OK(DB::Open(options, dbname_, &db_));
assert(db_ != nullptr);
std::string start, end;
Build(10);
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(3, &start), Key(7, &end)));
auto snap = db_->GetSnapshot();
ASSERT_NE(snap, nullptr);
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(6, &start), Key(8, &end)));
ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(),
Key(2, &start), Key(5, &end)));
if (do_flush) {
ASSERT_OK(db_->Flush(FlushOptions()));
} else {
DBImpl* dbi = static_cast_with_check<DBImpl>(db_);
ASSERT_OK(dbi->TEST_FlushMemTable());
ASSERT_OK(dbi->TEST_CompactRange(0, nullptr, nullptr, nullptr, true));
}
db_->ReleaseSnapshot(snap);
}
}
} // namespace ROCKSDB_NAMESPACE
int main(int argc, char** argv) {

View File

@ -8,6 +8,7 @@
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include <cstring>
#include <regex>
#include "db/db_test_util.h"
#include "port/stack_trace.h"
@ -62,6 +63,13 @@ TEST_F(DBBasicTest, UniqueSession) {
ASSERT_EQ(sid2, sid4);
// Expected compact format for session ids (see notes in implementation)
std::regex expected("[0-9A-Z]{20}");
const std::string match("match");
EXPECT_EQ(match, std::regex_replace(sid1, expected, match));
EXPECT_EQ(match, std::regex_replace(sid2, expected, match));
EXPECT_EQ(match, std::regex_replace(sid3, expected, match));
#ifndef ROCKSDB_LITE
Close();
ASSERT_OK(ReadOnlyReopen(options));

View File

@ -3691,13 +3691,29 @@ Status DBImpl::GetDbSessionId(std::string& session_id) const {
}
void DBImpl::SetDbSessionId() {
// GenerateUniqueId() generates an identifier
// that has a negligible probability of being duplicated
db_session_id_ = env_->GenerateUniqueId();
// Remove the extra '\n' at the end if there is one
if (!db_session_id_.empty() && db_session_id_.back() == '\n') {
db_session_id_.pop_back();
// GenerateUniqueId() generates an identifier that has a negligible
// probability of being duplicated, ~128 bits of entropy
std::string uuid = env_->GenerateUniqueId();
// Hash and reformat that down to a more compact format, 20 characters
// in base-36 ([0-9A-Z]), which is ~103 bits of entropy, which is enough
// to expect no collisions across a billion servers each opening DBs
// a million times (~2^50). Benefits vs. raw unique id:
// * Save ~ dozen bytes per SST file
// * Shorter shared backup file names (some platforms have low limits)
// * Visually distinct from DB id format
uint64_t a = NPHash64(uuid.data(), uuid.size(), 1234U);
uint64_t b = NPHash64(uuid.data(), uuid.size(), 5678U);
db_session_id_.resize(20);
static const char* const base36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
size_t i = 0;
for (; i < 10U; ++i, a /= 36U) {
db_session_id_[i] = base36[a % 36];
}
for (; i < 20U; ++i, b /= 36U) {
db_session_id_[i] = base36[b % 36];
}
TEST_SYNC_POINT_CALLBACK("DBImpl::SetDbSessionId", &db_session_id_);
}
// Default implementation -- returns not supported status

View File

@ -62,7 +62,9 @@ struct SstFileMetaData {
being_compacted(false),
num_entries(0),
num_deletions(0),
oldest_blob_file_number(0) {}
oldest_blob_file_number(0),
oldest_ancester_time(0),
file_creation_time(0) {}
SstFileMetaData(const std::string& _file_name, uint64_t _file_number,
const std::string& _path, size_t _size,
@ -117,6 +119,8 @@ struct SstFileMetaData {
// oldest SST file that is the compaction ancester of this file.
// The timestamp is provided Env::GetCurrentTime().
// 0 if the information is not available.
//
// Note: for TTL blob files, it contains the start of the expiration range.
uint64_t oldest_ancester_time;
// Timestamp when the SST file is created, provided by Env::GetCurrentTime().
// 0 if the information is not available.

View File

@ -29,24 +29,6 @@ constexpr char kDbFileChecksumFuncName[] = "FileChecksumCrc32c";
// The default BackupEngine file checksum function name.
constexpr char kBackupFileChecksumFuncName[] = "crc32c";
// BackupTableNameOption describes possible naming schemes for backup
// table file names when the table files are stored in the shared_checksum
// directory (i.e., both share_table_files and share_files_with_checksum
// are true).
enum BackupTableNameOption : unsigned char {
// Backup SST filenames are <file_number>_<crc32c>_<file_size>.sst
// where <crc32c> is uint32_t.
kChecksumAndFileSize = 0,
// Backup SST filenames are <file_number>_<crc32c>_<db_session_id>.sst
// where <crc32c> is hexidecimally encoded.
// When DBOptions::file_checksum_gen_factory is not set to
// GetFileChecksumGenCrc32cFactory(), the filenames will be
// <file_number>_<db_session_id>.sst
// When there are no db session ids available in the table file, this
// option will use kChecksumAndFileSize as a fallback.
kOptionalChecksumAndDbSessionId = 1
};
struct BackupableDBOptions {
// Where to keep the backup files. Has to be different than dbname_
// Best to set this to dbname_ + "/backups"
@ -110,17 +92,11 @@ struct BackupableDBOptions {
// Default: nullptr
std::shared_ptr<RateLimiter> restore_rate_limiter{nullptr};
// Only used if share_table_files is set to true. If true, will consider that
// backups can come from different databases, hence an sst is not uniquely
// identifed by its name, but by the triple
// (file name, crc32c, db session id or file length)
//
// Note: If this option is set to true, we recommend setting
// share_files_with_checksum_naming to kOptionalChecksumAndDbSessionId, which
// is also our default option. Otherwise, there is a non-negligible chance of
// filename collision when sharing tables in shared_checksum among several
// DBs.
// *turn it on only if you know what you're doing*
// Only used if share_table_files is set to true. If true, will consider
// that backups can come from different databases, even differently mutated
// databases with the same DB ID. See share_files_with_checksum_naming and
// ShareFilesNaming for details on how table files names are made
// unique between databases.
//
// Default: false
bool share_files_with_checksum;
@ -146,24 +122,79 @@ struct BackupableDBOptions {
// Default: INT_MAX
int max_valid_backups_to_open;
// Naming option for share_files_with_checksum table files. This option
// can be set to kChecksumAndFileSize or kOptionalChecksumAndDbSessionId.
// kChecksumAndFileSize is susceptible to collision as file size is not a
// good source of entroy.
// kOptionalChecksumAndDbSessionId is immune to collision.
// ShareFilesNaming describes possible naming schemes for backup
// table file names when the table files are stored in the shared_checksum
// directory (i.e., both share_table_files and share_files_with_checksum
// are true).
enum ShareFilesNaming : int {
// Backup SST filenames are <file_number>_<crc32c>_<file_size>.sst
// where <crc32c> is an unsigned decimal integer. This is the
// original/legacy naming scheme for share_files_with_checksum,
// with two problems:
// * At massive scale, collisions on this triple with different file
// contents is plausible.
// * Determining the name to use requires computing the checksum,
// so generally requires reading the whole file even if the file
// is already backed up.
// ** ONLY RECOMMENDED FOR PRESERVING OLD BEHAVIOR **
kLegacyCrc32cAndFileSize = 1,
// Backup SST filenames are <file_number>_s<db_session_id>.sst. This
// pair of values should be very strongly unique for a given SST file
// and easily determined before computing a checksum. The 's' indicates
// the value is a DB session id, not a checksum.
//
// Exceptions:
// * For old SST files without a DB session id, kLegacyCrc32cAndFileSize
// will be used instead, matching the names assigned by RocksDB versions
// not supporting the newer naming scheme.
// * See also flags below.
kUseDbSessionId = 2,
kMaskNoNamingFlags = 0xffff,
// If not already part of the naming scheme, insert
// _<file_size>
// before .sst in the name. In case of user code actually parsing the
// last _<whatever> before the .sst as the file size, this preserves that
// feature of kLegacyCrc32cAndFileSize. In other words, this option makes
// official that unofficial feature of the backup metadata.
//
// We do not consider SST file sizes to have sufficient entropy to
// contribute significantly to naming uniqueness.
kFlagIncludeFileSize = 1 << 31,
// When encountering an SST file from a Facebook-internal early
// release of 6.12, use the default naming scheme in effect for
// when the SST file was generated (assuming full file checksum
// was not set to GetFileChecksumGenCrc32cFactory()). That naming is
// <file_number>_<db_session_id>.sst
// and ignores kFlagIncludeFileSize setting.
// NOTE: This flag is intended to be temporary and should be removed
// in a later release.
kFlagMatchInterimNaming = 1 << 30,
kMaskNamingFlags = ~kMaskNoNamingFlags,
};
// Naming option for share_files_with_checksum table files. See
// ShareFilesNaming for details.
//
// Modifying this option cannot introduce a downgrade compatibility issue
// because RocksDB can read, restore, and delete backups using different file
// names, and it's OK for a backup directory to use a mixture of table file
// naming schemes.
//
// Default: kOptionalChecksumAndDbSessionId
// However, modifying this option and saving more backups to the same
// directory can lead to the same file getting saved again to that
// directory, under the new shared name in addition to the old shared
// name.
//
// Default: kUseDbSessionId | kFlagIncludeFileSize | kFlagMatchInterimNaming
//
// Note: This option comes into effect only if both share_files_with_checksum
// and share_table_files are true. In the cases of old table files where no
// db_session_id is stored, we use the file_size to replace the empty
// db_session_id as a fallback.
BackupTableNameOption share_files_with_checksum_naming;
// and share_table_files are true.
ShareFilesNaming share_files_with_checksum_naming;
void Dump(Logger* logger) const;
@ -175,8 +206,9 @@ struct BackupableDBOptions {
uint64_t _restore_rate_limit = 0, int _max_background_operations = 1,
uint64_t _callback_trigger_interval_size = 4 * 1024 * 1024,
int _max_valid_backups_to_open = INT_MAX,
BackupTableNameOption _share_files_with_checksum_naming =
kOptionalChecksumAndDbSessionId)
ShareFilesNaming _share_files_with_checksum_naming =
static_cast<ShareFilesNaming>(kUseDbSessionId | kFlagIncludeFileSize |
kFlagMatchInterimNaming))
: backup_dir(_backup_dir),
backup_env(_backup_env),
share_table_files(_share_table_files),
@ -192,16 +224,36 @@ struct BackupableDBOptions {
max_valid_backups_to_open(_max_valid_backups_to_open),
share_files_with_checksum_naming(_share_files_with_checksum_naming) {
assert(share_table_files || !share_files_with_checksum);
assert((share_files_with_checksum_naming & kMaskNoNamingFlags) != 0);
}
};
inline BackupableDBOptions::ShareFilesNaming operator&(
BackupableDBOptions::ShareFilesNaming lhs,
BackupableDBOptions::ShareFilesNaming rhs) {
int l = static_cast<int>(lhs);
int r = static_cast<int>(rhs);
assert(r == BackupableDBOptions::kMaskNoNamingFlags ||
(r & BackupableDBOptions::kMaskNoNamingFlags) == 0);
return static_cast<BackupableDBOptions::ShareFilesNaming>(l & r);
}
inline BackupableDBOptions::ShareFilesNaming operator|(
BackupableDBOptions::ShareFilesNaming lhs,
BackupableDBOptions::ShareFilesNaming rhs) {
int l = static_cast<int>(lhs);
int r = static_cast<int>(rhs);
assert((r & BackupableDBOptions::kMaskNoNamingFlags) == 0);
return static_cast<BackupableDBOptions::ShareFilesNaming>(l | r);
}
struct CreateBackupOptions {
// Flush will always trigger if 2PC is enabled.
// If write-ahead logs are disabled, set flush_before_backup=true to
// avoid losing unflushed key/value pairs from the memtable.
bool flush_before_backup = false;
// Callback for reporting progress.
// Callback for reporting progress, based on callback_trigger_interval_size.
std::function<void()> progress_callback = []() {};
// If false, background_thread_cpu_priority is ignored.

View File

@ -6,7 +6,7 @@
#define ROCKSDB_MAJOR 6
#define ROCKSDB_MINOR 12
#define ROCKSDB_PATCH 0
#define ROCKSDB_PATCH 8
// Do not use these. We made the mistake of declaring macros starting with
// double underscore. Now we have to live with our choice. We'll deprecate these

View File

@ -813,7 +813,12 @@ TEST_F(OptionsTest, GetBlockBasedTableOptionsFromString) {
"block_cache=1M;block_cache_compressed=1k;block_size=1024;"
"block_size_deviation=8;block_restart_interval=4;"
"format_version=5;whole_key_filtering=1;"
"filter_policy=bloomfilter:4.567:false;",
"filter_policy=bloomfilter:4.567:false;"
// A bug caused read_amp_bytes_per_bit to be a large integer in OPTIONS
// file generated by 6.10 to 6.14. Though bug is fixed in these releases,
// we need to handle the case of loading OPTIONS file generated before the
// fix.
"read_amp_bytes_per_bit=17179869185;",
&new_opt));
ASSERT_TRUE(new_opt.cache_index_and_filter_blocks);
ASSERT_EQ(new_opt.index_type, BlockBasedTableOptions::kHashSearch);
@ -834,6 +839,9 @@ TEST_F(OptionsTest, GetBlockBasedTableOptionsFromString) {
dynamic_cast<const BloomFilterPolicy&>(*new_opt.filter_policy);
EXPECT_EQ(bfp.GetMillibitsPerKey(), 4567);
EXPECT_EQ(bfp.GetWholeBitsPerKey(), 5);
// Verify that only the lower 32bits are stored in
// new_opt.read_amp_bytes_per_bit.
EXPECT_EQ(1U, new_opt.read_amp_bytes_per_bit);
// unknown option
ASSERT_NOK(GetBlockBasedTableOptionsFromString(

View File

@ -331,8 +331,24 @@ static std::unordered_map<std::string, OptionTypeInfo>
OptionTypeFlags::kNone, 0}},
{"read_amp_bytes_per_bit",
{offsetof(struct BlockBasedTableOptions, read_amp_bytes_per_bit),
OptionType::kSizeT, OptionVerificationType::kNormal,
OptionTypeFlags::kNone, 0}},
OptionType::kUInt32T, OptionVerificationType::kNormal,
OptionTypeFlags::kNone, 0,
[](const ConfigOptions& /*opts*/, const std::string& /*name*/,
const std::string& value, char* addr) {
// A workaround to fix a bug in 6.10, 6.11, 6.12, 6.13
// and 6.14. The bug will write out 8 bytes to OPTIONS file from the
// starting address of BlockBasedTableOptions.read_amp_bytes_per_bit
// which is actually a uint32. Consequently, the value of
// read_amp_bytes_per_bit written in the OPTIONS file is wrong.
// From 6.15, RocksDB will try to parse the read_amp_bytes_per_bit
// from OPTIONS file as a uint32. To be able to load OPTIONS file
// generated by affected releases before the fix, we need to
// manually parse read_amp_bytes_per_bit with this special hack.
uint64_t read_amp_bytes_per_bit = ParseUint64(value);
*(reinterpret_cast<uint32_t*>(addr)) =
static_cast<uint32_t>(read_amp_bytes_per_bit);
return Status::OK();
}}},
{"enable_index_compression",
{offsetof(struct BlockBasedTableOptions, enable_index_compression),
OptionType::kBoolean, OptionVerificationType::kNormal,

View File

@ -173,7 +173,7 @@ void PartitionIndexReader::CacheDependencies(const ReadOptions& ro, bool pin) {
assert(s.ok() || block.GetValue() == nullptr);
if (s.ok() && block.GetValue() != nullptr) {
if (block.IsCached()) {
if (block.IsCached() || block.GetOwnValue()) {
if (pin) {
partition_map_[handle.offset()] = std::move(block);
}

View File

@ -149,6 +149,11 @@ class IteratorWrapperBase {
return result_.value_prepared;
}
Slice user_key() const {
assert(Valid());
return iter_->user_key();
}
private:
void Update() {
valid_ = iter_->Valid();

View File

@ -96,8 +96,12 @@ void PropertyBlockBuilder::AddTableProperty(const TableProperties& props) {
if (props.file_creation_time > 0) {
Add(TablePropertiesNames::kFileCreationTime, props.file_creation_time);
}
Add(TablePropertiesNames::kDbId, props.db_id);
Add(TablePropertiesNames::kDbSessionId, props.db_session_id);
if (!props.db_id.empty()) {
Add(TablePropertiesNames::kDbId, props.db_id);
}
if (!props.db_session_id.empty()) {
Add(TablePropertiesNames::kDbSessionId, props.db_session_id);
}
if (!props.filter_policy_name.empty()) {
Add(TablePropertiesNames::kFilterPolicy, props.filter_policy_name);

View File

@ -43,6 +43,10 @@ class TwoLevelIndexIterator : public InternalIteratorBase<IndexValue> {
assert(Valid());
return second_level_iter_.key();
}
Slice user_key() const override {
assert(Valid());
return second_level_iter_.user_key();
}
IndexValue value() const override {
assert(Valid());
return second_level_iter_.value();

View File

@ -47,6 +47,8 @@
namespace ROCKSDB_NAMESPACE {
namespace {
using ShareFilesNaming = BackupableDBOptions::ShareFilesNaming;
inline uint32_t ChecksumHexToInt32(const std::string& checksum_hex) {
std::string checksum_str;
Slice(checksum_hex).DecodeHex(&checksum_str);
@ -149,9 +151,13 @@ class BackupEngineImpl : public BackupEngine {
Status Initialize();
// Obtain the naming option for backup table files
BackupTableNameOption GetTableNamingOption() const {
return options_.share_files_with_checksum_naming;
ShareFilesNaming GetNamingNoFlags() const {
return options_.share_files_with_checksum_naming &
BackupableDBOptions::kMaskNoNamingFlags;
}
ShareFilesNaming GetNamingFlags() const {
return options_.share_files_with_checksum_naming &
BackupableDBOptions::kMaskNamingFlags;
}
private:
@ -186,7 +192,7 @@ class BackupEngineImpl : public BackupEngine {
// currently
const std::string db_id;
// db_session_id appears in the backup SST filename if the table naming
// option is kOptionalChecksumAndDbSessionId
// option is kUseDbSessionId
const std::string db_session_id;
};
@ -320,9 +326,17 @@ class BackupEngineImpl : public BackupEngine {
return GetSharedChecksumDirRel() + "/" + (tmp ? "." : "") + file +
(tmp ? ".tmp" : "");
}
inline bool UseSessionId(const std::string& sid) const {
return GetTableNamingOption() == kOptionalChecksumAndDbSessionId &&
!sid.empty();
inline bool UseLegacyNaming(const std::string& sid) const {
return GetNamingNoFlags() ==
BackupableDBOptions::kLegacyCrc32cAndFileSize ||
sid.empty();
}
inline bool UseInterimNaming(const std::string& sid) const {
// The indicator of SST file from early internal 6.12 release
// is a '-' in the DB session id. DB session id was made more
// concise without '-' after that.
return (GetNamingFlags() & BackupableDBOptions::kFlagMatchInterimNaming) &&
sid.find('-') != std::string::npos;
}
inline std::string GetSharedFileWithChecksum(
const std::string& file, bool has_checksum,
@ -330,19 +344,22 @@ class BackupEngineImpl : public BackupEngine {
const std::string& db_session_id) const {
assert(file.size() == 0 || file[0] != '/');
std::string file_copy = file;
if (UseSessionId(db_session_id)) {
if (has_checksum) {
return file_copy.insert(file_copy.find_last_of('.'),
"_" + checksum_hex + "_" + db_session_id);
} else {
return file_copy.insert(file_copy.find_last_of('.'),
"_" + db_session_id);
}
if (UseLegacyNaming(db_session_id)) {
assert(has_checksum);
(void)has_checksum;
file_copy.insert(file_copy.find_last_of('.'),
"_" + ToString(ChecksumHexToInt32(checksum_hex)) + "_" +
ToString(file_size));
} else if (UseInterimNaming(db_session_id)) {
file_copy.insert(file_copy.find_last_of('.'), "_" + db_session_id);
} else {
return file_copy.insert(file_copy.find_last_of('.'),
"_" + ToString(ChecksumHexToInt32(checksum_hex)) +
"_" + ToString(file_size));
file_copy.insert(file_copy.find_last_of('.'), "_s" + db_session_id);
if (GetNamingFlags() & BackupableDBOptions::kFlagIncludeFileSize) {
file_copy.insert(file_copy.find_last_of('.'),
"_" + ToString(file_size));
}
}
return file_copy;
}
inline std::string GetFileFromChecksumFile(const std::string& file) const {
assert(file.size() == 0 || file[0] != '/');
@ -1598,7 +1615,7 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
// Step 1: Prepare the relative path to destination
if (shared && shared_checksum) {
if (GetTableNamingOption() == kOptionalChecksumAndDbSessionId) {
if (GetNamingNoFlags() != BackupableDBOptions::kLegacyCrc32cAndFileSize) {
// Prepare db_session_id to add to the file name
// Ignore the returned status
// In the failed cases, db_id and db_session_id will be empty
@ -1621,7 +1638,7 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
return Status::NotFound("File missing: " + src_dir + fname);
}
// dst_relative depends on the following conditions:
// 1) the naming scheme is kOptionalChecksumAndDbSessionId,
// 1) the naming scheme is kUseDbSessionId,
// 2) db_session_id is not empty,
// 3) checksum is available in the DB manifest.
// If 1,2,3) are satisfied, then dst_relative will be of the form:
@ -1697,6 +1714,7 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
} else {
// file exists and referenced
if (!has_checksum) {
// FIXME(peterd): extra I/O
s = CalculateChecksum(src_dir + fname, db_env_, src_env_options,
size_limit, &checksum_hex);
if (!s.ok()) {
@ -1704,7 +1722,7 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
}
has_checksum = true;
}
if (UseSessionId(db_session_id)) {
if (!db_session_id.empty()) {
ROCKS_LOG_INFO(options_.info_log,
"%s already present, with checksum %s, size %" PRIu64
" and DB session identity %s",
@ -1734,6 +1752,7 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
ROCKS_LOG_INFO(options_.info_log,
"%s already present, calculate checksum", fname.c_str());
if (!has_checksum) {
// FIXME(peterd): extra I/O
s = CalculateChecksum(src_dir + fname, db_env_, src_env_options,
size_limit, &checksum_hex);
if (!s.ok()) {

View File

@ -13,6 +13,7 @@
#include <algorithm>
#include <limits>
#include <regex>
#include <string>
#include <utility>
@ -37,6 +38,15 @@
namespace ROCKSDB_NAMESPACE {
namespace {
using ShareFilesNaming = BackupableDBOptions::ShareFilesNaming;
const auto kLegacyCrc32cAndFileSize =
BackupableDBOptions::kLegacyCrc32cAndFileSize;
const auto kUseDbSessionId = BackupableDBOptions::kUseDbSessionId;
const auto kFlagIncludeFileSize = BackupableDBOptions::kFlagIncludeFileSize;
const auto kFlagMatchInterimNaming =
BackupableDBOptions::kFlagMatchInterimNaming;
const auto kNamingDefault =
kUseDbSessionId | kFlagIncludeFileSize | kFlagMatchInterimNaming;
class DummyDB : public StackableDB {
public:
@ -634,8 +644,8 @@ class BackupableDBTest : public testing::Test {
backup_engine_.reset();
}
void OpenBackupEngine() {
backupable_options_->destroy_old_data = false;
void OpenBackupEngine(bool destroy_old_data = false) {
backupable_options_->destroy_old_data = destroy_old_data;
BackupEngine* backup_engine;
ASSERT_OK(BackupEngine::Open(test_db_env_.get(), *backupable_options_,
&backup_engine));
@ -725,6 +735,45 @@ class BackupableDBTest : public testing::Test {
return WriteStringToFile(test_db_env_.get(), file_contents, fname);
}
void AssertDirectoryFilesMatchRegex(const std::string& dir,
const std::regex& pattern,
int minimum_count) {
std::vector<FileAttributes> children;
ASSERT_OK(file_manager_->GetChildrenFileAttributes(dir, &children));
int found_count = 0;
for (const auto& child : children) {
if (child.name == "." || child.name == "..") {
continue;
}
const std::string match("match");
ASSERT_EQ(match, std::regex_replace(child.name, pattern, match));
++found_count;
}
ASSERT_GE(found_count, minimum_count);
}
void AssertDirectoryFilesSizeIndicators(const std::string& dir,
int minimum_count) {
std::vector<FileAttributes> children;
ASSERT_OK(file_manager_->GetChildrenFileAttributes(dir, &children));
int found_count = 0;
for (const auto& child : children) {
if (child.name == "." || child.name == "..") {
continue;
}
auto last_underscore = child.name.find_last_of('_');
auto last_dot = child.name.find_last_of('.');
ASSERT_NE(child.name, child.name.substr(0, last_underscore));
ASSERT_NE(child.name, child.name.substr(0, last_dot));
ASSERT_LT(last_underscore, last_dot);
std::string s = child.name.substr(last_underscore + 1,
last_dot - (last_underscore + 1));
ASSERT_EQ(s, ToString(child.size_bytes));
++found_count;
}
ASSERT_GE(found_count, minimum_count);
}
// files
std::string dbname_;
std::string backupdir_;
@ -1253,7 +1302,8 @@ TEST_P(BackupableDBTestWithParam, TableFileCorruptedBeforeBackup) {
TEST_F(BackupableDBTest, TableFileWithoutDbChecksumCorruptedDuringBackup) {
const int keys_iteration = 50000;
backupable_options_->share_files_with_checksum_naming = kChecksumAndFileSize;
backupable_options_->share_files_with_checksum_naming =
kLegacyCrc32cAndFileSize;
// When share_files_with_checksum is on, we calculate checksums of table
// files before and after copying. So we can test whether a corruption has
// happened during the file is copied to backup directory.
@ -1357,6 +1407,38 @@ TEST_F(BackupableDBTest, InterruptCreationTest) {
AssertBackupConsistency(0, 0, keys_iteration);
}
TEST_F(BackupableDBTest, FlushCompactDuringBackupCheckpoint) {
const int keys_iteration = 5000;
options_.file_checksum_gen_factory = GetFileChecksumGenCrc32cFactory();
for (const auto& sopt : kAllShareOptions) {
OpenDBAndBackupEngine(true /* destroy_old_data */, false /* dummy */, sopt);
FillDB(db_.get(), 0, keys_iteration);
// That FillDB leaves a mix of flushed and unflushed data
SyncPoint::GetInstance()->LoadDependency(
{{"CheckpointImpl::CreateCustomCheckpoint:AfterGetLive1",
"BackupableDBTest::FlushDuringBackupCheckpoint:BeforeFlush"},
{"BackupableDBTest::FlushDuringBackupCheckpoint:AfterFlush",
"CheckpointImpl::CreateCustomCheckpoint:AfterGetLive2"}});
SyncPoint::GetInstance()->EnableProcessing();
ROCKSDB_NAMESPACE::port::Thread flush_thread{[this]() {
TEST_SYNC_POINT(
"BackupableDBTest::FlushDuringBackupCheckpoint:BeforeFlush");
FillDB(db_.get(), keys_iteration, 2 * keys_iteration);
ASSERT_OK(db_->Flush(FlushOptions()));
DBImpl* dbi = static_cast<DBImpl*>(db_.get());
dbi->TEST_WaitForFlushMemTable();
ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
dbi->TEST_WaitForCompact();
TEST_SYNC_POINT(
"BackupableDBTest::FlushDuringBackupCheckpoint:AfterFlush");
}};
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
flush_thread.join();
CloseDBAndBackupEngine();
AssertBackupConsistency(0, 0, keys_iteration);
}
}
inline std::string OptionsPath(std::string ret, int backupID) {
ret += "/private/";
ret += std::to_string(backupID);
@ -1564,42 +1646,146 @@ TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsTransition) {
}
}
// Verify backup and restore with share_files_with_checksum on and
// share_files_with_checksum_naming = kOptionalChecksumAndDbSessionId
// Verify backup and restore with various naming options, check names
TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsNewNaming) {
// Use session id in the name of SST files
ASSERT_TRUE(backupable_options_->share_files_with_checksum_naming ==
kOptionalChecksumAndDbSessionId);
kNamingDefault);
const int keys_iteration = 5000;
int i = 0;
OpenDBAndBackupEngine(true, false, kShareWithChecksum);
FillDB(db_.get(), keys_iteration * i, keys_iteration * (i + 1));
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), !!(i % 2)));
FillDB(db_.get(), 0, keys_iteration);
CloseDBAndBackupEngine();
AssertBackupConsistency(i + 1, 0, keys_iteration * (i + 1),
keys_iteration * (i + 2));
// Both checksum and session id in the name of SST files
options_.file_checksum_gen_factory = GetFileChecksumGenCrc32cFactory();
OpenDBAndBackupEngine(false, false, kShareWithChecksum);
FillDB(db_.get(), keys_iteration * i, keys_iteration * (i + 1));
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), !!(i % 2)));
static const std::map<ShareFilesNaming, std::string> option_to_expected = {
{kLegacyCrc32cAndFileSize, "[0-9]+_[0-9]+_[0-9]+[.]sst"},
// kFlagIncludeFileSize redundant here
{kLegacyCrc32cAndFileSize | kFlagIncludeFileSize,
"[0-9]+_[0-9]+_[0-9]+[.]sst"},
{kUseDbSessionId, "[0-9]+_s[0-9A-Z]{20}[.]sst"},
{kUseDbSessionId | kFlagIncludeFileSize,
"[0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst"},
};
for (const auto& pair : option_to_expected) {
// kFlagMatchInterimNaming must not matter on new SST files
for (const auto option :
{pair.first, pair.first | kFlagMatchInterimNaming}) {
CloseAndReopenDB();
backupable_options_->share_files_with_checksum_naming = option;
OpenBackupEngine(true /*destroy_old_data*/);
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
CloseDBAndBackupEngine();
AssertBackupConsistency(1, 0, keys_iteration, keys_iteration * 2);
AssertDirectoryFilesMatchRegex(backupdir_ + "/shared_checksum",
std::regex(pair.second),
1 /* minimum_count */);
if (std::string::npos != pair.second.find("_[0-9]+[.]sst")) {
AssertDirectoryFilesSizeIndicators(backupdir_ + "/shared_checksum",
1 /* minimum_count */);
}
}
}
}
// Mimic SST file generated by early internal-only 6.12 release
// and test various naming options. This test can be removed when
// the kFlagMatchInterimNaming feature is removed.
TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsInterimNaming) {
const int keys_iteration = 5000;
// Essentially, reinstate old implementaiton of generating a DB
// session id. This is how we distinguish "interim" SST files from
// newer ones: from the form of the db session id string.
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"DBImpl::SetDbSessionId", [&](void* sid_void_star) {
std::string* sid = static_cast<std::string*>(sid_void_star);
*sid = test_db_env_->GenerateUniqueId();
if (!sid->empty() && sid->back() == '\n') {
sid->pop_back();
}
});
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
OpenDBAndBackupEngine(true, false, kShareWithChecksum);
FillDB(db_.get(), 0, keys_iteration);
CloseDBAndBackupEngine();
AssertBackupConsistency(i + 1, 0, keys_iteration * (i + 1),
keys_iteration * (i + 2));
static const std::map<ShareFilesNaming, std::string> option_to_expected = {
{kLegacyCrc32cAndFileSize, "[0-9]+_[0-9]+_[0-9]+[.]sst"},
// kFlagMatchInterimNaming ignored here
{kLegacyCrc32cAndFileSize | kFlagMatchInterimNaming,
"[0-9]+_[0-9]+_[0-9]+[.]sst"},
{kUseDbSessionId, "[0-9]+_s[0-9a-fA-F-]+[.]sst"},
{kUseDbSessionId | kFlagIncludeFileSize,
"[0-9]+_s[0-9a-fA-F-]+_[0-9]+[.]sst"},
{kUseDbSessionId | kFlagMatchInterimNaming, "[0-9]+_[0-9a-fA-F-]+[.]sst"},
{kUseDbSessionId | kFlagIncludeFileSize | kFlagMatchInterimNaming,
"[0-9]+_[0-9a-fA-F-]+[.]sst"},
};
for (const auto& pair : option_to_expected) {
CloseAndReopenDB();
backupable_options_->share_files_with_checksum_naming = pair.first;
OpenBackupEngine(true /*destroy_old_data*/);
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
CloseDBAndBackupEngine();
AssertBackupConsistency(1, 0, keys_iteration, keys_iteration * 2);
AssertDirectoryFilesMatchRegex(backupdir_ + "/shared_checksum",
std::regex(pair.second),
1 /* minimum_count */);
}
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
}
// Mimic SST file generated by pre-6.12 releases and verify that
// old names are always used regardless of naming option.
TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsOldFileNaming) {
const int keys_iteration = 5000;
// Pre-6.12 release did not include db id and db session id properties.
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"PropertyBlockBuilder::AddTableProperty:Start", [&](void* props_vs) {
auto props = static_cast<TableProperties*>(props_vs);
props->db_id = "";
props->db_session_id = "";
});
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
OpenDBAndBackupEngine(true, false, kShareWithChecksum);
FillDB(db_.get(), 0, keys_iteration);
CloseDBAndBackupEngine();
// Old names should always be used on old files
const std::regex expected("[0-9]+_[0-9]+_[0-9]+[.]sst");
for (ShareFilesNaming option : {kNamingDefault, kUseDbSessionId}) {
CloseAndReopenDB();
backupable_options_->share_files_with_checksum_naming = option;
OpenBackupEngine(true /*destroy_old_data*/);
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
CloseDBAndBackupEngine();
AssertBackupConsistency(1, 0, keys_iteration, keys_iteration * 2);
AssertDirectoryFilesMatchRegex(backupdir_ + "/shared_checksum", expected,
1 /* minimum_count */);
}
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
}
// Verify backup and restore with share_files_with_checksum off and then
// transition this option to on and share_files_with_checksum_naming to be
// kOptionalChecksumAndDbSessionId
// based on kUseDbSessionId
TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsNewNamingTransition) {
const int keys_iteration = 5000;
// We may set share_files_with_checksum_naming to kChecksumAndFileSize
// We may set share_files_with_checksum_naming to kLegacyCrc32cAndFileSize
// here but even if we don't, it should have no effect when
// share_files_with_checksum is false
ASSERT_TRUE(backupable_options_->share_files_with_checksum_naming ==
kOptionalChecksumAndDbSessionId);
kNamingDefault);
// set share_files_with_checksum to false
OpenDBAndBackupEngine(true, false, kShareNoChecksum);
int j = 3;
@ -1617,7 +1803,7 @@ TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsNewNamingTransition) {
// set share_files_with_checksum to true and do some more backups
// and use session id in the name of SST file backup
ASSERT_TRUE(backupable_options_->share_files_with_checksum_naming ==
kOptionalChecksumAndDbSessionId);
kNamingDefault);
OpenDBAndBackupEngine(false /* destroy_old_data */, false,
kShareWithChecksum);
FillDB(db_.get(), keys_iteration * j, keys_iteration * (j + 1));
@ -1637,9 +1823,9 @@ TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsNewNamingTransition) {
// For an extra challenge, make sure that GarbageCollect / DeleteBackup
// is OK even if we open without share_table_files but with
// share_files_with_checksum_naming being kOptionalChecksumAndDbSessionId
// share_files_with_checksum_naming based on kUseDbSessionId
ASSERT_TRUE(backupable_options_->share_files_with_checksum_naming ==
kOptionalChecksumAndDbSessionId);
kNamingDefault);
OpenDBAndBackupEngine(false /* destroy_old_data */, false, kNoShare);
backup_engine_->DeleteBackup(1);
backup_engine_->GarbageCollect();
@ -1651,7 +1837,8 @@ TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsNewNamingTransition) {
// Use checksum and file size for backup table file names and open without
// share_table_files
// Again, make sure that GarbageCollect / DeleteBackup is OK
backupable_options_->share_files_with_checksum_naming = kChecksumAndFileSize;
backupable_options_->share_files_with_checksum_naming =
kLegacyCrc32cAndFileSize;
OpenDBAndBackupEngine(false /* destroy_old_data */, false, kNoShare);
backup_engine_->DeleteBackup(2);
backup_engine_->GarbageCollect();
@ -1665,9 +1852,10 @@ TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsNewNamingTransition) {
}
// Verify backup and restore with share_files_with_checksum on and transition
// from kChecksumAndFileSize to kOptionalChecksumAndDbSessionId
// from kLegacyCrc32cAndFileSize to kUseDbSessionId
TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsNewNamingUpgrade) {
backupable_options_->share_files_with_checksum_naming = kChecksumAndFileSize;
backupable_options_->share_files_with_checksum_naming =
kLegacyCrc32cAndFileSize;
const int keys_iteration = 5000;
// set share_files_with_checksum to true
OpenDBAndBackupEngine(true, false, kShareWithChecksum);
@ -1683,8 +1871,7 @@ TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsNewNamingUpgrade) {
keys_iteration * (j + 1));
}
backupable_options_->share_files_with_checksum_naming =
kOptionalChecksumAndDbSessionId;
backupable_options_->share_files_with_checksum_naming = kUseDbSessionId;
OpenDBAndBackupEngine(false /* destroy_old_data */, false,
kShareWithChecksum);
FillDB(db_.get(), keys_iteration * j, keys_iteration * (j + 1));
@ -1715,7 +1902,8 @@ TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsNewNamingUpgrade) {
// Use checksum and file size for backup table file names and open without
// share_table_files
// Again, make sure that GarbageCollect / DeleteBackup is OK
backupable_options_->share_files_with_checksum_naming = kChecksumAndFileSize;
backupable_options_->share_files_with_checksum_naming =
kLegacyCrc32cAndFileSize;
OpenDBAndBackupEngine(false /* destroy_old_data */, false, kNoShare);
backup_engine_->DeleteBackup(2);
backup_engine_->GarbageCollect();

View File

@ -98,6 +98,9 @@ void BlobDBImpl::GetLiveFilesMetaData(std::vector<LiveFileMetaData>* metadata) {
// Path should be relative to db_name, but begin with slash.
filemetadata.name = BlobFileName("", bdb_options_.blob_dir, file_number);
filemetadata.file_number = file_number;
if (blob_file->HasTTL()) {
filemetadata.oldest_ancester_time = blob_file->GetExpirationRange().first;
}
auto cfh =
static_cast_with_check<ColumnFamilyHandleImpl>(DefaultColumnFamily());
filemetadata.column_family_name = cfh->GetName();

View File

@ -791,29 +791,50 @@ TEST_F(BlobDBTest, ColumnFamilyNotSupported) {
TEST_F(BlobDBTest, GetLiveFilesMetaData) {
Random rnd(301);
BlobDBOptions bdb_options;
bdb_options.blob_dir = "blob_dir";
bdb_options.path_relative = true;
bdb_options.ttl_range_secs = 10;
bdb_options.min_blob_size = 0;
bdb_options.disable_background_tasks = true;
Open(bdb_options);
Options options;
options.env = mock_env_.get();
Open(bdb_options, options);
std::map<std::string, std::string> data;
for (size_t i = 0; i < 100; i++) {
PutRandom("key" + ToString(i), &rnd, &data);
}
constexpr uint64_t expiration = 1000ULL;
PutRandomUntil("key100", expiration, &rnd, &data);
std::vector<LiveFileMetaData> metadata;
blob_db_->GetLiveFilesMetaData(&metadata);
ASSERT_EQ(1U, metadata.size());
ASSERT_EQ(2U, metadata.size());
// Path should be relative to db_name, but begin with slash.
std::string filename = "/blob_dir/000001.blob";
ASSERT_EQ(filename, metadata[0].name);
const std::string filename1("/blob_dir/000001.blob");
ASSERT_EQ(filename1, metadata[0].name);
ASSERT_EQ(1, metadata[0].file_number);
ASSERT_EQ("default", metadata[0].column_family_name);
ASSERT_EQ(0, metadata[0].oldest_ancester_time);
ASSERT_EQ(kDefaultColumnFamilyName, metadata[0].column_family_name);
const std::string filename2("/blob_dir/000002.blob");
ASSERT_EQ(filename2, metadata[1].name);
ASSERT_EQ(2, metadata[1].file_number);
ASSERT_EQ(expiration, metadata[1].oldest_ancester_time);
ASSERT_EQ(kDefaultColumnFamilyName, metadata[1].column_family_name);
std::vector<std::string> livefile;
uint64_t mfs;
ASSERT_OK(blob_db_->GetLiveFiles(livefile, &mfs, false));
ASSERT_EQ(4U, livefile.size());
ASSERT_EQ(filename, livefile[3]);
ASSERT_EQ(5U, livefile.size());
ASSERT_EQ(filename1, livefile[3]);
ASSERT_EQ(filename2, livefile[4]);
VerifyDB(data);
}

View File

@ -189,7 +189,9 @@ class BlobFile {
// All Get functions which are not atomic, will need ReadLock on the mutex
ExpirationRange GetExpirationRange() const { return expiration_range_; }
const ExpirationRange& GetExpirationRange() const {
return expiration_range_;
}
void ExtendExpirationRange(uint64_t expiration) {
expiration_range_.first = std::min(expiration_range_.first, expiration);

View File

@ -250,6 +250,8 @@ Status CheckpointImpl::CreateCustomCheckpoint(
TEST_SYNC_POINT("CheckpointImpl::CreateCheckpoint:SavedLiveFiles2");
db_->FlushWAL(false /* sync */);
}
TEST_SYNC_POINT("CheckpointImpl::CreateCustomCheckpoint:AfterGetLive1");
TEST_SYNC_POINT("CheckpointImpl::CreateCustomCheckpoint:AfterGetLive2");
// if we have more than one column family, we need to also get WAL files
if (s.ok()) {
s = db_->GetSortedWalFiles(live_wal_files);
@ -314,8 +316,15 @@ Status CheckpointImpl::CreateCustomCheckpoint(
// find checksum info for table files
s = checksum_list->SearchOneFileChecksum(number, &checksum_value,
&checksum_name);
// XXX/FIXME(peterd): There's currently a race between GetLiveFiles
// and GetLiveFilesChecksumInfo that could lead to not finding
// checksum info on a file that has it. For now, we can accept
// that and treat it like a legacy file lacking checksum info.
if (!s.ok()) {
return Status::NotFound("Can't find checksum for " + src_fname);
assert(checksum_name == kUnknownFileChecksumFuncName);
assert(checksum_value == kUnknownFileChecksum);
s = Status::OK();
}
}
s = copy_file_cb(db_->GetName(), src_fname,