Less I/O for incremental backups, slightly better corruption detection (#7413)

Summary:
Two relatively simple functional changes to incremental backup
behavior, integrated with a minor refactoring to reduce code redundancy and
improve error/log messages. There are nuances to the impact of these changes,
but I believe they are fundamentally good and generally safe. Those functional
changes:

* Incremental backups no longer read DB table files that are already saved to a
shared part of the backup directory, unless `share_files_with_checksum` is used
with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file
checksums are needed to determine file naming.
  * Justification: incremental backups should not need to read the whole DB,
especially without rate limiting. (Although other BackupEngine reads are not
rate limited either, other non-trivial reads are generally limited by a
corresponding write, as in copying files.) Also, the fact that this is not
already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110.

* When considering whether a table file is already backed up in a shared part
of backup directory, BackupEngine would already query the sizes of source (DB)
and pre-existing destination (backup) files. BackupEngine now uses these file
sizes to detect corruption, as at least one of (a) old backup, (b) backup in
progress, or (c) current DB is corrupt if there's a size mismatch.
  * Justification: a random related fix that also helps to cover a small hole
in corruption checking uncovered by the other functional change:
  * For `share_table_files` without "checksum" (not recommended), the other
change regresses in detecting fundamentally unsafe use of this option
combination: when you might generate different versions of the same SST file
number. As demonstrated by `BackupableDBTest.FailOverwritingBackups`, this
regression is greatly mitigated by the new file size checking. Nevertheless,
almost no reason to use `share_files_with_checksum=false` should remain, and
comments are updated appropriately.
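
The two functional changes above can be sketched together as follows. This is
a minimal, self-contained illustration and not the BackupEngine code:
`FileInfo`, `Decision`, and `ConsiderSharedFile` are hypothetical names, and
the map stands in for the engine's record of already backed-up shared files.
The assumption is that only the file size is queried from the filesystem; the
DB file contents are never read when the file is already backed up.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the per-file record kept for a shared backup file.
struct FileInfo {
  std::string checksum_hex;
  uint64_t size;
};

enum class Decision { kCopy, kReuse, kCorruption };

// Decide what to do with a DB table file whose shared destination name is
// dst_name, given only its size as reported by the filesystem.
Decision ConsiderSharedFile(const std::map<std::string, FileInfo>& backed_up,
                            const std::string& dst_name, uint64_t db_size,
                            std::string* checksum_hex) {
  assert(checksum_hex != nullptr);
  auto it = backed_up.find(dst_name);
  if (it == backed_up.end()) {
    // Not backed up yet: the file must be copied (and is read while copying).
    return Decision::kCopy;
  }
  if (it->second.size != db_size) {
    // Old backup, backup in progress, or current DB is corrupt.
    return Decision::kCorruption;
  }
  // Reuse the previously recorded checksum instead of re-reading the DB file.
  *checksum_hex = it->second.checksum_hex;
  return Decision::kReuse;
}
```

In this sketch a size mismatch is reported before any checksum is considered,
which mirrors the order described above: sizes are cheap to obtain, so they are
checked even when the checksum is simply carried over from the prior backup.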

Also, this change renames internal function `CalculateChecksum` to
`ReadFileAndComputeChecksum` to make the performance impact of this function
clear in code reviews.

It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it
cannot be true for a DB with unique file names (like DBImpl). Nevertheless,
I've tried to keep its functionality intact when `true` to minimize risk for
now, despite having no unit tests for which it is true.

Select impact details (much more in unit tests): For
`share_files_with_checksum`, I am confident there is no regression (vs.
pre-6.12) in detecting DB or backup corruption at backup creation time, mostly
because the old design did not leverage this extra checksum computation for
detecting inconsistencies at backup creation time. (With computed checksums in
names, a recently corrupted file just looked like a different file vs. what was
already backed up.)

Even in the hypothetical case of DB session id collision (~100 bits entropy
collision), file size in name and/or our file size check add an extra layer of
protection against false success in creating an accurate new backup. (Unit test
included.)
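
To illustrate why file size helps even under a session id collision, here is a
small sketch. The name format built by `SharedName` is hypothetical and only
approximates the idea of "file number plus session id, optionally plus size";
it is not claimed to be the exact format BackupEngine writes.

```cpp
#include <cstdint>
#include <string>

// Hypothetical shared-file name: file number and session id, with the file
// size appended only when include_size is true.
std::string SharedName(uint64_t file_number, const std::string& session_id,
                       uint64_t size, bool include_size) {
  std::string name = std::to_string(file_number) + "_s" + session_id;
  if (include_size) {
    name += "_" + std::to_string(size);
  }
  return name + ".sst";
}
```

With the size in the name, two different files that share a file number and
session id but differ in size get distinct names, so both can be stored. Without
the size in the name, the names collide, and it is the separate size comparison
that turns the collision into a reported corruption rather than a silently
inaccurate backup.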

`DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking
are still able to catch corruptions that `CreateNewBackup` does not. Note that
when custom file checksum support is added to BackupEngine, that will
essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`.
We could add options for `CreateNewBackup` to cover some of what would be
caught by `VerifyBackup` with checksum checking.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413

Test Plan:
Two new unit tests included, both of which fail without these
changes. Although we don't test the I/O improvement directly, we test it
indirectly in DB corruption detection power that was inadvertently unlocked
with new backup file naming PLUS computing current content checksums (now
removed). (I don't think that case of DB corruption detection justifies reading
the whole DB on incremental backup.)

Reviewed By: zhichao-cao

Differential Revision: D23818480

Pulled By: pdillinger

fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac
Peter Dillinger 2020-09-21 16:18:11 -07:00
parent fb98398ca9
commit cda8a74eb7
4 changed files with 392 additions and 91 deletions

@@ -1,5 +1,5 @@
 # Rocksdb Change Log
-## 6.13 (09/12/2020)
+## 6.13 (09/22/2020)
 ### Bug fixes
 * Fix a performance regression introduced in 6.4 that makes a upper bound check for every Next() even if keys are within a data block that is within the upper bound.
 * Fix a possible corruption to the LSM state (overlapping files within a level) when a `CompactRange()` for refitting levels (`CompactRangeOptions::change_level == true`) and another manual compaction are executed in parallel.
@@ -20,6 +20,10 @@
 ### Performance Improvements
 * Reduce thread number for multiple DB instances by re-using one global thread for statistics dumping and persisting.
 * Reduce write-amp in heavy write bursts in `kCompactionStyleLevel` compaction style with `level_compaction_dynamic_level_bytes` set.
+* BackupEngine incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless `share_files_with_checksum` is used with `kLegacyCrc32cAndFileSize` naming (discouraged).
+* For `share_files_with_checksum`, we are confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time.
+* For `share_table_files` without "checksum" (not recommended), there is a regression in detecting fundamentally unsafe use of the option, greatly mitigated by file size checking (under "Behavior Changes"). Almost no reason to use `share_files_with_checksum=false` should remain.
+* `DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking are still able to catch corruptions that `CreateNewBackup` does not.

 ### Public API Change
 * Expose kTypeDeleteWithTimestamp in EntryType and update GetEntryType() accordingly.
@@ -30,7 +34,8 @@

 ### Behavior Changes
 * File abstraction `FSRandomAccessFile.Prefetch()` default return status is changed from `OK` to `NotSupported`. If the user inherited file doesn't implement prefetch, RocksDB will create internal prefetch buffer to improve read performance.
-* When retryabel IO error happens during Flush (manifest write error is excluded) and WAL is disabled, originally it is mapped to kHardError. Now,it is mapped to soft error. So DB will not stall the writes unless the memtable is full. At the same time, when auto resume is triggered to recover the retryable IO error during Flush, SwitchMemtable is not called to avoid generating to many small immutable memtables. If WAL is enabled, no behavior changes.
+* When retryable IO error happens during Flush (manifest write error is excluded) and WAL is disabled, originally it is mapped to kHardError. Now,it is mapped to soft error. So DB will not stall the writes unless the memtable is full. At the same time, when auto resume is triggered to recover the retryable IO error during Flush, SwitchMemtable is not called to avoid generating to many small immutable memtables. If WAL is enabled, no behavior changes.
+* When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch.

 ### Others
 * Error in prefetching partitioned index blocks will not be swallowed. It will fail the query and return the IOError users.

@@ -98,7 +98,10 @@ struct BackupableDBOptions {
   // ShareFilesNaming for details on how table files names are made
   // unique between databases.
   //
-  // Default: false
+  // Using 'true' is fundamentally safer, and performance improvements vs.
+  // original design should leave almost no reason to use the 'false' setting.
+  //
+  // Default (only for historical reasons): false
   bool share_files_with_checksum;

   // Up to this many background threads will copy files for CreateNewBackup()

@ -391,9 +391,10 @@ class BackupEngineImpl : public BackupEngine {
uint64_t size_limit = 0, uint64_t size_limit = 0,
std::function<void()> progress_callback = []() {}); std::function<void()> progress_callback = []() {});
Status CalculateChecksum(const std::string& src, Env* src_env, Status ReadFileAndComputeChecksum(const std::string& src, Env* src_env,
const EnvOptions& src_env_options, const EnvOptions& src_env_options,
uint64_t size_limit, std::string* checksum_hex); uint64_t size_limit,
std::string* checksum_hex);
// Obtain db_id and db_session_id from the table properties of file_path // Obtain db_id and db_session_id from the table properties of file_path
Status GetFileDbIdentities(Env* src_env, const EnvOptions& src_env_options, Status GetFileDbIdentities(Env* src_env, const EnvOptions& src_env_options,
@ -1463,8 +1464,8 @@ Status BackupEngineImpl::VerifyBackup(BackupID backup_id,
std::string checksum_hex; std::string checksum_hex;
ROCKS_LOG_INFO(options_.info_log, "Verifying %s checksum...\n", ROCKS_LOG_INFO(options_.info_log, "Verifying %s checksum...\n",
abs_path.c_str()); abs_path.c_str());
CalculateChecksum(abs_path, backup_env_, EnvOptions(), 0 /* size_limit */, ReadFileAndComputeChecksum(abs_path, backup_env_, EnvOptions(),
&checksum_hex); 0 /* size_limit */, &checksum_hex);
if (file_info->checksum_hex != checksum_hex) { if (file_info->checksum_hex != checksum_hex) {
std::string checksum_info( std::string checksum_info(
"Expected checksum is " + file_info->checksum_hex + "Expected checksum is " + file_info->checksum_hex +
@ -1629,7 +1630,7 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
// since the session id should suffice to avoid file name collision in // since the session id should suffice to avoid file name collision in
// the shared_checksum directory. // the shared_checksum directory.
if (!has_checksum && db_session_id.empty()) { if (!has_checksum && db_session_id.empty()) {
s = CalculateChecksum(src_dir + fname, db_env_, src_env_options, s = ReadFileAndComputeChecksum(src_dir + fname, db_env_, src_env_options,
size_limit, &checksum_hex); size_limit, &checksum_hex);
if (!s.ok()) { if (!s.ok()) {
return s; return s;
@ -1701,10 +1702,8 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
need_to_copy = false; need_to_copy = false;
} else if (shared && (same_path || file_exists)) { } else if (shared && (same_path || file_exists)) {
need_to_copy = false; need_to_copy = false;
if (shared_checksum) { auto find_result = backuped_file_infos_.find(dst_relative);
if (backuped_file_infos_.find(dst_relative) == if (find_result == backuped_file_infos_.end() && !same_path) {
backuped_file_infos_.end() &&
!same_path) {
// file exists but not referenced // file exists but not referenced
ROCKS_LOG_INFO( ROCKS_LOG_INFO(
options_.info_log, options_.info_log,
@ -1716,13 +1715,38 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
} else { } else {
// file exists and referenced // file exists and referenced
if (!has_checksum) { if (!has_checksum) {
// FIXME(peterd): extra I/O if (!same_path) {
s = CalculateChecksum(src_dir + fname, db_env_, src_env_options, assert(find_result != backuped_file_infos_.end());
size_limit, &checksum_hex); // Note: to save I/O on incremental backups, we copy prior known
// checksum of the file instead of reading entire file contents
// to recompute it.
checksum_hex = find_result->second->checksum_hex;
has_checksum = true;
// Regarding corruption detection, consider:
// (a) the DB file is corrupt (since previous backup) and the backup
// file is OK: we failed to detect, but the backup is safe. DB can
// be repaired/restored once its corruption is detected.
// (b) the backup file is corrupt (since previous backup) and the
// db file is OK: we failed to detect, but the backup is corrupt.
// CreateNewBackup should support fast incremental backups and
// there's no way to support that without reading all the files.
// We might add an option for extra checks on incremental backup,
// but until then, use VerifyBackups to check existing backup data.
// (c) file name collision with legitimately different content.
// This is almost inconceivable with a well-generated DB session
// ID, but even in that case, we double check the file sizes in
// BackupMeta::AddFile.
} else {
// same_path should not happen for a standard DB, so OK to
// read file contents to check for checksum mismatch between
// two files from same DB getting same name.
s = ReadFileAndComputeChecksum(src_dir + fname, db_env_,
src_env_options, size_limit,
&checksum_hex);
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
has_checksum = true; }
} }
if (!db_session_id.empty()) { if (!db_session_id.empty()) {
ROCKS_LOG_INFO(options_.info_log, ROCKS_LOG_INFO(options_.info_log,
@ -1731,38 +1755,11 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
fname.c_str(), checksum_hex.c_str(), size_bytes, fname.c_str(), checksum_hex.c_str(), size_bytes,
db_session_id.c_str()); db_session_id.c_str());
} else { } else {
ROCKS_LOG_INFO( ROCKS_LOG_INFO(options_.info_log,
options_.info_log,
"%s already present, with checksum %s and size %" PRIu64, "%s already present, with checksum %s and size %" PRIu64,
fname.c_str(), checksum_hex.c_str(), size_bytes); fname.c_str(), checksum_hex.c_str(), size_bytes);
} }
} }
} else if (backuped_file_infos_.find(dst_relative) ==
backuped_file_infos_.end() &&
!same_path) {
// file already exists, but it's not referenced by any backup. overwrite
// the file
ROCKS_LOG_INFO(
options_.info_log,
"%s already present, but not referenced by any backup. We will "
"overwrite the file.",
fname.c_str());
need_to_copy = true;
backup_env_->DeleteFile(final_dest_path);
} else {
// the file is present and referenced by a backup
ROCKS_LOG_INFO(options_.info_log,
"%s already present, calculate checksum", fname.c_str());
if (!has_checksum) {
// FIXME(peterd): extra I/O
s = CalculateChecksum(src_dir + fname, db_env_, src_env_options,
size_limit, &checksum_hex);
if (!s.ok()) {
return s;
}
has_checksum = true;
}
}
} }
live_dst_paths.insert(final_dest_path); live_dst_paths.insert(final_dest_path);
@ -1797,10 +1794,9 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
return s; return s;
} }
Status BackupEngineImpl::CalculateChecksum(const std::string& src, Env* src_env, Status BackupEngineImpl::ReadFileAndComputeChecksum(
const EnvOptions& src_env_options, const std::string& src, Env* src_env, const EnvOptions& src_env_options,
uint64_t size_limit, uint64_t size_limit, std::string* checksum_hex) {
std::string* checksum_hex) {
if (checksum_hex == nullptr) { if (checksum_hex == nullptr) {
return Status::Aborted("Checksum pointer is null"); return Status::Aborted("Checksum pointer is null");
} }
@ -2064,10 +2060,33 @@ Status BackupEngineImpl::BackupMeta::AddFile(
return Status::Corruption("In memory metadata insertion error"); return Status::Corruption("In memory metadata insertion error");
} }
} else { } else {
// Compare sizes, because we scanned that off the filesystem on both
// ends. This is like a check in VerifyBackup.
if (itr->second->size != file_info->size) {
std::string msg = "Size mismatch for existing backup file: ";
msg.append(file_info->filename);
msg.append(" Size in backup is " + ToString(itr->second->size) +
" while size in DB is " + ToString(file_info->size));
msg.append(
" If this DB file checks as not corrupt, try deleting old"
" backups or backing up to a different backup directory.");
return Status::Corruption(msg);
}
// Note: to save I/O, this check will pass trivially on already backed
// up files that don't have the checksum in their name. And it should
// never fail for files that do have checksum in their name.
if (itr->second->checksum_hex != file_info->checksum_hex) { if (itr->second->checksum_hex != file_info->checksum_hex) {
return Status::Corruption( // Should never reach here, but produce an appropriate corruption
"Checksum mismatch for existing backup file. Delete old backups and " // message in case we do in a release build.
"try again."); assert(false);
std::string msg = "Checksum mismatch for existing backup file: ";
msg.append(file_info->filename);
msg.append(" Expected checksum is " + itr->second->checksum_hex +
" while computed checksum is " + file_info->checksum_hex);
msg.append(
" If this DB file checks as not corrupt, try deleting old"
" backups or backing up to a different backup directory.");
return Status::Corruption(msg);
} }
++itr->second->refs; // increase refcount if already present ++itr->second->refs; // increase refcount if already present
} }

@ -304,7 +304,11 @@ class TestEnv : public EnvWrapper {
const std::string& dir, std::vector<Env::FileAttributes>* r) override { const std::string& dir, std::vector<Env::FileAttributes>* r) override {
if (filenames_for_mocked_attrs_.size() > 0) { if (filenames_for_mocked_attrs_.size() > 0) {
for (const auto& filename : filenames_for_mocked_attrs_) { for (const auto& filename : filenames_for_mocked_attrs_) {
r->push_back({dir + filename, 10 /* size_bytes */}); uint64_t size_bytes = 200; // Match TestEnv
if (filename.find("MANIFEST") == 0) {
size_bytes = 100; // Match DummyDB::GetLiveFiles
}
r->push_back({dir + filename, size_bytes});
} }
return Status::OK(); return Status::OK();
} }
@ -316,7 +320,10 @@ class TestEnv : public EnvWrapper {
auto filename_iter = std::find(filenames_for_mocked_attrs_.begin(), auto filename_iter = std::find(filenames_for_mocked_attrs_.begin(),
filenames_for_mocked_attrs_.end(), fname); filenames_for_mocked_attrs_.end(), fname);
if (filename_iter != filenames_for_mocked_attrs_.end()) { if (filename_iter != filenames_for_mocked_attrs_.end()) {
*size_bytes = 10; *size_bytes = 200; // Match TestEnv
if (fname.find("MANIFEST") == 0) {
*size_bytes = 100; // Match DummyDB::GetLiveFiles
}
return Status::OK(); return Status::OK();
} }
return Status::NotFound(fname); return Status::NotFound(fname);
@ -462,6 +469,23 @@ class FileManager : public EnvWrapper {
return WriteToFile(fname, file_contents); return WriteToFile(fname, file_contents);
} }
Status CorruptFileStart(const std::string& fname) {
std::string to_xor = "blah";
std::string file_contents;
Status s = ReadFileToString(this, fname, &file_contents);
if (!s.ok()) {
return s;
}
s = DeleteFile(fname);
if (!s.ok()) {
return s;
}
for (size_t i = 0; i < to_xor.size(); ++i) {
file_contents[i] ^= to_xor[i];
}
return WriteToFile(fname, file_contents);
}
Status CorruptChecksum(const std::string& fname, bool appear_valid) { Status CorruptChecksum(const std::string& fname, bool appear_valid) {
std::string metadata; std::string metadata;
Status s = ReadFileToString(this, fname, &metadata); Status s = ReadFileToString(this, fname, &metadata);
@ -594,6 +618,7 @@ class BackupableDBTest : public testing::Test {
test_db_env_.reset(new TestEnv(db_chroot_env_.get())); test_db_env_.reset(new TestEnv(db_chroot_env_.get()));
test_backup_env_.reset(new TestEnv(backup_chroot_env_.get())); test_backup_env_.reset(new TestEnv(backup_chroot_env_.get()));
file_manager_.reset(new FileManager(backup_chroot_env_.get())); file_manager_.reset(new FileManager(backup_chroot_env_.get()));
db_file_manager_.reset(new FileManager(db_chroot_env_.get()));
// set up db options // set up db options
options_.create_if_missing = true; options_.create_if_missing = true;
@ -724,29 +749,47 @@ class BackupableDBTest : public testing::Test {
} }
} }
Status CorruptRandomTableFileInDB() { Status GetTableFilesInDB(std::vector<FileAttributes>* table_files) {
Random rnd(6);
std::vector<FileAttributes> children; std::vector<FileAttributes> children;
test_db_env_->GetChildrenFileAttributes(dbname_, &children); Status s = test_db_env_->GetChildrenFileAttributes(dbname_, &children);
if (children.size() <= 2) { // . and .. for (const auto& child : children) {
if (child.size_bytes > 0 && child.name.size() > 4 &&
child.name.rfind(".sst") == child.name.length() - 4) {
table_files->push_back(child);
}
}
return s;
}
Status GetRandomTableFileInDB(std::string* fname_out,
uint64_t* fsize_out = nullptr) {
Random rnd(6); // NB: hardly "random"
std::vector<FileAttributes> table_files;
Status s = GetTableFilesInDB(&table_files);
if (!s.ok()) {
return s;
}
if (table_files.empty()) {
return Status::NotFound(""); return Status::NotFound("");
} }
size_t i = rnd.Uniform(static_cast<int>(table_files.size()));
*fname_out = dbname_ + "/" + table_files[i].name;
if (fsize_out) {
*fsize_out = table_files[i].size_bytes;
}
return Status::OK();
}
Status CorruptRandomTableFileInDB() {
std::string fname; std::string fname;
uint64_t fsize = 0; uint64_t fsize = 0;
while (true) { Status s = GetRandomTableFileInDB(&fname, &fsize);
int i = rnd.Next() % children.size(); if (!s.ok()) {
fname = children[i].name; return s;
fsize = children[i].size_bytes;
// find an sst file
if (fsize > 0 && fname.length() > 4 &&
fname.rfind(".sst") == fname.length() - 4) {
fname = dbname_ + "/" + fname;
break;
}
} }
std::string file_contents; std::string file_contents;
Status s = ReadFileToString(test_db_env_.get(), fname, &file_contents); s = ReadFileToString(test_db_env_.get(), fname, &file_contents);
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
@ -812,6 +855,7 @@ class BackupableDBTest : public testing::Test {
std::unique_ptr<TestEnv> test_db_env_; std::unique_ptr<TestEnv> test_db_env_;
std::unique_ptr<TestEnv> test_backup_env_; std::unique_ptr<TestEnv> test_backup_env_;
std::unique_ptr<FileManager> file_manager_; std::unique_ptr<FileManager> file_manager_;
std::unique_ptr<FileManager> db_file_manager_;
// all the dbs! // all the dbs!
DummyDB* dummy_db_; // BackupableDB owns dummy_db_ DummyDB* dummy_db_; // BackupableDB owns dummy_db_
@ -1632,8 +1676,8 @@ TEST_F(BackupableDBTest, FailOverwritingBackups) {
CloseDBAndBackupEngine(); CloseDBAndBackupEngine();
DeleteLogFiles(); DeleteLogFiles();
OpenDBAndBackupEngine(false); OpenDBAndBackupEngine(false);
FillDB(db_.get(), 100 * i, 100 * (i + 1)); FillDB(db_.get(), 100 * i, 100 * (i + 1), kFlushAll);
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), true)); ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
} }
CloseDBAndBackupEngine(); CloseDBAndBackupEngine();
@ -1643,19 +1687,20 @@ TEST_F(BackupableDBTest, FailOverwritingBackups) {
CloseBackupEngine(); CloseBackupEngine();
OpenDBAndBackupEngine(false); OpenDBAndBackupEngine(false);
FillDB(db_.get(), 0, 300); // More data, bigger SST
Status s = backup_engine_->CreateNewBackup(db_.get(), true); FillDB(db_.get(), 1000, 1300, kFlushAll);
Status s = backup_engine_->CreateNewBackup(db_.get());
// the new backup fails because new table files // the new backup fails because new table files
// clash with old table files from backups 4 and 5 // clash with old table files from backups 4 and 5
// (since write_buffer_size is huge, we can be sure that // (since write_buffer_size is huge, we can be sure that
// each backup will generate only one sst file and that // each backup will generate only one sst file and that
// a file generated by a new backup is the same as // a file generated here would have the same name as an
// sst file generated by backup 4) // sst file generated by backup 4, and will be bigger)
ASSERT_TRUE(s.IsCorruption()); ASSERT_TRUE(s.IsCorruption());
ASSERT_OK(backup_engine_->DeleteBackup(4)); ASSERT_OK(backup_engine_->DeleteBackup(4));
ASSERT_OK(backup_engine_->DeleteBackup(5)); ASSERT_OK(backup_engine_->DeleteBackup(5));
// now, the backup can succeed // now, the backup can succeed
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), true)); ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
CloseDBAndBackupEngine(); CloseDBAndBackupEngine();
} }
@ -1863,6 +1908,235 @@ TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsOldFileNaming) {
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks(); ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
} }
// Test how naming options interact with detecting DB corruption
// between incremental backups
TEST_F(BackupableDBTest, TableFileCorruptionBeforeIncremental) {
const auto share_no_checksum = static_cast<ShareFilesNaming>(0);
for (bool corrupt_before_first_backup : {false, true}) {
for (ShareFilesNaming option :
{share_no_checksum, kLegacyCrc32cAndFileSize, kNamingDefault}) {
auto share =
option == share_no_checksum ? kShareNoChecksum : kShareWithChecksum;
if (option != share_no_checksum) {
backupable_options_->share_files_with_checksum_naming = option;
}
OpenDBAndBackupEngine(true, false, share);
DBImpl* dbi = static_cast<DBImpl*>(db_.get());
// A small SST file
ASSERT_OK(dbi->Put(WriteOptions(), "x", "y"));
ASSERT_OK(dbi->Flush(FlushOptions()));
// And a bigger one
ASSERT_OK(dbi->Put(WriteOptions(), "y", Random(42).RandomString(500)));
ASSERT_OK(dbi->Flush(FlushOptions()));
dbi->TEST_WaitForFlushMemTable();
CloseDBAndBackupEngine();
std::vector<FileAttributes> table_files;
ASSERT_OK(GetTableFilesInDB(&table_files));
ASSERT_EQ(table_files.size(), 2);
std::string tf0 = dbname_ + "/" + table_files[0].name;
std::string tf1 = dbname_ + "/" + table_files[1].name;
if (corrupt_before_first_backup) {
// This corrupts a data block, which does not cause DB open
// failure, only failure on accessing the block.
ASSERT_OK(db_file_manager_->CorruptFileStart(tf0));
}
OpenDBAndBackupEngine(false, false, share);
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
CloseDBAndBackupEngine();
// if corrupt_before_first_backup, this undoes the initial corruption
ASSERT_OK(db_file_manager_->CorruptFileStart(tf0));
OpenDBAndBackupEngine(false, false, share);
Status s = backup_engine_->CreateNewBackup(db_.get());
// Even though none of the naming options catch the inconsistency
// between the first and second time backing up fname, in the case
// of kUseDbSessionId (kNamingDefault), this is an intentional
// trade-off to avoid full scan of files from the DB that are
// already backed up. If we did the scan, kUseDbSessionId could catch
// the corruption. kLegacyCrc32cAndFileSize does the scan (to
// compute checksum for name) without catching the corruption,
// because the corruption means the names don't merge.
EXPECT_OK(s);
// VerifyBackup doesn't check DB integrity or table file internal
// checksums
EXPECT_OK(backup_engine_->VerifyBackup(1, true));
EXPECT_OK(backup_engine_->VerifyBackup(2, true));
db_.reset();
ASSERT_OK(backup_engine_->RestoreDBFromBackup(2, dbname_, dbname_));
{
DB* db = OpenDB();
s = db->VerifyChecksum();
delete db;
}
if (option != kLegacyCrc32cAndFileSize && !corrupt_before_first_backup) {
// Second backup is OK because it used (uncorrupt) file from first
// backup instead of (corrupt) file from DB.
// This is arguably a good trade-off vs. treating the file as distinct
// from the old version, because a file should be more likely to be
// corrupt as it ages. Although the backed-up file might also corrupt
// with age, the alternative approach (checksum in file name computed
// from current DB file contents) wouldn't detect that case at backup
// time either. Although you would have both copies of the file with
// the alternative approach, that would only last until the older
// backup is deleted.
ASSERT_OK(s);
} else if (option == kLegacyCrc32cAndFileSize &&
corrupt_before_first_backup) {
// Second backup is OK because it saved the updated (uncorrupt)
// file from DB, instead of the sharing with first backup.
// Recall: if corrupt_before_first_backup, [second CorruptFileStart]
// undoes the initial corruption.
// This is arguably a bad trade-off vs. sharing the old version of the
// file because a file should be more likely to corrupt as it ages.
// (Not likely that the previously backed-up version was already
// corrupt and the new version is non-corrupt. This approach doesn't
// help if backed-up version is corrupted after taking the backup.)
ASSERT_OK(s);
} else {
// Something is legitimately corrupted, but we can't be sure what
// with information available (TODO? unless one passes block checksum
// test and other doesn't. Probably better to use end-to-end full file
// checksum anyway.)
ASSERT_TRUE(s.IsCorruption());
}
CloseDBAndBackupEngine();
ASSERT_OK(DestroyDB(dbname_, options_));
}
}
}
// Test how naming options interact with detecting file size corruption
// between incremental backups
TEST_F(BackupableDBTest, FileSizeForIncremental) {
const auto share_no_checksum = static_cast<ShareFilesNaming>(0);
for (ShareFilesNaming option : {share_no_checksum, kLegacyCrc32cAndFileSize,
kNamingDefault, kUseDbSessionId}) {
auto share =
option == share_no_checksum ? kShareNoChecksum : kShareWithChecksum;
if (option != share_no_checksum) {
backupable_options_->share_files_with_checksum_naming = option;
}
OpenDBAndBackupEngine(true, false, share);
std::vector<FileAttributes> children;
const std::string shared_dir =
backupdir_ +
(option == share_no_checksum ? "/shared" : "/shared_checksum");
// A single small SST file
ASSERT_OK(db_->Put(WriteOptions(), "x", "y"));
// First, test that we always detect file size corruption on the shared
// backup side on incremental. (Since sizes aren't really part of backup
// meta file, this works by querying the filesystem for the sizes.)
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), true /*flush*/));
CloseDBAndBackupEngine();
// Corrupt backup SST
ASSERT_OK(file_manager_->GetChildrenFileAttributes(shared_dir, &children));
ASSERT_EQ(children.size(), 3U); // ".", "..", one sst
for (const auto& child : children) {
if (child.name.size() > 4 && child.size_bytes > 0) {
ASSERT_OK(
file_manager_->WriteToFile(shared_dir + "/" + child.name, "asdf"));
break;
}
}
OpenDBAndBackupEngine(false, false, share);
Status s = backup_engine_->CreateNewBackup(db_.get());
EXPECT_TRUE(s.IsCorruption());
ASSERT_OK(backup_engine_->PurgeOldBackups(0));
CloseDBAndBackupEngine();
// Second, test that a hypothetical db session id collision would likely
// not suffice to corrupt a backup, because there's a good chance of
// file size difference (in this test, guaranteed) so either no name
// collision or detected collision.
// Create backup 1
OpenDBAndBackupEngine(false, false, share);
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
// Even though we have "the same" DB state as backup 1, we need
// to restore to recreate the same conditions as later restore.
db_.reset();
ASSERT_OK(DestroyDB(dbname_, options_));
ASSERT_OK(backup_engine_->RestoreDBFromBackup(1, dbname_, dbname_));
CloseDBAndBackupEngine();
// Forge session id
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"DBImpl::SetDbSessionId", [](void* sid_void_star) {
std::string* sid = static_cast<std::string*>(sid_void_star);
*sid = "01234567890123456789";
});
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
// Create another SST file
OpenDBAndBackupEngine(false, false, share);
ASSERT_OK(db_->Put(WriteOptions(), "y", "x"));
// Create backup 2
ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), true /*flush*/));
// Restore backup 1 (again)
db_.reset();
ASSERT_OK(DestroyDB(dbname_, options_));
ASSERT_OK(backup_engine_->RestoreDBFromBackup(1, dbname_, dbname_));
CloseDBAndBackupEngine();
// Create another SST file with same number and db session id, only bigger
OpenDBAndBackupEngine(false, false, share);
ASSERT_OK(db_->Put(WriteOptions(), "y", Random(42).RandomString(500)));
// Count backup SSTs
children.clear();
ASSERT_OK(file_manager_->GetChildrenFileAttributes(shared_dir, &children));
ASSERT_EQ(children.size(), 4U); // ".", "..", two sst
// Try create backup 3
s = backup_engine_->CreateNewBackup(db_.get(), true /*flush*/);
// Re-count backup SSTs
children.clear();
ASSERT_OK(file_manager_->GetChildrenFileAttributes(shared_dir, &children));
if (option == kUseDbSessionId) {
// Acceptable to call it corruption if size is not in name and
// db session id collision is practically impossible.
EXPECT_TRUE(s.IsCorruption());
EXPECT_EQ(children.size(), 4U); // no SST added
} else if (option == share_no_checksum) {
// Good to call it corruption if both backups cannot be
// accommodated.
EXPECT_TRUE(s.IsCorruption());
EXPECT_EQ(children.size(), 4U); // no SST added
} else {
// Since opening a DB seems sufficient for detecting size corruption
// on the DB side, this should be a good thing, ...
EXPECT_OK(s);
// ... as long as we did actually treat it as a distinct SST file.
EXPECT_EQ(children.size(), 5U); // Another SST added
}
CloseDBAndBackupEngine();
ASSERT_OK(DestroyDB(dbname_, options_));
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
}
}
// Verify backup and restore with share_files_with_checksum off and then // Verify backup and restore with share_files_with_checksum off and then
// transition this option to on and share_files_with_checksum_naming to be // transition this option to on and share_files_with_checksum_naming to be
// based on kUseDbSessionId // based on kUseDbSessionId