Less I/O for incremental backups, slightly better corruption detection (#7413)

Summary:
Two relatively simple functional changes to incremental backup behavior, integrated with a minor refactoring to reduce code redundancy and improve error/log message. There are nuances to the impact of these changes, but I believe they are fundamentally good and generally safe.

Those functional changes:
* Incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless `share_files_with_checksum` is used with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file checksums are needed to determine file naming.
    * Justification: incremental backups should not need to read the whole DB, especially without rate limiting. (Although other BackupEngine reads are not rate limited either, other non-trivial reads are generally limited by a corresponding write, as in copying files.) Also, the fact that this is not already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110.
* When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch.
    * Justification: a random related fix that also helps to cover a small hole in corruption checking uncovered by the other functional change:
        * For `share_table_files` without "checksum" (not recommended), the other change regresses in detecting fundamentally unsafe use of this option combination: when you might generate different versions of same SST file number. As demonstrated by `BackupableDBTest.FailOverwritingBackups`, this regression is greatly mitigated by the new file size checking.

Nevertheless, almost no reason to use `share_files_with_checksum=false` should remain, and comments are updated appropriately.

Also, this change renames internal function `CalculateChecksum` to `ReadFileAndComputeChecksum` to make the performance impact of this function clear in code reviews.

It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it cannot be true for a DB with unique file names (like DBImpl). Nevertheless, I've tried to keep its functionality intact when `true` to minimize risk for now, despite having no unit tests for which it is true.

Select impact details (much more in unit tests):
For `share_files_with_checksum`, I am confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time. (With computed checksums in names, a recently corrupted file just looked like a different file vs. what was already backed up.)

Even in the hypothetical case of DB session id collision (~100 bits entropy collision), file size in name and/or our file size check add an extra layer of protection against false success in creating an accurate new backup. (Unit test included.)

`DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking are still able to catch corruptions that `CreateNewBackup` does not. Note that when custom file checksum support is added to BackupEngine, that will essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`. We could add options for `CreateNewBackup` to cover some of what would be caught by `VerifyBackup` with checksum checking.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413

Test Plan:
Two new unit tests included, both of which fail without these changes. Although we don't test the I/O improvement directly, we test it indirectly in DB corruption detection power that was inadvertently unlocked with new backup file naming PLUS computing current content checksums (now removed). (I don't think that case of DB corruption detection justifies reading the whole DB on incremental backup.)

Reviewed By: zhichao-cao

Differential Revision: D23818480

Pulled By: pdillinger

fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac
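
The two functional changes in the summary can be pictured with a small standalone model. This is only a sketch under stated assumptions: `FileInfo`, `BackedUpFiles`, `AddResult`, and `AddSharedFile` are hypothetical names invented for illustration and are not BackupEngine APIs. In the real change, the checksum reuse lives in `AddBackupFileWorkItem` and the size comparison in `BackupMeta::AddFile`, and the shortcut does not apply to `kLegacyCrc32cAndFileSize` naming, where the checksum is needed to form the file name.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins for BackupEngine internals.
struct FileInfo {
  uint64_t size;
  std::string checksum_hex;
};

enum class AddResult { kCopied, kShared, kCorruption };

// Maps a shared destination name to what the backup directory already holds.
using BackedUpFiles = std::map<std::string, FileInfo>;

// Decide what to do with one DB table file during an incremental backup.
// `src_size` models a cheap filesystem size query. `compute_checksum`
// models the expensive full-file read and is only called when needed.
template <typename ChecksumFn>
AddResult AddSharedFile(BackedUpFiles& backed_up, const std::string& dst_name,
                        uint64_t src_size, ChecksumFn compute_checksum) {
  assert(!dst_name.empty());
  auto it = backed_up.find(dst_name);
  if (it == backed_up.end()) {
    // Not backed up yet: the file has to be read anyway to copy it.
    backed_up[dst_name] = FileInfo{src_size, compute_checksum()};
    return AddResult::kCopied;
  }
  // Already backed up: a size mismatch means the old backup, the backup
  // in progress, or the current DB is corrupt.
  if (it->second.size != src_size) {
    return AddResult::kCorruption;
  }
  // Sizes agree: reuse the recorded checksum instead of re-reading the file.
  return AddResult::kShared;
}
```

In this model, backing up the same name twice with the same size invokes `compute_checksum` only once, while a second call with a different size returns `kCorruption` without reading anything.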
2020-09-21 16:18:11 -07:00
commit 9d8eb77c4d
parent 52691703fc
4 changed files with 391 additions and 92 deletions
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -32,6 +32,10 @@
 ### Performance Improvements
 * Reduce thread number for multiple DB instances by re-using one global thread for statistics dumping and persisting.
 * Reduce write-amp in heavy write bursts in `kCompactionStyleLevel` compaction style with `level_compaction_dynamic_level_bytes` set.
+* BackupEngine incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless `share_files_with_checksum` is used with `kLegacyCrc32cAndFileSize` naming (discouraged).
+  * For `share_files_with_checksum`, we are confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time.
+  * For `share_table_files` without "checksum" (not recommended), there is a regression in detecting fundamentally unsafe use of the option, greatly mitigated by file size checking (under "Behavior Changes"). Almost no reason to use `share_files_with_checksum=false` should remain.
+  * `DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking are still able to catch corruptions that `CreateNewBackup` does not.

 ### Public API Change
 * Expose kTypeDeleteWithTimestamp in EntryType and update GetEntryType() accordingly.
@@ -42,9 +46,8 @@

 ### Behavior Changes
 * File abstraction `FSRandomAccessFile.Prefetch()` default return status is changed from `OK` to `NotSupported`. If the user inherited file doesn't implement prefetch, RocksDB will create internal prefetch buffer to improve read performance.
-
-### Behavior Changes
-* When retryabel IO error happens during Flush (manifest write error is excluded) and WAL is disabled, originally it is mapped to kHardError. Now,it is mapped to soft error. So DB will not stall the writes unless the memtable is full. At the same time, when auto resume is triggered to recover the retryable IO error during Flush, SwitchMemtable is not called to avoid generating to many small immutable memtables. If WAL is enabled, no behavior changes. 
+* When retryabel IO error happens during Flush (manifest write error is excluded) and WAL is disabled, originally it is mapped to kHardError. Now,it is mapped to soft error. So DB will not stall the writes unless the memtable is full. At the same time, when auto resume is triggered to recover the retryable IO error during Flush, SwitchMemtable is not called to avoid generating to many small immutable memtables. If WAL is enabled, no behavior changes.
+* When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch.

 ### Others
 * Error in prefetching partitioned index blocks will not be swallowed. It will fail the query and return the IOError users.
--- a/include/rocksdb/utilities/backupable_db.h
+++ b/include/rocksdb/utilities/backupable_db.h
@@ -98,7 +98,10 @@ struct BackupableDBOptions {
  // ShareFilesNaming for details on how table files names are made
  // unique between databases.
  //
-  // Default: false
+  // Using 'true' is fundamentally safer, and performance improvements vs.
+  // original design should leave almost no reason to use the 'false' setting.
+  //
+  // Default (only for historical reasons): false
  bool share_files_with_checksum;

  // Up to this many background threads will copy files for CreateNewBackup()
--- a/utilities/backupable/backupable_db.cc
+++ b/utilities/backupable/backupable_db.cc
@@ -391,9 +391,10 @@ class BackupEngineImpl : public BackupEngine {
                          uint64_t size_limit = 0,
                          std::function<void()> progress_callback = []() {});

-  Status CalculateChecksum(const std::string& src, Env* src_env,
-                           const EnvOptions& src_env_options,
-                           uint64_t size_limit, std::string* checksum_hex);
+  Status ReadFileAndComputeChecksum(const std::string& src, Env* src_env,
+                                    const EnvOptions& src_env_options,
+                                    uint64_t size_limit,
+                                    std::string* checksum_hex);

  // Obtain db_id and db_session_id from the table properties of file_path
  Status GetFileDbIdentities(Env* src_env, const EnvOptions& src_env_options,
@@ -1463,8 +1464,8 @@ Status BackupEngineImpl::VerifyBackup(BackupID backup_id,
      std::string checksum_hex;
      ROCKS_LOG_INFO(options_.info_log, "Verifying %s checksum...\n",
                     abs_path.c_str());
-      CalculateChecksum(abs_path, backup_env_, EnvOptions(), 0 /* size_limit */,
-                        &checksum_hex);
+      ReadFileAndComputeChecksum(abs_path, backup_env_, EnvOptions(),
+                                 0 /* size_limit */, &checksum_hex);
      if (file_info->checksum_hex != checksum_hex) {
        std::string checksum_info(
            "Expected checksum is " + file_info->checksum_hex +
@@ -1629,8 +1630,8 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
    // since the session id should suffice to avoid file name collision in
    // the shared_checksum directory.
    if (!has_checksum && db_session_id.empty()) {
-      s = CalculateChecksum(src_dir + fname, db_env_, src_env_options,
-                            size_limit, &checksum_hex);
+      s = ReadFileAndComputeChecksum(src_dir + fname, db_env_, src_env_options,
+                                     size_limit, &checksum_hex);
      if (!s.ok()) {
        return s;
      }
@@ -1701,47 +1702,9 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
    need_to_copy = false;
  } else if (shared && (same_path || file_exists)) {
    need_to_copy = false;
-    if (shared_checksum) {
-      if (backuped_file_infos_.find(dst_relative) ==
-              backuped_file_infos_.end() &&
-          !same_path) {
-        // file exists but not referenced
-        ROCKS_LOG_INFO(
-            options_.info_log,
-            "%s already present, but not referenced by any backup. We will "
-            "overwrite the file.",
-            fname.c_str());
-        need_to_copy = true;
-        backup_env_->DeleteFile(final_dest_path);
-      } else {
-        // file exists and referenced
-        if (!has_checksum) {
-          // FIXME(peterd): extra I/O
-          s = CalculateChecksum(src_dir + fname, db_env_, src_env_options,
-                                size_limit, &checksum_hex);
-          if (!s.ok()) {
-            return s;
-          }
-          has_checksum = true;
-        }
-        if (!db_session_id.empty()) {
-          ROCKS_LOG_INFO(options_.info_log,
-                         "%s already present, with checksum %s, size %" PRIu64
-                         " and DB session identity %s",
-                         fname.c_str(), checksum_hex.c_str(), size_bytes,
-                         db_session_id.c_str());
-        } else {
-          ROCKS_LOG_INFO(
-              options_.info_log,
-              "%s already present, with checksum %s and size %" PRIu64,
-              fname.c_str(), checksum_hex.c_str(), size_bytes);
-        }
-      }
-    } else if (backuped_file_infos_.find(dst_relative) ==
-                   backuped_file_infos_.end() &&
-               !same_path) {
-      // file already exists, but it's not referenced by any backup. overwrite
-      // the file
+    auto find_result = backuped_file_infos_.find(dst_relative);
+    if (find_result == backuped_file_infos_.end() && !same_path) {
+      // file exists but not referenced
      ROCKS_LOG_INFO(
          options_.info_log,
          "%s already present, but not referenced by any backup. We will "
@@ -1750,17 +1713,51 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
      need_to_copy = true;
      backup_env_->DeleteFile(final_dest_path);
    } else {
-      // the file is present and referenced by a backup
-      ROCKS_LOG_INFO(options_.info_log,
-                     "%s already present, calculate checksum", fname.c_str());
+      // file exists and referenced
      if (!has_checksum) {
-        // FIXME(peterd): extra I/O
-        s = CalculateChecksum(src_dir + fname, db_env_, src_env_options,
-                              size_limit, &checksum_hex);
-        if (!s.ok()) {
-          return s;
+        if (!same_path) {
+          assert(find_result != backuped_file_infos_.end());
+          // Note: to save I/O on incremental backups, we copy prior known
+          // checksum of the file instead of reading entire file contents
+          // to recompute it.
+          checksum_hex = find_result->second->checksum_hex;
+          has_checksum = true;
+          // Regarding corruption detection, consider:
+          // (a) the DB file is corrupt (since previous backup) and the backup
+          // file is OK: we failed to detect, but the backup is safe. DB can
+          // be repaired/restored once its corruption is detected.
+          // (b) the backup file is corrupt (since previous backup) and the
+          // db file is OK: we failed to detect, but the backup is corrupt.
+          // CreateNewBackup should support fast incremental backups and
+          // there's no way to support that without reading all the files.
+          // We might add an option for extra checks on incremental backup,
+          // but until then, use VerifyBackups to check existing backup data.
+          // (c) file name collision with legitimately different content.
+          // This is almost inconceivable with a well-generated DB session
+          // ID, but even in that case, we double check the file sizes in
+          // BackupMeta::AddFile.
+        } else {
+          // same_path should not happen for a standard DB, so OK to
+          // read file contents to check for checksum mismatch between
+          // two files from same DB getting same name.
+          s = ReadFileAndComputeChecksum(src_dir + fname, db_env_,
+                                         src_env_options, size_limit,
+                                         &checksum_hex);
+          if (!s.ok()) {
+            return s;
+          }
        }
-        has_checksum = true;
+      }
+      if (!db_session_id.empty()) {
+        ROCKS_LOG_INFO(options_.info_log,
+                       "%s already present, with checksum %s, size %" PRIu64
+                       " and DB session identity %s",
+                       fname.c_str(), checksum_hex.c_str(), size_bytes,
+                       db_session_id.c_str());
+      } else {
+        ROCKS_LOG_INFO(options_.info_log,
+                       "%s already present, with checksum %s and size %" PRIu64,
+                       fname.c_str(), checksum_hex.c_str(), size_bytes);
      }
    }
  }
@@ -1797,10 +1794,9 @@ Status BackupEngineImpl::AddBackupFileWorkItem(
  return s;
 }

-Status BackupEngineImpl::CalculateChecksum(const std::string& src, Env* src_env,
-                                           const EnvOptions& src_env_options,
-                                           uint64_t size_limit,
-                                           std::string* checksum_hex) {
+Status BackupEngineImpl::ReadFileAndComputeChecksum(
+    const std::string& src, Env* src_env, const EnvOptions& src_env_options,
+    uint64_t size_limit, std::string* checksum_hex) {
  if (checksum_hex == nullptr) {
    return Status::Aborted("Checksum pointer is null");
  }
@@ -2064,10 +2060,33 @@ Status BackupEngineImpl::BackupMeta::AddFile(
      return Status::Corruption("In memory metadata insertion error");
    }
  } else {
+    // Compare sizes, because we scanned that off the filesystem on both
+    // ends. This is like a check in VerifyBackup.
+    if (itr->second->size != file_info->size) {
+      std::string msg = "Size mismatch for existing backup file: ";
+      msg.append(file_info->filename);
+      msg.append(" Size in backup is " + ToString(itr->second->size) +
+                 " while size in DB is " + ToString(file_info->size));
+      msg.append(
+          " If this DB file checks as not corrupt, try deleting old"
+          " backups or backing up to a different backup directory.");
+      return Status::Corruption(msg);
+    }
+    // Note: to save I/O, this check will pass trivially on already backed
+    // up files that don't have the checksum in their name. And it should
+    // never fail for files that do have checksum in their name.
    if (itr->second->checksum_hex != file_info->checksum_hex) {
-      return Status::Corruption(
-          "Checksum mismatch for existing backup file. Delete old backups and "
-          "try again.");
+      // Should never reach here, but produce an appropriate corruption
+      // message in case we do in a release build.
+      assert(false);
+      std::string msg = "Checksum mismatch for existing backup file: ";
+      msg.append(file_info->filename);
+      msg.append(" Expected checksum is " + itr->second->checksum_hex +
+                 " while computed checksum is " + file_info->checksum_hex);
+      msg.append(
+          " If this DB file checks as not corrupt, try deleting old"
+          " backups or backing up to a different backup directory.");
+      return Status::Corruption(msg);
    }
    ++itr->second->refs;  // increase refcount if already present
  }
--- a/utilities/backupable/backupable_db_test.cc
+++ b/utilities/backupable/backupable_db_test.cc
@@ -304,7 +304,11 @@ class TestEnv : public EnvWrapper {
      const std::string& dir, std::vector<Env::FileAttributes>* r) override {
    if (filenames_for_mocked_attrs_.size() > 0) {
      for (const auto& filename : filenames_for_mocked_attrs_) {
-        r->push_back({dir + filename, 10 /* size_bytes */});
+        uint64_t size_bytes = 200;  // Match TestEnv
+        if (filename.find("MANIFEST") == 0) {
+          size_bytes = 100;  // Match DummyDB::GetLiveFiles
+        }
+        r->push_back({dir + filename, size_bytes});
      }
      return Status::OK();
    }
@@ -316,7 +320,10 @@ class TestEnv : public EnvWrapper {
      auto filename_iter = std::find(filenames_for_mocked_attrs_.begin(),
                                     filenames_for_mocked_attrs_.end(), fname);
      if (filename_iter != filenames_for_mocked_attrs_.end()) {
-        *size_bytes = 10;
+        *size_bytes = 200;  // Match TestEnv
+        if (fname.find("MANIFEST") == 0) {
+          *size_bytes = 100;  // Match DummyDB::GetLiveFiles
+        }
        return Status::OK();
      }
      return Status::NotFound(fname);
@@ -462,6 +469,23 @@ class FileManager : public EnvWrapper {
    return WriteToFile(fname, file_contents);
  }

+  Status CorruptFileStart(const std::string& fname) {
+    std::string to_xor = "blah";
+    std::string file_contents;
+    Status s = ReadFileToString(this, fname, &file_contents);
+    if (!s.ok()) {
+      return s;
+    }
+    s = DeleteFile(fname);
+    if (!s.ok()) {
+      return s;
+    }
+    for (size_t i = 0; i < to_xor.size(); ++i) {
+      file_contents[i] ^= to_xor[i];
+    }
+    return WriteToFile(fname, file_contents);
+  }
+
  Status CorruptChecksum(const std::string& fname, bool appear_valid) {
    std::string metadata;
    Status s = ReadFileToString(this, fname, &metadata);
@@ -594,6 +618,7 @@ class BackupableDBTest : public testing::Test {
    test_db_env_.reset(new TestEnv(db_chroot_env_.get()));
    test_backup_env_.reset(new TestEnv(backup_chroot_env_.get()));
    file_manager_.reset(new FileManager(backup_chroot_env_.get()));
+    db_file_manager_.reset(new FileManager(db_chroot_env_.get()));

    // set up db options
    options_.create_if_missing = true;
@@ -724,29 +749,47 @@ class BackupableDBTest : public testing::Test {
    }
  }

-  Status CorruptRandomTableFileInDB() {
-    Random rnd(6);
+  Status GetTableFilesInDB(std::vector<FileAttributes>* table_files) {
    std::vector<FileAttributes> children;
-    test_db_env_->GetChildrenFileAttributes(dbname_, &children);
-    if (children.size() <= 2) {  // . and ..
+    Status s = test_db_env_->GetChildrenFileAttributes(dbname_, &children);
+    for (const auto& child : children) {
+      if (child.size_bytes > 0 && child.name.size() > 4 &&
+          child.name.rfind(".sst") == child.name.length() - 4) {
+        table_files->push_back(child);
+      }
+    }
+    return s;
+  }
+
+  Status GetRandomTableFileInDB(std::string* fname_out,
+                                uint64_t* fsize_out = nullptr) {
+    Random rnd(6);  // NB: hardly "random"
+    std::vector<FileAttributes> table_files;
+    Status s = GetTableFilesInDB(&table_files);
+    if (!s.ok()) {
+      return s;
+    }
+    if (table_files.empty()) {
      return Status::NotFound("");
    }
+    size_t i = rnd.Uniform(static_cast<int>(table_files.size()));
+    *fname_out = dbname_ + "/" + table_files[i].name;
+    if (fsize_out) {
+      *fsize_out = table_files[i].size_bytes;
+    }
+    return Status::OK();
+  }
+
+  Status CorruptRandomTableFileInDB() {
    std::string fname;
    uint64_t fsize = 0;
-    while (true) {
-      int i = rnd.Next() % children.size();
-      fname = children[i].name;
-      fsize = children[i].size_bytes;
-      // find an sst file
-      if (fsize > 0 && fname.length() > 4 &&
-          fname.rfind(".sst") == fname.length() - 4) {
-        fname = dbname_ + "/" + fname;
-        break;
-      }
+    Status s = GetRandomTableFileInDB(&fname, &fsize);
+    if (!s.ok()) {
+      return s;
    }

    std::string file_contents;
-    Status s = ReadFileToString(test_db_env_.get(), fname, &file_contents);
+    s = ReadFileToString(test_db_env_.get(), fname, &file_contents);
    if (!s.ok()) {
      return s;
    }
@@ -812,6 +855,7 @@ class BackupableDBTest : public testing::Test {
  std::unique_ptr<TestEnv> test_db_env_;
  std::unique_ptr<TestEnv> test_backup_env_;
  std::unique_ptr<FileManager> file_manager_;
+  std::unique_ptr<FileManager> db_file_manager_;

  // all the dbs!
  DummyDB* dummy_db_; // BackupableDB owns dummy_db_
@@ -1632,8 +1676,8 @@ TEST_F(BackupableDBTest, FailOverwritingBackups) {
    CloseDBAndBackupEngine();
    DeleteLogFiles();
    OpenDBAndBackupEngine(false);
-    FillDB(db_.get(), 100 * i, 100 * (i + 1));
-    ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), true));
+    FillDB(db_.get(), 100 * i, 100 * (i + 1), kFlushAll);
+    ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
  }
  CloseDBAndBackupEngine();

@@ -1643,19 +1687,20 @@ TEST_F(BackupableDBTest, FailOverwritingBackups) {
  CloseBackupEngine();

  OpenDBAndBackupEngine(false);
-  FillDB(db_.get(), 0, 300);
-  Status s = backup_engine_->CreateNewBackup(db_.get(), true);
+  // More data, bigger SST
+  FillDB(db_.get(), 1000, 1300, kFlushAll);
+  Status s = backup_engine_->CreateNewBackup(db_.get());
  // the new backup fails because new table files
  // clash with old table files from backups 4 and 5
  // (since write_buffer_size is huge, we can be sure that
  // each backup will generate only one sst file and that
-  // a file generated by a new backup is the same as
-  // sst file generated by backup 4)
+  // a file generated here would have the same name as an
+  // sst file generated by backup 4, and will be bigger)
  ASSERT_TRUE(s.IsCorruption());
  ASSERT_OK(backup_engine_->DeleteBackup(4));
  ASSERT_OK(backup_engine_->DeleteBackup(5));
  // now, the backup can succeed
-  ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), true));
+  ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
  CloseDBAndBackupEngine();
 }

@@ -1863,6 +1908,235 @@ TEST_F(BackupableDBTest, ShareTableFilesWithChecksumsOldFileNaming) {
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
 }

+// Test how naming options interact with detecting DB corruption
+// between incremental backups
+TEST_F(BackupableDBTest, TableFileCorruptionBeforeIncremental) {
+  const auto share_no_checksum = static_cast<ShareFilesNaming>(0);
+
+  for (bool corrupt_before_first_backup : {false, true}) {
+    for (ShareFilesNaming option :
+         {share_no_checksum, kLegacyCrc32cAndFileSize, kNamingDefault}) {
+      auto share =
+          option == share_no_checksum ? kShareNoChecksum : kShareWithChecksum;
+      if (option != share_no_checksum) {
+        backupable_options_->share_files_with_checksum_naming = option;
+      }
+      OpenDBAndBackupEngine(true, false, share);
+      DBImpl* dbi = static_cast<DBImpl*>(db_.get());
+      // A small SST file
+      ASSERT_OK(dbi->Put(WriteOptions(), "x", "y"));
+      ASSERT_OK(dbi->Flush(FlushOptions()));
+      // And a bigger one
+      ASSERT_OK(dbi->Put(WriteOptions(), "y", Random(42).RandomString(500)));
+      ASSERT_OK(dbi->Flush(FlushOptions()));
+      dbi->TEST_WaitForFlushMemTable();
+      CloseDBAndBackupEngine();
+
+      std::vector<FileAttributes> table_files;
+      ASSERT_OK(GetTableFilesInDB(&table_files));
+      ASSERT_EQ(table_files.size(), 2);
+      std::string tf0 = dbname_ + "/" + table_files[0].name;
+      std::string tf1 = dbname_ + "/" + table_files[1].name;
+
+      if (corrupt_before_first_backup) {
+        // This corrupts a data block, which does not cause DB open
+        // failure, only failure on accessing the block.
+        ASSERT_OK(db_file_manager_->CorruptFileStart(tf0));
+      }
+
+      OpenDBAndBackupEngine(false, false, share);
+      ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
+      CloseDBAndBackupEngine();
+
+      // if corrupt_before_first_backup, this undoes the initial corruption
+      ASSERT_OK(db_file_manager_->CorruptFileStart(tf0));
+
+      OpenDBAndBackupEngine(false, false, share);
+      Status s = backup_engine_->CreateNewBackup(db_.get());
+
+      // Even though none of the naming options catch the inconsistency
+      // between the first and second time backing up fname, in the case
+      // of kUseDbSessionId (kNamingDefault), this is an intentional
+      // trade-off to avoid full scan of files from the DB that are
+      // already backed up. If we did the scan, kUseDbSessionId could catch
+      // the corruption. kLegacyCrc32cAndFileSize does the scan (to
+      // compute checksum for name) without catching the corruption,
+      // because the corruption means the names don't merge.
+      EXPECT_OK(s);
+
+      // VerifyBackup doesn't check DB integrity or table file internal
+      // checksums
+      EXPECT_OK(backup_engine_->VerifyBackup(1, true));
+      EXPECT_OK(backup_engine_->VerifyBackup(2, true));
+
+      db_.reset();
+      ASSERT_OK(backup_engine_->RestoreDBFromBackup(2, dbname_, dbname_));
+      {
+        DB* db = OpenDB();
+        s = db->VerifyChecksum();
+        delete db;
+      }
+      if (option != kLegacyCrc32cAndFileSize && !corrupt_before_first_backup) {
+        // Second backup is OK because it used (uncorrupt) file from first
+        // backup instead of (corrupt) file from DB.
+        // This is arguably a good trade-off vs. treating the file as distinct
+        // from the old version, because a file should be more likely to be
+        // corrupt as it ages. Although the backed-up file might also corrupt
+        // with age, the alternative approach (checksum in file name computed
+        // from current DB file contents) wouldn't detect that case at backup
+        // time either. Although you would have both copies of the file with
+        // the alternative approach, that would only last until the older
+        // backup is deleted.
+        ASSERT_OK(s);
+      } else if (option == kLegacyCrc32cAndFileSize &&
+                 corrupt_before_first_backup) {
+        // Second backup is OK because it saved the updated (uncorrupt)
+        // file from DB, instead of the sharing with first backup.
+        // Recall: if corrupt_before_first_backup, [second CorruptFileStart]
+        // undoes the initial corruption.
+        // This is arguably a bad trade-off vs. sharing the old version of the
+        // file because a file should be more likely to corrupt as it ages.
+        // (Not likely that the previously backed-up version was already
+        // corrupt and the new version is non-corrupt. This approach doesn't
+        // help if backed-up version is corrupted after taking the backup.)
+        ASSERT_OK(s);
+      } else {
+        // Something is legitimately corrupted, but we can't be sure what
+        // with information available (TODO? unless one passes block checksum
+        // test and other doesn't. Probably better to use end-to-end full file
+        // checksum anyway.)
+        ASSERT_TRUE(s.IsCorruption());
+      }
+
+      CloseDBAndBackupEngine();
+      ASSERT_OK(DestroyDB(dbname_, options_));
+    }
+  }
+}
+
+// Test how naming options interact with detecting file size corruption
+// between incremental backups
+TEST_F(BackupableDBTest, FileSizeForIncremental) {
+  const auto share_no_checksum = static_cast<ShareFilesNaming>(0);
+
+  for (ShareFilesNaming option : {share_no_checksum, kLegacyCrc32cAndFileSize,
+                                  kNamingDefault, kUseDbSessionId}) {
+    auto share =
+        option == share_no_checksum ? kShareNoChecksum : kShareWithChecksum;
+    if (option != share_no_checksum) {
+      backupable_options_->share_files_with_checksum_naming = option;
+    }
+    OpenDBAndBackupEngine(true, false, share);
+
+    std::vector<FileAttributes> children;
+    const std::string shared_dir =
+        backupdir_ +
+        (option == share_no_checksum ? "/shared" : "/shared_checksum");
+
+    // A single small SST file
+    ASSERT_OK(db_->Put(WriteOptions(), "x", "y"));
+
+    // First, test that we always detect file size corruption on the shared
+    // backup side on incremental. (Since sizes aren't really part of backup
+    // meta file, this works by querying the filesystem for the sizes.)
+    ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), true /*flush*/));
+    CloseDBAndBackupEngine();
+
+    // Corrupt backup SST
+    ASSERT_OK(file_manager_->GetChildrenFileAttributes(shared_dir, &children));
+    ASSERT_EQ(children.size(), 3U);  // ".", "..", one sst
+    for (const auto& child : children) {
+      if (child.name.size() > 4 && child.size_bytes > 0) {
+        ASSERT_OK(
+            file_manager_->WriteToFile(shared_dir + "/" + child.name, "asdf"));
+        break;
+      }
+    }
+
+    OpenDBAndBackupEngine(false, false, share);
+    Status s = backup_engine_->CreateNewBackup(db_.get());
+    EXPECT_TRUE(s.IsCorruption());
+
+    ASSERT_OK(backup_engine_->PurgeOldBackups(0));
+    CloseDBAndBackupEngine();
+
+    // Second, test that a hypothetical db session id collision would likely
+    // not suffice to corrupt a backup, because there's a good chance of
+    // file size difference (in this test, guaranteed) so either no name
+    // collision or detected collision.
+
+    // Create backup 1
+    OpenDBAndBackupEngine(false, false, share);
+    ASSERT_OK(backup_engine_->CreateNewBackup(db_.get()));
+
+    // Even though we have "the same" DB state as backup 1, we need
+    // to restore to recreate the same conditions as later restore.
+    db_.reset();
+    ASSERT_OK(DestroyDB(dbname_, options_));
+    ASSERT_OK(backup_engine_->RestoreDBFromBackup(1, dbname_, dbname_));
+    CloseDBAndBackupEngine();
+
+    // Forge session id
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
+        "DBImpl::SetDbSessionId", [](void* sid_void_star) {
+          std::string* sid = static_cast<std::string*>(sid_void_star);
+          *sid = "01234567890123456789";
+        });
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
+
+    // Create another SST file
+    OpenDBAndBackupEngine(false, false, share);
+    ASSERT_OK(db_->Put(WriteOptions(), "y", "x"));
+
+    // Create backup 2
+    ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), true /*flush*/));
+
+    // Restore backup 1 (again)
+    db_.reset();
+    ASSERT_OK(DestroyDB(dbname_, options_));
+    ASSERT_OK(backup_engine_->RestoreDBFromBackup(1, dbname_, dbname_));
+    CloseDBAndBackupEngine();
+
+    // Create another SST file with same number and db session id, only bigger
+    OpenDBAndBackupEngine(false, false, share);
+    ASSERT_OK(db_->Put(WriteOptions(), "y", Random(42).RandomString(500)));
+
+    // Count backup SSTs
+    children.clear();
+    ASSERT_OK(file_manager_->GetChildrenFileAttributes(shared_dir, &children));
+    ASSERT_EQ(children.size(), 4U);  // ".", "..", two sst
+
+    // Try create backup 3
+    s = backup_engine_->CreateNewBackup(db_.get(), true /*flush*/);
+
+    // Re-count backup SSTs
+    children.clear();
+    ASSERT_OK(file_manager_->GetChildrenFileAttributes(shared_dir, &children));
+
+    if (option == kUseDbSessionId) {
+      // Acceptable to call it corruption if size is not in name and
+      // db session id collision is practically impossible.
+      EXPECT_TRUE(s.IsCorruption());
+      EXPECT_EQ(children.size(), 4U);  // no SST added
+    } else if (option == share_no_checksum) {
+      // Good to call it corruption if both backups cannot be
+      // accommodated.
+      EXPECT_TRUE(s.IsCorruption());
+      EXPECT_EQ(children.size(), 4U);  // no SST added
+    } else {
+      // Since opening a DB seems sufficient for detecting size corruption
+      // on the DB side, this should be a good thing, ...
+      EXPECT_OK(s);
+      // ... as long as we did actually treat it as a distinct SST file.
+      EXPECT_EQ(children.size(), 5U);  // Another SST added
+    }
+    CloseDBAndBackupEngine();
+    ASSERT_OK(DestroyDB(dbname_, options_));
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
+  }
+}
+
 // Verify backup and restore with share_files_with_checksum off and then
 // transition this option to on and share_files_with_checksum_naming to be
 // based on kUseDbSessionId