From eeea27a0481658419e99433abf40fbe598d44d7a Mon Sep 17 00:00:00 2001 From: Cheng Chang Date: Thu, 7 Jan 2021 23:00:12 -0800 Subject: [PATCH] Get manifest size again after getting min_log_num during checkpoint (#7836) Summary: Currently, manifest size is determined before getting min_log_num. But between getting manifest size and getting min_log_num, concurrently, a flush might succeed, which will write new records to manifest to make some WALs become outdated, then min_log_num will be correspondingly increased, but the new records in manifest will not be copied into the checkpoint because the manifest's size is determined before them, then the newly outdated WALs will still exist in the checkpoint's manifest, but they are not linked/copied to the checkpoint because their log number is < min_log_num, so a corruption of missing WAL will be reported when restoring from the checkpoint. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7836 Test Plan: make crash_test Reviewed By: ajkr Differential Revision: D25788204 Pulled By: cheng-chang fbshipit-source-id: a4e5acf30f08270b3c0a95304ff559a9e655252f --- utilities/checkpoint/checkpoint_impl.cc | 32 ++++++++++--------------- 1 file changed, 12 insertions(+), 20 deletions(-) diff --git a/utilities/checkpoint/checkpoint_impl.cc b/utilities/checkpoint/checkpoint_impl.cc index d496da57c..b533c1d3d 100644 --- a/utilities/checkpoint/checkpoint_impl.cc +++ b/utilities/checkpoint/checkpoint_impl.cc @@ -235,26 +235,18 @@ Status CheckpointImpl::CreateCustomCheckpoint( if (!db_->GetIntProperty(DB::Properties::kMinLogNumberToKeep, &min_log_num)) { return Status::InvalidArgument("cannot get the min log number to keep."); } - - if (s.ok() && db_options.allow_2pc) { - // We need to refetch live files with flush to handle this case: - // A previous 000001.log contains the prepare record of transaction tnx1. - // The current log file is 000002.log, and sequence_number points to this - // file. - // After calling GetLiveFiles(), 000003.log is created. - // Then tnx1 is committed. The commit record is written to 000003.log. - // Now we fetch min_log_num, which will be 3. - // Then only 000002.log and 000003.log will be copied, and 000001.log will - // be skipped. 000003.log contains commit message of tnx1, but we don't - // have respective prepare record for it. - // In order to avoid this situation, we need to force flush to make sure - // all transactions committed before getting min_log_num will be flushed - // to SST files. - // We cannot get min_log_num before calling the GetLiveFiles() for the - // first time, because if we do that, all the logs files will be included, - // far more than needed. - s = db_->GetLiveFiles(live_files, &manifest_file_size, flush_memtable); - } + // Between GetLiveFiles and getting min_log_num, flush might happen + // concurrently, so new WAL deletions might be tracked in MANIFEST. If we do + // not get the new MANIFEST size, the deleted WALs might not be reflected in + // the checkpoint's MANIFEST. + // + // If we get min_log_num before the above GetLiveFiles, then there might + // be too many unnecessary WALs to be included in the checkpoint. + // + // Ideally, min_log_num should be got together with manifest_file_size in + // GetLiveFiles atomically. But that needs changes to GetLiveFiles' signature + // which is a public API. + s = db_->GetLiveFiles(live_files, &manifest_file_size, flush_memtable); TEST_SYNC_POINT("CheckpointImpl::CreateCheckpoint:SavedLiveFiles1"); TEST_SYNC_POINT("CheckpointImpl::CreateCheckpoint:SavedLiveFiles2");