Fix WAL corruption from checkpoint/backup race condition

Summary: `Writer::WriteBuffer` was always called at the beginning of checkpoint/backup. That log writer has no internal synchronization, so when a checkpoint/backup thread raced with a user write thread, the same buffer could be flushed twice, duplicating a WAL entry. Subsequent WAL entries then landed at unexpected offsets and overlapped the 32KB block boundaries, which manifested as corruption.

This PR changes checkpoint/backup to call `WriteBuffer` (via `FlushWAL`) only when manual WAL flush is enabled. In that case, users are responsible for providing synchronization between WAL flushes. We can also consider removing the call entirely.

Closes https://github.com/facebook/rocksdb/pull/3603

Differential Revision: D7277447

Pulled By: ajkr

fbshipit-source-id: 1b15bd7fd930511222b075418c10de0aaa70a35a
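The failure mode in the summary can be illustrated with a small deterministic model. This is only a sketch under stated assumptions: `ToyLogWriter`, `SnapshotBuffer`, `WriteSnapshot`, and the two helper functions are hypothetical names invented for illustration and are not RocksDB's classes or its actual flush path. The interleaving is replayed by hand in a single thread, so the outcome does not depend on thread timing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model only: every name below is hypothetical, not RocksDB's real code.
constexpr std::size_t kBlockSize = 32 * 1024;  // 32KB block size from the summary

struct ToyLogWriter {
  std::string buffer;  // bytes appended but not yet flushed
  std::string file;    // bytes that have reached the "file"

  void Append(const std::string& record) { buffer += record; }

  // Step 1 of an unsynchronized flush: read what is currently buffered.
  std::string SnapshotBuffer() const { return buffer; }

  // Step 2: write the snapshot out and drop the buffered bytes.
  void WriteSnapshot(const std::string& snapshot) {
    file += snapshot;
    buffer.clear();
  }
};

// Bytes remaining in the 32KB block that contains the given file offset.
std::size_t BytesLeftInBlock(std::size_t offset) {
  return kBlockSize - (offset % kBlockSize);
}

// Replays the bad interleaving: both flushers read the buffer before either
// one clears it, so the same record is written twice.
std::size_t FlushedSizeAfterRace(std::size_t record_size) {
  ToyLogWriter w;
  w.Append(std::string(record_size, 'x'));
  std::string user_view = w.SnapshotBuffer();        // user write thread
  std::string checkpoint_view = w.SnapshotBuffer();  // checkpoint/backup thread
  w.WriteSnapshot(user_view);
  w.WriteSnapshot(checkpoint_view);
  return w.file.size();
}

// Replays the serialized case: the second flusher starts only after the first
// has finished, sees an empty buffer, and writes nothing extra.
std::size_t FlushedSizeSerialized(std::size_t record_size) {
  ToyLogWriter w;
  w.Append(std::string(record_size, 'x'));
  w.WriteSnapshot(w.SnapshotBuffer());
  w.WriteSnapshot(w.SnapshotBuffer());
  return w.file.size();
}
```

In this model a 100-byte record occupies 200 bytes of the file after the racy interleaving. A writer that believes it is at offset 100 would compute a different `BytesLeftInBlock` than the true offset 200 gives, which is the kind of mismatch the summary describes as entries overlapping the 32KB block boundaries.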
2018-03-16 14:15:53 -07:00 · 2018-03-16 14:15:53 -07:00 · 06858ac58d
commit 06858ac58d
parent dbd8fa09b8
2 changed files with 7 additions and 1 deletions
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -1,4 +1,8 @@
 # Rocksdb Change Log
+## Unreleased
+### Bug Fixes
+* Fix WAL corruption caused by race condition between user write thread and backup/checkpoint thread.
+
 ## 5.11.2 (02/24/2018)
 ### Bug Fixes
 * Fix bug in iterator readahead causing blocks to incorrectly be considered truncated (corrupted).
--- a/utilities/checkpoint/checkpoint_impl.cc
+++ b/utilities/checkpoint/checkpoint_impl.cc
@@ -207,7 +207,9 @@ Status CheckpointImpl::CreateCustomCheckpoint(

    TEST_SYNC_POINT("CheckpointImpl::CreateCheckpoint:SavedLiveFiles1");
    TEST_SYNC_POINT("CheckpointImpl::CreateCheckpoint:SavedLiveFiles2");
-    db_->FlushWAL(false /* sync */);
+    if (db_options.manual_wal_flush) {
+      db_->FlushWAL(false /* sync */);
+    }
  }
  // if we have more than one column family, we need to also get WAL files
  if (s.ok()) {