rocksdb/utilities
cheng-chang 2eaad9e1b1 Skip WALs according to MinLogNumberToKeep when creating checkpoint (#7789)
Summary:
In a stress test failure, we observe that a WAL is skipped when creating checkpoint, although its log number >= MinLogNumberToKeep(). This might happen in the following case:

1. when creating the checkpoint, there are 2 column families: CF0 and CF1, and there are 2 WALs: 1, 2;
2. CF0's log number is 1, CF0's active memtable is empty, CF1's log number is 2, CF1's active memtable is not empty, WAL 2 is not empty, the sequence number points to WAL 2;
2. the checkpoint process flushes CF0, since CF0' active memtable is empty, there is no need to SwitchMemtable, thus no new WAL will be created, so CF0's log number is now 2, concurrently, some data is written to CF0 and WAL 2;
3. the checkpoint process flushes CF1, WAL 3 is created and CF1's log number is now 3, CF0's log number is still 2 because CF0 is not empty and WAL 2 contains its unflushed data concurrently written in step 2;
4.  the checkpoint process determines that WAL 1 and 2 are no longer needed according to [live_wal_files[i]->StartSequence() >= *sequence_number](https://github.com/facebook/rocksdb/blob/master/utilities/checkpoint/checkpoint_impl.cc#L388), so it skips linking them to the checkpoint directory;
5. but according to `MinLogNumberToKeep()`, WAL 2 still needs to be kept because CF0's log number is 2.

If the checkpoint is reopened in read-only mode, and only read from the snapshot with the initial sequence number, then there will be no data loss or data inconsistency.

But if the checkpoint is reopened and read from the most recent sequence number, suppose in step 3, there are also data concurrently written to CF1 and WAL 3, then the most recent sequence number refers to the latest entry in WAL 3, so the data written in step 2 should also be visible, but since WAL 2 is discarded, those data are lost.

When tracking WAL in MANIFEST is enabled, when reopening the checkpoint, since WAL 2 is still tracked in MANIFEST as alive, but it's missing from the checkpoint directory, a corruption will be reported.

This PR makes the checkpoint process to only skip a WAL if its log number < `MinLogNumberToKeep`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7789

Test Plan: watch existing tests to pass.

Reviewed By: ajkr

Differential Revision: D25662346

Pulled By: cheng-chang

fbshipit-source-id: 136471095baa01886cf44809455cf855f24857a0
2020-12-23 14:22:45 -08:00
..
backupable Skip WALs according to MinLogNumberToKeep when creating checkpoint (#7789) 2020-12-23 14:22:45 -08:00
blob_db Handling misuse of snprintf return value (#7686) 2020-12-07 13:43:55 -08:00
cassandra Remove unused includes (#7604) 2020-10-28 23:22:27 -07:00
checkpoint Skip WALs according to MinLogNumberToKeep when creating checkpoint (#7789) 2020-12-23 14:22:45 -08:00
compaction_filters Compaction filter support for BlobDB (#6850) 2020-06-29 17:32:14 -07:00
convenience Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
leveldb_options Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
memory Add few unit test cases in ASSERT_STATUS_CHECKED (#7500) 2020-10-08 11:22:44 -07:00
merge_operators Remove unused includes (#7604) 2020-10-28 23:22:27 -07:00
option_change_migration Whole DBTest to skip fsync (#7274) 2020-08-17 18:42:25 -07:00
options Return NotFound from TableFactory configuration errors during options loading (#7615) 2020-10-29 18:44:24 -07:00
persistent_cache Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566) 2020-10-27 10:33:09 -07:00
simulator_cache Bring the Configurable options together (#5753) 2020-09-14 17:01:01 -07:00
table_properties_collectors Trigger compaction in CompactOnDeletionCollector based on deletion ratio (#6806) 2020-05-18 08:42:05 -07:00
trace Add some simulator cache and block tracer tests to ASSERT_STATUS_CHECKED (#7305) 2020-08-24 16:43:31 -07:00
transactions Apply the changes from: PS-5501 : Re-license PerconaFT 'locktree' to Apache V2 (#7801) 2020-12-23 14:20:51 -08:00
ttl DBWithTTL::Open() param ttls: vector<int32_t> to const vector<int32_t>& (#7196) 2020-08-24 16:24:16 -07:00
write_batch_with_index Invalidate iterator on transaction clear (#7733) 2020-12-09 19:13:22 -08:00
debug.cc In ParseInternalKey(), include corrupt key info in Status (#7515) 2020-10-28 10:12:58 -07:00
env_librados_test.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
env_librados.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
env_librados.md Add EnvLibrados - RocksDB Env of RADOS (#1222) 2016-07-21 11:16:34 -07:00
env_mirror_test.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
env_mirror.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
env_timed_test.cc Make env*_test work with ASSERT_STATUS_CHECKED (#7176) 2020-07-28 22:59:48 -07:00
env_timed.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
fault_injection_env.cc Add further tests to ASSERT_STATUS_CHECKED (2) (#7698) 2020-12-09 21:21:16 -08:00
fault_injection_env.h Status check enforcement for error_handler_fs_test (#7342) 2020-10-02 16:41:13 -07:00
fault_injection_fs.cc Inject the random write error to stress test (#7653) 2020-12-17 11:52:28 -08:00
fault_injection_fs.h Inject the random write error to stress test (#7653) 2020-12-17 11:52:28 -08:00
merge_operators.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
object_registry_test.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
object_registry.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
util_merge_operators_test.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00