rocksdb/db
Yanqin Jin e0c84aa0dc Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.

The race condition is between two background flush threads trying to install flush results to the MANIFEST.

Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data backed by 6.log.

1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst

```
Time  BgFlushThread1                                    BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                                     mutex_.Lock()
 |                                                     precompute min_wal_to_keep as 6
 |                                                     join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                                     wake up and mutex_.Lock()
 |                                                     cfd0->log_number = 8
 |                                                     FindObsoleteFiles() with job_context->log_number == 7
 |                                                     mutex_.Unlock()
 |                                                     PurgeObsoleteFiles() deletes 6.log
 V
```

As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.

We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715

Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```

Reviewed By: ltamasi

Differential Revision: D34984412

Pulled By: riversand963

fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
2022-03-23 19:41:31 -07:00
..
blob Add rate limiter priority to ReadOptions (#9424) 2022-02-16 23:18:14 -08:00
compaction Fix a bug in PosixClock (#9695) 2022-03-21 16:11:02 -07:00
db_impl Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
arena_wrapped_db_iter.cc fix: Reusing-Iterator reads stale keys after DeleteRange() performed (#9258) 2022-03-15 09:50:21 -07:00
arena_wrapped_db_iter.h Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
builder.cc Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
builder.h Expose blob file information through the EventListener interface (#8675) 2021-09-16 17:23:36 -07:00
c_test.c fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471) 2022-03-21 12:04:33 -07:00
c.cc Fix a major performance bug in 7.0 re: filter compatibility (#9736) 2022-03-23 10:00:54 -07:00
column_family_test.cc More refactoring ahead of footer & meta changes (#9240) 2021-12-10 08:13:26 -08:00
column_family.cc DisableManualCompaction may fail to cancel an unscheduled task (#9659) 2022-03-12 20:07:04 -08:00
column_family.h Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410) 2022-03-11 16:13:23 -08:00
compact_files_test.cc Fix test race conditions with OnFlushCompleted() (#9617) 2022-02-22 12:23:00 -08:00
comparator_db_test.cc More refactoring ahead of footer & meta changes (#9240) 2021-12-10 08:13:26 -08:00
convenience.cc Add rate limiter priority to ReadOptions (#9424) 2022-02-16 23:18:14 -08:00
corruption_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
cuckoo_table_db_test.cc Experimental support for SST unique IDs (#8990) 2021-10-18 23:32:01 -07:00
db_basic_test.cc Fix some MultiGet batching stats (#9583) 2022-02-17 16:31:41 -08:00
db_block_cache_test.cc Enhance new cache key testing & comments (#9329) 2022-02-04 14:15:58 -08:00
db_bloom_filter_test.cc Fix a major performance bug in 7.0 re: filter compatibility (#9736) 2022-03-23 10:00:54 -07:00
db_compaction_filter_test.cc Fix a minor issue with initializing the test path (#8555) 2021-07-23 08:38:45 -07:00
db_compaction_test.cc Fix a race condition when disable and enable manual compaction (#9694) 2022-03-15 12:31:14 -07:00
db_dynamic_level_test.cc Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452) 2022-01-27 13:01:09 -08:00
db_encryption_test.cc Fix a minor issue with initializing the test path (#8555) 2021-07-23 08:38:45 -07:00
db_filesnapshot.cc Use a sorted vector instead of a map to store blob file metadata (#9526) 2022-02-09 12:36:43 -08:00
db_flush_test.cc Fix mempurge crash reported in #8958 (#9671) 2022-03-10 15:16:55 -08:00
db_info_dumper.cc Allow WAL dir to change with db dir (#8582) 2021-07-30 12:16:44 -07:00
db_info_dumper.h Add a DB Session ID (#6959) 2020-06-15 10:47:02 -07:00
db_inplace_update_test.cc fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471) 2022-03-21 12:04:33 -07:00
db_io_failure_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
db_iter_stress_test.cc Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262) 2021-05-05 14:00:17 -07:00
db_iter_test.cc Remove iter_start_seqnum and preserve_deletes (#9430) 2022-01-28 13:28:38 -08:00
db_iter.cc Remove iter_start_seqnum and preserve_deletes (#9430) 2022-01-28 13:28:38 -08:00
db_iter.h Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
db_iterator_test.cc Fix a minor issue with initializing the test path (#8555) 2021-07-23 08:38:45 -07:00
db_kv_checksum_test.cc Revise APIs related to user-defined timestamp (#8946) 2022-02-01 22:19:01 -08:00
db_log_iter_test.cc Attempt to deflake DBTestXactLogIterator.TransactionLogIteratorCorruptedLog (#8627) 2021-08-10 11:10:07 -07:00
db_logical_block_size_cache_test.cc Attempt to deflake DBLogicalBlockSizeCacheTest.CreateColumnFamilies (#9516) 2022-03-04 11:35:28 -08:00
db_memtable_test.cc Fix a minor issue with initializing the test path (#8555) 2021-07-23 08:38:45 -07:00
db_merge_operand_test.cc Fix PinSelf() read-after-free in DB::GetMergeOperands() (#9507) 2022-02-15 12:25:18 -08:00
db_merge_operator_test.cc Fix a minor issue with initializing the test path (#8555) 2021-07-23 08:38:45 -07:00
db_options_test.cc Remove deprecated option new_table_reader_for_compaction_inputs (#9443) 2022-02-08 19:31:28 -08:00
db_properties_test.cc compression_per_level should be used for flush and changeable (#9658) 2022-03-07 18:06:19 -08:00
db_range_del_test.cc fix: Reusing-Iterator reads stale keys after DeleteRange() performed (#9258) 2022-03-15 09:50:21 -07:00
db_rate_limiter_test.cc Rate-limit automatic WAL flush after each user write (#9607) 2022-03-08 13:19:39 -08:00
db_secondary_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
db_sst_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
db_statistics_test.cc Bytes read stat for VerifyChecksum() and VerifyFileChecksums() APIs (#8741) 2021-09-07 13:28:29 -07:00
db_table_properties_test.cc Fix backward compatibility with 2.5 through 2.7 (#9189) 2021-11-19 17:31:01 -08:00
db_tailing_iter_test.cc Fix a minor issue with initializing the test path (#8555) 2021-07-23 08:38:45 -07:00
db_test2.cc Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
db_test_util.cc Use a sorted vector instead of a map to store blob file metadata (#9526) 2022-02-09 12:36:43 -08:00
db_test_util.h Update Cache::Release param from force_erase to erase_if_last_ref (#9728) 2022-03-22 10:22:18 -07:00
db_test.cc Fix spelling in public API (#9490) 2022-02-03 15:15:23 -08:00
db_universal_compaction_test.cc Adhere to per-DB concurrency limit when bottom-pri compactions exist (#9179) 2021-11-18 17:31:50 -08:00
db_wal_test.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
db_with_timestamp_basic_test.cc Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410) 2022-03-11 16:13:23 -08:00
db_with_timestamp_compaction_test.cc Use the comparator from the sst file table properties in sst_dump_tool (#9491) 2022-02-08 12:15:35 -08:00
db_write_buffer_manager_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
db_write_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
dbformat_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
dbformat.cc Track per-SST user-defined timestamp information in MANIFEST (#9092) 2021-11-10 10:49:04 -08:00
dbformat.h Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410) 2022-03-11 16:13:23 -08:00
deletefile_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
error_handler_fs_test.cc Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
error_handler.cc Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
error_handler.h Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
event_helpers.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
event_helpers.h Add a listener callback for end of auto error recovery (#9244) 2021-12-08 14:30:57 -08:00
experimental.cc Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
external_sst_file_basic_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
external_sst_file_ingestion_job.cc Do not rely on ADL when invoking std::max_element (#9608) 2022-03-02 17:41:02 -08:00
external_sst_file_ingestion_job.h New stable, fixed-length cache keys (#9126) 2021-12-16 17:15:13 -08:00
external_sst_file_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
fault_injection_test.cc Fix a bug causing duplicate trailing entries in WritableFile (buffered IO) (#9236) 2021-12-13 09:00:36 -08:00
file_indexer_test.cc
file_indexer.cc
file_indexer.h
filename_test.cc Remove unused includes (#7604) 2020-10-28 23:22:27 -07:00
flush_job_test.cc Use the comparator from the sst file table properties in sst_dump_tool (#9491) 2022-02-08 12:15:35 -08:00
flush_job.cc Fix a bug in PosixClock (#9695) 2022-03-21 16:11:02 -07:00
flush_job.h Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
flush_scheduler.cc
flush_scheduler.h Include C++ standard library headers instead of C compatibility headers (#8068) 2021-03-19 12:09:47 -07:00
forward_iterator_bench.cc Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
forward_iterator.cc Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
forward_iterator.h Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
history_trimming_iterator.h Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410) 2022-03-11 16:13:23 -08:00
import_column_family_job.cc Add Temperature info in NewSequentialFile() (#9499) 2022-02-18 18:23:07 -08:00
import_column_family_job.h New stable, fixed-length cache keys (#9126) 2021-12-16 17:15:13 -08:00
import_column_family_test.cc Fix a minor issue with initializing the test path (#8555) 2021-07-23 08:38:45 -07:00
internal_stats.cc Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
internal_stats.h Support GetMapProperty() with "rocksdb.dbstats" (#9057) 2021-10-20 13:17:00 -07:00
job_context.h Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
kv_checksum.h fix compile errors in db/kv_checksum.h (#9173) 2021-11-16 10:20:50 -08:00
listener_test.cc Fix test race conditions with OnFlushCompleted() (#9617) 2022-02-22 12:23:00 -08:00
log_format.h Add record to set WAL compression type if enabled (#9556) 2022-02-17 16:19:31 -08:00
log_reader.cc Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
log_reader.h Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
log_test.cc Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
log_writer.cc Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
log_writer.h Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
logs_with_prep_tracker.cc
logs_with_prep_tracker.h Include C++ standard library headers instead of C compatibility headers (#8068) 2021-03-19 12:09:47 -07:00
lookup_key.h Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
malloc_stats.cc Replace most typedef with using= (#8751) 2021-09-07 11:31:59 -07:00
malloc_stats.h
manual_compaction_test.cc Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
memtable_list_test.cc Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
memtable_list.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
memtable_list.h Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
memtable.cc Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453) 2022-01-27 14:55:04 -08:00
memtable.h Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453) 2022-01-27 14:55:04 -08:00
merge_context.h Add Merge Operator support to WriteBatchWithIndex (#8135) 2021-05-10 12:50:25 -07:00
merge_helper_test.cc Support readahead during compaction for blob files (#9187) 2021-11-19 17:53:47 -08:00
merge_helper.cc Support readahead during compaction for blob files (#9187) 2021-11-19 17:53:47 -08:00
merge_helper.h Support readahead during compaction for blob files (#9187) 2021-11-19 17:53:47 -08:00
merge_operator.cc
merge_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
obsolete_files_test.cc Add commit marker with timestamp (#9266) 2021-12-10 11:05:35 -08:00
options_file_test.cc No elide constructors (#7798) 2020-12-23 16:55:53 -08:00
output_validator.cc Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
output_validator.h Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
perf_context_test.cc Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033) 2021-03-15 04:34:11 -07:00
periodic_work_scheduler_test.cc Fix a minor issue with initializing the test path (#8555) 2021-07-23 08:38:45 -07:00
periodic_work_scheduler.cc Fix a timer crash caused by invalid memory management (#9656) 2022-03-12 11:45:56 -08:00
periodic_work_scheduler.h Fix a timer crash caused by invalid memory management (#9656) 2022-03-12 11:45:56 -08:00
pinned_iterators_manager.h Replace most typedef with using= (#8751) 2021-09-07 11:31:59 -07:00
plain_table_db_test.cc Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
pre_release_callback.h Fix and detect headers with missing dependencies (#8893) 2021-09-10 10:00:26 -07:00
prefix_test.cc Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033) 2021-03-15 04:34:11 -07:00
range_del_aggregator_bench.cc Cleanup multiple implementations of VectorIterator (#8901) 2021-10-06 07:48:31 -07:00
range_del_aggregator_test.cc Cleanup multiple implementations of VectorIterator (#8901) 2021-10-06 07:48:31 -07:00
range_del_aggregator.cc In ParseInternalKey(), include corrupt key info in Status (#7515) 2020-10-28 10:12:58 -07:00
range_del_aggregator.h Fix some typos in comments (#8066) 2021-03-25 21:18:08 -07:00
range_tombstone_fragmenter_test.cc Cleanup multiple implementations of VectorIterator (#8901) 2021-10-06 07:48:31 -07:00
range_tombstone_fragmenter.cc Added memtable garbage statistics (#8411) 2021-06-18 04:57:27 -07:00
range_tombstone_fragmenter.h Added memtable garbage statistics (#8411) 2021-06-18 04:57:27 -07:00
read_callback.h Fix and detect headers with missing dependencies (#8893) 2021-09-10 10:00:26 -07:00
repair_test.cc Some fixes and enhancements to ldb repair (#8544) 2021-07-28 16:44:14 -07:00
repair.cc Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
snapshot_checker.h
snapshot_impl.cc
snapshot_impl.h Fix and detect headers with missing dependencies (#8893) 2021-09-10 10:00:26 -07:00
table_cache.cc fix a bug of the ticker NO_FILE_OPENS (#9677) 2022-03-15 09:55:49 -07:00
table_cache.h Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
table_properties_collector_test.cc Improve / clean up meta block code & integrity (#9163) 2021-11-18 11:43:44 -08:00
table_properties_collector.cc Apply sample_for_compression to all block-based tables (#8105) 2021-03-25 15:00:45 -07:00
table_properties_collector.h Track each SST's timestamp information as user properties (#9093) 2021-11-19 11:37:06 -08:00
transaction_log_impl.cc Add commit marker with timestamp (#9266) 2021-12-10 11:05:35 -08:00
transaction_log_impl.h Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
trim_history_scheduler.cc
trim_history_scheduler.h
version_builder_test.cc Use a sorted vector instead of a map to store blob file metadata (#9526) 2022-02-09 12:36:43 -08:00
version_builder.cc Use a sorted vector instead of a map to store blob file metadata (#9526) 2022-02-09 12:36:43 -08:00
version_builder.h Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
version_edit_handler.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
version_edit_handler.h Fixed manifest_dump issues when printing keys and values containing null characters (#8378) 2021-06-10 12:55:20 -07:00
version_edit_test.cc File temperature information should be preserved when restart the DB (#9242) 2021-12-03 14:43:14 -08:00
version_edit.cc Print file checksum in hex (#9196) 2021-11-22 09:30:47 -08:00
version_edit.h Clean up VersionStorageInfo a bit (#9494) 2022-02-04 08:19:20 -08:00
version_set_test.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
version_set.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
version_set.h Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
version_util.h Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
wal_edit_test.cc Always track WAL obsoletion (#7759) 2020-12-09 16:02:12 -08:00
wal_edit.cc Always track WAL obsoletion (#7759) 2020-12-09 16:02:12 -08:00
wal_edit.h Always track WAL obsoletion (#7759) 2020-12-09 16:02:12 -08:00
wal_manager_test.cc Make SystemClock into a Customizable Class (#8636) 2021-09-21 09:23:48 -07:00
wal_manager.cc Allow WAL dir to change with db dir (#8582) 2021-07-30 12:16:44 -07:00
wal_manager.h Allow WAL dir to change with db dir (#8582) 2021-07-30 12:16:44 -07:00
write_batch_base.cc
write_batch_internal.h Support WBWI for keys having timestamps (#9603) 2022-02-22 14:23:01 -08:00
write_batch_test.cc Support WBWI for keys having timestamps (#9603) 2022-02-22 14:23:01 -08:00
write_batch.cc Improve stress test for transactions (#9568) 2022-03-16 19:00:04 -07:00
write_callback_test.cc Move slow valgrind tests behind -DROCKSDB_FULL_VALGRIND_RUN (#8475) 2021-07-07 11:14:05 -07:00
write_callback.h
write_controller_test.cc Revamp WriteController (#8064) 2021-03-18 09:47:31 -07:00
write_controller.cc Revamp WriteController (#8064) 2021-03-18 09:47:31 -07:00
write_controller.h Revamp WriteController (#8064) 2021-03-18 09:47:31 -07:00
write_thread.cc Rate-limit automatic WAL flush after each user write (#9607) 2022-03-08 13:19:39 -08:00
write_thread.h Rate-limit automatic WAL flush after each user write (#9607) 2022-03-08 13:19:39 -08:00