rocksdb/db
Little-Wallace ec3e3c3e02 Fix corruption with intra-L0 on ingested files (#5958)
Summary:
## Problem Description

Our process was abort when it call `CheckConsistency`. And the information in  `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ".  Here are the causes of the accident I investigated.

* RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested.
* When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number.
* `CheckConsistency`  determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal.
* If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst,  the `smallest_seqno` of this new file will be smaller than his `largest_seqno`.
    * If more than one ingested file was ingested before memtable schedule flush,  and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable.  So `CheckConsistency` will return a `Corruption`.
    * If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start,  but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2).  **But there was still some data in s1  written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1.** Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files:
    > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno  < s1.largest_seqno

So I skip pick sst file which was ingested in function `FindIntraL0Compaction `

## Reason

Here is my bug report: https://github.com/facebook/rocksdb/issues/5913

There are two situations that can cause the check to fail.

### First situation:
- First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest.

- If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst.

- Then some data was put into memtable, and memtable was flushed to L0. We called this sst B.
- RocksDB check consistency , and find the `smallest_seqno` of B is  less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested.

### Secondary situaion

- First we have flushed many sst in L0,  we call them [s1, s2, s3].

- There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1.  And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1.

- m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5.

- [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction.  We call it s6.
  - compacted 4@0 files to L0
- When s6 is added into manifest,  the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5.  But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5.
   - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958

Differential Revision: D18601316

fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
2019-11-19 15:09:11 -08:00
..
compaction Fix corruption with intra-L0 on ingested files (#5958) 2019-11-19 15:09:11 -08:00
db_impl Fix IngestExternalFile's bug with two_write_queue (#5976) 2019-11-15 14:00:37 -08:00
arena_wrapped_db_iter.cc Apply formatter on recent 45 commits. (#5827) 2019-09-19 12:34:17 -07:00
arena_wrapped_db_iter.h Refactor ArenaWrappedDBIter into separate files (#5801) 2019-09-13 13:50:43 -07:00
blob_index.h Support decoding blob indexes in sst_dump (#5926) 2019-10-17 19:36:54 -07:00
builder.cc BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
builder.h Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
c_test.c Add a limited support for iteration bounds into BaseDeltaIterator (#5403) 2019-11-05 11:39:36 -08:00
c.cc Remove snap_refresh_nanos option (#5826) 2019-09-18 20:26:04 -07:00
column_family_test.cc Refactor trimming logic for immutable memtables (#5022) 2019-08-23 13:55:34 -07:00
column_family.cc Fix corruption with intra-L0 on ingested files (#5958) 2019-11-19 15:09:11 -08:00
column_family.h Bump up memory order of ref counting of ColumnFamilyData (#5723) 2019-08-20 10:34:33 -07:00
compact_files_test.cc Use aggregate initialization for FlushJobInfo/CompactionJobInfo (#5997) 2019-11-01 11:46:19 -07:00
compacted_db_impl.cc New API to get all merge operands for a Key (#5604) 2019-08-06 14:26:44 -07:00
compacted_db_impl.h Use delete to disable automatic generated methods. (#5009) 2019-09-11 18:09:00 -07:00
comparator_db_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
convenience.cc Do readahead in VerifyChecksum() (#5713) 2019-08-16 16:42:56 -07:00
corruption_test.cc Fix VerifyChecksum readahead with mmap mode (#5945) 2019-10-21 11:38:30 -07:00
cuckoo_table_db_test.cc Fix memory leak on error opening PlainTable (#5951) 2019-10-21 16:53:06 -07:00
db_basic_test.cc Fix test failure in LITE mode (#6050) 2019-11-19 10:13:24 -08:00
db_blob_index_test.cc Disable blob iterator test with max_sequential_skip_in_iterations==0 in LITE mode (#6052) 2019-11-19 15:02:41 -08:00
db_block_cache_test.cc Charge block cache for cache internal usage (#5797) 2019-09-16 15:26:21 -07:00
db_bloom_filter_test.cc Abandon use of folly::Optional (#6036) 2019-11-14 14:04:15 -08:00
db_compaction_filter_test.cc upgrade gtest 1.7.0 => 1.8.1 for json result writing 2019-09-09 11:24:11 -07:00
db_compaction_test.cc Fix corruption with intra-L0 on ingested files (#5958) 2019-11-19 15:09:11 -08:00
db_dynamic_level_test.cc Fix flaky DBDynamicLevelTest.DynamicLevelMaxBytesBase2 (#4668) 2018-11-12 16:42:16 -08:00
db_encryption_test.cc Fix EncryptedEnv assert (#5735) 2019-09-05 17:21:42 -07:00
db_filesnapshot.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
db_flush_test.cc Fix DBFlushTest::FireOnFlushCompletedAfterCommittedResult hang (#6018) 2019-11-08 13:47:29 -08:00
db_info_dumper.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
db_info_dumper.h Change RocksDB License 2017-07-15 16:11:23 -07:00
db_inplace_update_test.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
db_io_failure_test.cc Disable DBIOFailureTest.NoSpaceCompactRange in LITE (#4596) 2018-10-29 14:36:31 -07:00
db_iter_stress_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
db_iter_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
db_iter.cc Fix blob context when db_iter uses seek (#6051) 2019-11-19 11:39:02 -08:00
db_iter.h Apply formatter on recent 45 commits. (#5827) 2019-09-19 12:34:17 -07:00
db_iterator_test.cc Fix data block upper bound checking for iterator reseek case (#5883) 2019-10-03 20:53:29 -07:00
db_log_iter_test.cc Apply modernize-use-override (2nd iteration) 2019-02-14 14:41:36 -08:00
db_memtable_test.cc upgrade gtest 1.7.0 => 1.8.1 for json result writing 2019-09-09 11:24:11 -07:00
db_merge_operand_test.cc New API to get all merge operands for a Key (#5604) 2019-08-06 14:26:44 -07:00
db_merge_operator_test.cc New API to get all merge operands for a Key (#5604) 2019-08-06 14:26:44 -07:00
db_options_test.cc Make FIFO compaction take default 30 days TTL by default (#5987) 2019-10-31 11:13:12 -07:00
db_properties_test.cc Refactor/consolidate legacy Bloom implementation details (#5784) 2019-09-16 16:17:09 -07:00
db_range_del_test.cc Add test showing range tombstones can create excessively large compactions (#5956) 2019-10-24 11:08:44 -07:00
db_sst_test.cc Fix bugs in WAL trash file handling (#5520) 2019-07-06 21:07:32 -07:00
db_statistics_test.cc Make statistics's stats_level change thread-safe (#5030) 2019-03-01 10:42:09 -08:00
db_table_properties_test.cc upgrade gtest 1.7.0 => 1.8.1 for json result writing 2019-09-09 11:24:11 -07:00
db_tailing_iter_test.cc Remove managed iterator 2018-07-17 14:43:18 -07:00
db_test2.cc Fix a regression bug on total order seek with prefix enabled and range delete (#6028) 2019-11-13 10:11:34 -08:00
db_test_util.cc Batched MultiGet API for multiple column families (#5816) 2019-11-12 13:52:55 -08:00
db_test_util.h Batched MultiGet API for multiple column families (#5816) 2019-11-12 13:52:55 -08:00
db_test.cc Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011) 2019-11-07 14:04:16 -08:00
db_universal_compaction_test.cc Turn on periodic compaction in universal by default if compaction filter is used. (#5994) 2019-11-07 10:58:10 -08:00
db_wal_test.cc Adding DB::GetCurrentWalFile() API as a repliction/backup helper (#5765) 2019-09-04 12:10:17 -07:00
db_write_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
dbformat_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
dbformat.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
dbformat.h Use delete to disable automatic generated methods. (#5009) 2019-09-11 18:09:00 -07:00
deletefile_test.cc Refactor deletefile_test.cc (#5822) 2019-09-18 16:58:21 -07:00
error_handler_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
error_handler.cc Make format 2019-05-31 15:24:43 -07:00
error_handler.h Fix typos in comments (#4456) 2018-10-04 20:46:50 -07:00
event_helpers.cc BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
event_helpers.h BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
experimental.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
external_sst_file_basic_test.cc Allow ingesting overlapping files (#5539) 2019-09-13 14:49:47 -07:00
external_sst_file_ingestion_job.cc BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
external_sst_file_ingestion_job.h Allow ingesting overlapping files (#5539) 2019-09-13 14:49:47 -07:00
external_sst_file_test.cc Fix IngestExternalFile's bug with two_write_queue (#5976) 2019-11-15 14:00:37 -08:00
fault_injection_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
file_indexer_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
file_indexer.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
file_indexer.h Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
filename_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
flush_job_test.cc Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908) 2019-10-16 10:40:23 -07:00
flush_job.cc Use aggregate initialization for FlushJobInfo/CompactionJobInfo (#5997) 2019-11-01 11:46:19 -07:00
flush_job.h Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908) 2019-10-16 10:40:23 -07:00
flush_scheduler.cc Refactor trimming logic for immutable memtables (#5022) 2019-08-23 13:55:34 -07:00
flush_scheduler.h Refactor trimming logic for immutable memtables (#5022) 2019-08-23 13:55:34 -07:00
forward_iterator_bench.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
forward_iterator.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
forward_iterator.h Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
import_column_family_job.cc BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
import_column_family_job.h Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
import_column_family_test.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
internal_stats.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
internal_stats.h Rename InternalDBStatsType enum names (#5779) 2019-09-06 17:31:10 -07:00
job_context.h WritePrepared: Fix visible key compacted out by compaction (#4883) 2019-01-15 21:34:38 -08:00
listener_test.cc Propagate SST and blob file numbers through the EventListener interface (#5962) 2019-10-24 14:44:15 -07:00
log_format.h Fix an inaccurate comment (#4315) 2018-08-24 18:13:20 -07:00
log_reader.cc Divide file_reader_writer.h and .cc (#5803) 2019-09-16 10:33:51 -07:00
log_reader.h Divide file_reader_writer.h and .cc (#5803) 2019-09-16 10:33:51 -07:00
log_test.cc Divide file_reader_writer.h and .cc (#5803) 2019-09-16 10:33:51 -07:00
log_writer.cc Divide file_reader_writer.h and .cc (#5803) 2019-09-16 10:33:51 -07:00
log_writer.h Use delete to disable automatic generated methods. (#5009) 2019-09-11 18:09:00 -07:00
logs_with_prep_tracker.cc Skip deleted WALs during recovery 2018-05-03 15:43:09 -07:00
logs_with_prep_tracker.h Skip deleted WALs during recovery 2018-05-03 15:43:09 -07:00
lookup_key.h Avoid user key copying for Get/Put/Write with user-timestamp (#5502) 2019-07-25 15:27:39 -07:00
malloc_stats.cc Support jemalloc compiled with --with-jemalloc-prefix (#5521) 2019-07-02 12:07:01 -07:00
malloc_stats.h Change RocksDB License 2017-07-15 16:11:23 -07:00
manual_compaction_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
memtable_list_test.cc Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908) 2019-10-16 10:40:23 -07:00
memtable_list.cc bugfix: MemTableList::RemoveOldMemTables invalid iterator after remov… (#6013) 2019-11-11 15:57:38 -08:00
memtable_list.h Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908) 2019-10-16 10:40:23 -07:00
memtable.cc Misc hashing updates / upgrades (#5909) 2019-10-24 17:16:46 -07:00
memtable.h Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908) 2019-10-16 10:40:23 -07:00
merge_context.h Introduce a new MultiGet batching implementation (#5011) 2019-04-11 14:28:26 -07:00
merge_helper_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
merge_helper.cc Fix merging range tombstone covering put during flush/compaction (#5406) 2019-06-04 10:24:14 -07:00
merge_helper.h Remove v1 RangeDelAggregator (#4778) 2018-12-17 17:33:46 -08:00
merge_operator.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
merge_test.cc Make format 2019-05-31 15:24:43 -07:00
obsolete_files_test.cc Refactor ObsoleteFilesTest to inherit from DBTestBase (#5820) 2019-09-18 11:52:17 -07:00
options_file_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
perf_context_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
pinned_iterators_manager.h Change RocksDB License 2017-07-15 16:11:23 -07:00
plain_table_db_test.cc Fix memory leak on error opening PlainTable (#5951) 2019-10-21 16:53:06 -07:00
pre_release_callback.h WritePrepared: reduce prepared_mutex_ overhead (#5420) 2019-06-10 11:53:31 -07:00
prefix_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
range_del_aggregator_bench.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
range_del_aggregator_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
range_del_aggregator.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
range_del_aggregator.h Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
range_tombstone_fragmenter_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
range_tombstone_fragmenter.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
range_tombstone_fragmenter.h Initialized pinned_pos_ and pinned_seq_pos_ in FragmentedRangeTombstoneIterator (#5720) 2019-09-05 17:30:29 -07:00
read_callback.h WriteUnPrepared: improve read your own write functionality (#5573) 2019-07-23 08:08:19 -07:00
repair_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
repair.cc BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
snapshot_checker.h WritePrepared: fix issue with snapshot released during compaction (#4858) 2019-01-16 09:55:32 -08:00
snapshot_impl.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
snapshot_impl.h Refresh snapshot list during long compactions (2nd attempt) (#5278) 2019-05-03 17:30:22 -07:00
table_cache.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
table_cache.h Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
table_properties_collector_test.cc Divide file_reader_writer.h and .cc (#5803) 2019-09-16 10:33:51 -07:00
table_properties_collector.cc Feature for sampling and reporting compressibility (#4842) 2019-03-18 12:15:34 -07:00
table_properties_collector.h Feature for sampling and reporting compressibility (#4842) 2019-03-18 12:15:34 -07:00
transaction_log_impl.cc Divide file_reader_writer.h and .cc (#5803) 2019-09-16 10:33:51 -07:00
transaction_log_impl.h reuse scratch buffer in transaction_log_reader (#5702) 2019-08-26 11:26:29 -07:00
trim_history_scheduler.cc Refactor trimming logic for immutable memtables (#5022) 2019-08-23 13:55:34 -07:00
trim_history_scheduler.h Refactor trimming logic for immutable memtables (#5022) 2019-08-23 13:55:34 -07:00
version_builder_test.cc BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
version_builder.cc save a few redundant container lookups (#5875) 2019-10-07 12:28:09 -07:00
version_builder.h Lower the risk for users to run options.force_consistency_checks = true (#5744) 2019-08-29 14:07:37 -07:00
version_edit_test.cc BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
version_edit.cc BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
version_edit.h BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
version_set_test.cc BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) 2019-10-14 15:21:01 -07:00
version_set.cc Fix a regression bug on total order seek with prefix enabled and range delete (#6028) 2019-11-13 10:11:34 -08:00
version_set.h Support periodic compaction in universal compaction (#5970) 2019-10-31 11:31:37 -07:00
wal_manager_test.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
wal_manager.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
wal_manager.h Adding DB::GetCurrentWalFile() API as a repliction/backup helper (#5765) 2019-09-04 12:10:17 -07:00
write_batch_base.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
write_batch_internal.h Add insert hints for each writebatch (#5728) 2019-09-12 17:15:18 -07:00
write_batch_test.cc upgrade gtest 1.7.0 => 1.8.1 for json result writing 2019-09-09 11:24:11 -07:00
write_batch.cc Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
write_callback_test.cc WritePrepared: reduce prepared_mutex_ overhead (#5420) 2019-06-10 11:53:31 -07:00
write_callback.h Change RocksDB License 2017-07-15 16:11:23 -07:00
write_controller_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
write_controller.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
write_controller.h Change RocksDB License 2017-07-15 16:11:23 -07:00
write_thread.cc Option to make write group size configurable (#5759) 2019-09-11 18:28:33 -07:00
write_thread.h Option to make write group size configurable (#5759) 2019-09-11 18:28:33 -07:00