Fix assertion error during compaction with write-prepared txn enabled (#9105)

Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105

The user contract of SingleDelete is that: a SingleDelete can only be issued to
a key that exists and has NOT been updated. For example, application can insert
one key `key`, and uses a SingleDelete to delete it in the future. The `key`
cannot be updated or removed using Delete.
In reality, especially when write-prepared transaction is being used, things
can get tricky. For example, a prepared transaction already writes `key` to the
memtable after a successful Prepare(). Afterwards, should the transaction
rollback, it will insert a Delete into the memtable to cancel out the prior
Put. Consider the following sequence of operations.

```
// operation sequence 1
Begin txn
Put(key)
Prepare()
Flush()

Rollback txn
Flush()
```

There will be two SSTs resulting from above. One of the contains a PUT, while
the second one contains a Delete. It is also known that releasing a snapshot
can lead to an L0 containing only a SD for a particular key. Consider the
following operations following the above block.

```
// operation sequence 2
db->Put(key)
db->SingleDelete(key)
Flush()
```

The operation sequence 2 can result in an L0 with only the SD.

Should there be a snapshot for conflict checking created before operation
sequence 1, then an attempt to compact the db may hit the assertion failure
below, because ikey_.type is Delete (from a rollback).

```
else if (clear_and_output_next_key_) {
  assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex);
}
```

To fix the assertion failure, we can skip the SingleDelete if we detect an
earlier Delete in the same snapshot interval.

Reviewed By: ltamasi

Differential Revision: D32056848

fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748
This commit is contained in:
Yanqin Jin 2021-11-05 15:28:20 -07:00 committed by Facebook GitHub Bot
parent 7d74d3471c
commit 5237b39d2e
3 changed files with 89 additions and 4 deletions

View File

@ -11,6 +11,7 @@
* EventListener::OnTableFileCreated was previously called with OK status and file_size==0 in cases of no SST file contents written (because there was no content to add) and the empty file deleted before calling the listener. Now the status is Aborted.
* Fixed a bug in CompactionIterator when write-preared transaction is used. Releasing earliest_snapshot during compaction may cause a SingleDelete to be output after a PUT of the same user key whose seq has been zeroed.
* Added input sanitization on negative bytes passed into `GenericRateLimiter::Request`.
* Fixed an assertion failure in CompactionIterator when write-prepared transaction is used. We prove that certain operations can lead to a Delete being followed by a SingleDelete (same user key). We can drop the SingleDelete.
### Behavior Changes
* `NUM_FILES_IN_SINGLE_COMPACTION` was only counting the first input level files, now it's including all input files.

View File

@ -608,11 +608,19 @@ void CompactionIterator::NextFromInput() {
TEST_SYNC_POINT_CALLBACK(
"CompactionIterator::NextFromInput:SingleDelete:2", nullptr);
if (next_ikey.type == kTypeSingleDeletion) {
if (next_ikey.type == kTypeSingleDeletion ||
next_ikey.type == kTypeDeletion) {
// We encountered two SingleDeletes for same key in a row. This
// could be due to unexpected user input. Skip the first
// SingleDelete and let the next iteration decide how to handle the
// second SingleDelete
// could be due to unexpected user input. If write-(un)prepared
// transaction is used, this could also be due to releasing an old
// snapshot between a Put and its matching SingleDelete.
// Furthermore, if write-(un)prepared transaction is rolled back
// after prepare, we will write a Delete to cancel a prior Put. If
// old snapshot is released between a later Put and its matching
// SingleDelete, we will end up with a Delete followed by
// SingleDelete.
// Skip the first SingleDelete and let the next iteration decide
// how to handle the second SingleDelete or Delete.
// First SingleDelete has been skipped since we already called
// input_.Next().

View File

@ -3199,6 +3199,82 @@ TEST_P(WritePreparedTransactionTest, ReleaseEarliestSnapshotAfterSeqZeroing2) {
SyncPoint::GetInstance()->ClearAllCallBacks();
}
// Although the user-contract indicates that a SD can only be issued for a key
// that exists and has not been overwritten, it is still possible for a Delete
// to be present when write-prepared transaction is rolled back.
TEST_P(WritePreparedTransactionTest, SingleDeleteAfterRollback) {
constexpr size_t kSnapshotCacheBits = 7; // same as default
constexpr size_t kCommitCacheBits = 0; // minimum commit cache
UpdateTransactionDBOptions(kSnapshotCacheBits, kCommitCacheBits);
options.disable_auto_compactions = true;
ASSERT_OK(ReOpen());
// Get a write conflict snapshot by creating a transaction with
// set_snapshot=true.
TransactionOptions txn_opts;
txn_opts.set_snapshot = true;
std::unique_ptr<Transaction> dummy_txn(
db->BeginTransaction(WriteOptions(), txn_opts));
std::unique_ptr<Transaction> txn0(
db->BeginTransaction(WriteOptions(), TransactionOptions()));
ASSERT_OK(txn0->Put("foo", "value"));
ASSERT_OK(txn0->SetName("xid0"));
ASSERT_OK(txn0->Prepare());
// Create an SST with only {"foo": "value"}.
ASSERT_OK(db->Flush(FlushOptions()));
// Insert a Delete to cancel out the prior Put by txn0.
ASSERT_OK(txn0->Rollback());
txn0.reset();
// Create a second SST.
ASSERT_OK(db->Flush(FlushOptions()));
ASSERT_OK(db->Put(WriteOptions(), "foo", "value1"));
auto* snapshot = db->GetSnapshot();
ASSERT_OK(db->SingleDelete(WriteOptions(), "foo"));
int count = 0;
SyncPoint::GetInstance()->DisableProcessing();
SyncPoint::GetInstance()->ClearAllCallBacks();
SyncPoint::GetInstance()->SetCallBack(
"CompactionIterator::NextFromInput:SingleDelete:1", [&](void* arg) {
const auto* const c = reinterpret_cast<const Compaction*>(arg);
assert(!c);
// Trigger once only for SingleDelete during flush.
if (0 == count) {
++count;
db->ReleaseSnapshot(snapshot);
// Bump max_evicted_seq
ASSERT_OK(db->Put(WriteOptions(), "x", "dontcare"));
}
});
SyncPoint::GetInstance()->EnableProcessing();
// Create a third SST containing a SD without its matching PUT.
ASSERT_OK(db->Flush(FlushOptions()));
SyncPoint::GetInstance()->DisableProcessing();
SyncPoint::GetInstance()->ClearAllCallBacks();
SyncPoint::GetInstance()->EnableProcessing();
DBImpl* dbimpl = static_cast_with_check<DBImpl>(db->GetRootDB());
assert(dbimpl);
ASSERT_OK(dbimpl->TEST_CompactRange(
/*level=*/0, /*begin=*/nullptr, /*end=*/nullptr,
/*column_family=*/nullptr, /*disallow_trivial_mode=*/true));
SyncPoint::GetInstance()->DisableProcessing();
SyncPoint::GetInstance()->ClearAllCallBacks();
// Release the conflict-checking snapshot.
ASSERT_OK(dummy_txn->Rollback());
}
// A more complex test to verify compaction/flush should keep keys visible
// to snapshots.
TEST_P(WritePreparedTransactionTest,