5237b39d2e
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105 The user contract of SingleDelete is that: a SingleDelete can only be issued to a key that exists and has NOT been updated. For example, application can insert one key `key`, and uses a SingleDelete to delete it in the future. The `key` cannot be updated or removed using Delete. In reality, especially when write-prepared transaction is being used, things can get tricky. For example, a prepared transaction already writes `key` to the memtable after a successful Prepare(). Afterwards, should the transaction rollback, it will insert a Delete into the memtable to cancel out the prior Put. Consider the following sequence of operations. ``` // operation sequence 1 Begin txn Put(key) Prepare() Flush() Rollback txn Flush() ``` There will be two SSTs resulting from above. One of the contains a PUT, while the second one contains a Delete. It is also known that releasing a snapshot can lead to an L0 containing only a SD for a particular key. Consider the following operations following the above block. ``` // operation sequence 2 db->Put(key) db->SingleDelete(key) Flush() ``` The operation sequence 2 can result in an L0 with only the SD. Should there be a snapshot for conflict checking created before operation sequence 1, then an attempt to compact the db may hit the assertion failure below, because ikey_.type is Delete (from a rollback). ``` else if (clear_and_output_next_key_) { assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex); } ``` To fix the assertion failure, we can skip the SingleDelete if we detect an earlier Delete in the same snapshot interval. Reviewed By: ltamasi Differential Revision: D32056848 fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748 |
||
---|---|---|
.. | ||
backupable | ||
blob_db | ||
cassandra | ||
checkpoint | ||
compaction_filters | ||
convenience | ||
leveldb_options | ||
memory | ||
merge_operators | ||
option_change_migration | ||
options | ||
persistent_cache | ||
simulator_cache | ||
table_properties_collectors | ||
trace | ||
transactions | ||
ttl | ||
write_batch_with_index | ||
cache_dump_load_impl.cc | ||
cache_dump_load_impl.h | ||
cache_dump_load.cc | ||
compaction_filters.cc | ||
debug.cc | ||
env_librados_test.cc | ||
env_librados.cc | ||
env_librados.md | ||
env_mirror_test.cc | ||
env_mirror.cc | ||
env_timed_test.cc | ||
env_timed.cc | ||
env_timed.h | ||
fault_injection_env.cc | ||
fault_injection_env.h | ||
fault_injection_fs.cc | ||
fault_injection_fs.h | ||
merge_operators.cc | ||
merge_operators.h | ||
object_registry_test.cc | ||
object_registry.cc | ||
util_merge_operators_test.cc | ||
wal_filter.cc |