A library that provides an embeddable, persistent key-value store for fast storage.
Go to file
Yanqin Jin 0bd4dcde6b CompactionIterator sees consistent view of which keys are committed (#9830)
Summary:
**This PR does not affect the functionality of `DB` and write-committed transactions.**

`CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
it is committed. In fact, the implementation of `KeyCommitted()` is as follows:

```
inline bool KeyCommitted(SequenceNumber seq) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot;
}
```

With that being said, we focus on write-prepared/write-unprepared transactions.

A few notes:
- A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
- `CompactionIterator` outputs a key as long as the key is uncommitted.

Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes
committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`.
Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone.

To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that
for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting
processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
to determine whether a key is committed or not with minor change to `KeyCommitted()`.

```
inline bool KeyCommitted(SequenceNumber sequence) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
             SnapshotCheckerResult::kInSnapshot;
}
```

As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble
for `CompactionIterator`s assertions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D35561162

Pulled By: riversand963

fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
2022-04-14 11:11:04 -07:00
.circleci Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
.github/workflows Use released clang-format instead of the one from dev branch (#9646) 2022-03-01 10:51:38 -08:00
buckifier Add db_stress to buck build (#9840) 2022-04-13 23:54:35 -07:00
build_tools Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
cache Expose CacheEntryRole and map keys for block cache stat collections (#9838) 2022-04-14 09:38:55 -07:00
cmake gcc-11 and cmake related cleanup (#9286) 2021-12-17 17:04:35 -08:00
coverage Fix commit_prereq and other targets (#9797) 2022-04-04 09:58:18 -07:00
db CompactionIterator sees consistent view of which keys are committed (#9830) 2022-04-14 11:11:04 -07:00
db_stress_tool Add WAL compression to stress tests (#9811) 2022-04-06 15:47:09 -07:00
docs Bump nokogiri from 1.13.3 to 1.13.4 in /docs (#9831) 2022-04-12 09:07:14 -07:00
env Fix segfault in FilePrefetchBuffer with async_io enabled (#9777) 2022-04-04 15:35:43 -07:00
examples Remove BlockBasedTableOptions.hash_index_allow_collision (#9454) 2022-03-01 13:58:02 -08:00
file Enable async prefetching for ReadOptions.readahead_size (#9827) 2022-04-11 13:46:57 -07:00
fuzz Fix compilation errors and add fuzzers to CircleCI (#9420) 2022-02-01 10:32:15 -08:00
include/rocksdb Expose CacheEntryRole and map keys for block cache stat collections (#9838) 2022-04-14 09:38:55 -07:00
java Remove corrupted WAL files in kPointRecoveryMode with avoid_flush_duing_recovery set true (#9634) 2022-04-11 15:39:31 -07:00
logging Use system-wide thread ID in info log lines (#9164) 2021-11-12 19:46:06 -08:00
memory Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
memtable Account memory of big memory users in BlockBasedTable in global memory limit (#9748) 2022-04-06 10:33:00 -07:00
microbench Add microbench document (#9781) 2022-03-31 17:17:44 -07:00
monitoring Remove corrupted WAL files in kPointRecoveryMode with avoid_flush_duing_recovery set true (#9634) 2022-04-11 15:39:31 -07:00
options Prevent double caching in the compressed secondary cache (#9747) 2022-04-11 13:28:33 -07:00
plugin Add initial CMake support to plugin (#9214) 2021-11-30 17:16:53 -08:00
port Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
table Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
test_util Remove BlockBasedTableOptions.hash_index_allow_collision (#9454) 2022-03-01 13:58:02 -08:00
third-party Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
tools Prevent double caching in the compressed secondary cache (#9747) 2022-04-11 13:28:33 -07:00
trace_replay Added TraceOptions::preserve_write_order (#9334) 2021-12-28 15:04:26 -08:00
util Fix minimum libzstd version that supports ZSTD_STREAMING (#9841) 2022-04-14 11:05:39 -07:00
utilities Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
.clang-format A script that automatically reformat affected lines 2014-01-14 12:21:24 -08:00
.gitignore Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
.lgtm.yml Create lgtm.yml for LGTM.com C/C++ analysis (#4058) 2018-06-26 12:43:04 -07:00
.travis.yml Fix remaining uses of "backupable" (#9792) 2022-04-05 09:52:33 -07:00
.watchmanconfig Added .watchmanconfig file to rocksdb repo (#5593) 2019-07-19 15:00:33 -07:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
CODE_OF_CONDUCT.md Adopt Contributor Covenant 2019-08-29 23:21:01 -07:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
crash_test.mk Add new determinators for multiops transactions stress test (#9708) 2022-03-23 22:29:50 -07:00
DEFAULT_OPTIONS_HISTORY.md Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363) 2022-01-18 17:31:03 -08:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Expose CacheEntryRole and map keys for block cache stat collections (#9838) 2022-04-14 09:38:55 -07:00
INSTALL.md Update supported VS versions in INSTALL.md (#9823) 2022-04-13 13:03:40 -07:00
issue_template.md Add Google Group to Issue Template 2020-01-28 14:40:37 -08:00
LANGUAGE-BINDINGS.md Update branch name to "main" in README/LANGUAGE_BINDINGS (#8727) 2021-09-01 15:26:34 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Serialize a space-hungry test (#9837) 2022-04-13 17:10:43 -07:00
PLUGINS.md Move RADOS support to separate repo (#9206) 2022-01-24 22:50:07 -08:00
python.mk crash_test Makefile refactoring, add to CircleCI (#9702) 2022-03-16 15:58:06 -07:00
README.md README: De-list slack channel, list Google group (#9387) 2022-01-18 08:19:48 -08:00
ROCKSDB_LITE.md Fix remaining uses of "backupable" (#9792) 2022-04-05 09:52:33 -07:00
src.mk Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
TARGETS Add db_stress to buck build (#9840) 2022-04-13 23:54:35 -07:00
thirdparty.inc Fix build jemalloc api (#5470) 2019-06-24 17:40:32 -07:00
USERS.md Add Solana's RocksDB use case in USERS.md (#9558) 2022-02-16 09:23:01 -08:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md Update branch name in WINDOWS_PORT.md (#8745) 2021-09-01 19:26:39 -07:00

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

CircleCI Status TravisCI Status Appveyor Build status PPC64le Build Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.