A library that provides an embeddable, persistent key-value store for fast storage.
Yanqin Jin e5451b30db Fix a silent data loss for write-committed txn (#9571)
Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time    thread 1                                       bg flush
 |   db->Put("a")
 |   txn = NewTxn()
 |   txn->Put("b", "v")
 |   txn->Prepare()       // writes only to 5.log
 |   db->SwitchMemtable() // memtable 1 has "a"
 |                        // close 5.log,
 |                        // creates 8.log
 |   trigger flush
 |                                                  pick memtable 1
 |                                                  unlock db mutex
 |                                                  write new sst
 |   txn->ctwb->Put("gtid", "1") // writes 8.log
 |   txn->Commit() // writes to 8.log
 |                 // writes to memtable 2
 |                                               compute min_log_number_to_keep_2pc, this
 |                                               will be 8 (incorrect).
 |
 |                                             Purge obsolete wals, including 5.log
 |
 V
```

At this point, the writes of txn exist only in the memtable. Suppose the db is closed without a flush, because
the db thinks the data in the memtable are backed by a log. After reopening, the writes are lost except for the
key-value pair {"gtid"->"1"}: only the commit marker of txn is in 8.log.

The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when the bg flush thread tries to find obsolete WALs, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components:
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section whose txn has not yet committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.
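
As a rough illustration of how the three components above could combine, here is a minimal sketch that takes the smallest WAL number among them. The `WalInputs` struct and `MinLogNumberToKeep()` are hypothetical names for this sketch, not RocksDB APIs, and the convention that 0 means "no such WAL" for the two 2PC-related inputs is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the three components listed above.
// Assumption: 0 means "no such WAL" for the two 2PC-related inputs.
struct WalInputs {
  uint64_t non_2pc_min;           // WAL that has unflushed data, ignoring 2PC
  uint64_t outstanding_prep_min;  // WAL with a prepare section, txn not committed
  uint64_t memtable_prep_min;     // WAL whose prepare section backs a memtable
};

// Returns the smallest WAL number that still has to be kept.
uint64_t MinLogNumberToKeep(const WalInputs& in) {
  assert(in.non_2pc_min != 0);  // this sketch assumes a live WAL always exists
  uint64_t result = in.non_2pc_min;
  if (in.outstanding_prep_min != 0 && in.outstanding_prep_min < result) {
    result = in.outstanding_prep_min;
  }
  if (in.memtable_prep_min != 0 && in.memtable_prep_min < result) {
    result = in.memtable_prep_min;
  }
  return result;
}
```

With the numbers from the timeline, a non-2PC value of 8 and a memtable-referenced prepare log of 5 yield 5, so 5.log is kept. If the third component is missing (0), the result is 8 and 5.log looks obsolete.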

The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skipped checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be skipped. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.
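
The effect of skipping a column family that is being flushed can be shown with a toy model. This is only a sketch: `ToyColumnFamily`, `FindMinPrepLog()`, and the `skip_flushing` flag are hypothetical and do not mirror the actual RocksDB code or the exact shape of the fix. The sketch again assumes that 0 means "no prepare section referenced".

```cpp
#include <cstdint>
#include <vector>

// Hypothetical toy model of a column family (not a RocksDB type).
struct ToyColumnFamily {
  bool being_flushed;
  // One entry per memtable that still holds unflushed data: the WAL whose
  // prepare section backs it, or 0 if it references none (assumption).
  std::vector<uint64_t> memtable_prep_logs;
};

// Returns the smallest prepare-section WAL referenced by any memtable,
// or 0 if there is none. With skip_flushing == true, column families that
// are being flushed are ignored, which models the original behavior.
uint64_t FindMinPrepLog(const std::vector<ToyColumnFamily>& cfs,
                        bool skip_flushing) {
  uint64_t min_log = 0;
  for (const auto& cf : cfs) {
    if (skip_flushing && cf.being_flushed) {
      continue;  // the memtables of this column family are never inspected
    }
    for (uint64_t log : cf.memtable_prep_logs) {
      if (log != 0 && (min_log == 0 || log < min_log)) {
        min_log = log;
      }
    }
  }
  return min_log;
}
```

For a single default column family that is being flushed and has one memtable backed by the prepare section in 5.log, the skipping variant reports 0 and the non-skipping variant reports 5.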

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571

Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```

Reviewed By: siying

Differential Revision: D34235236

Pulled By: riversand963

fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
2022-02-17 15:37:59 -08:00
.circleci Remove pyenv installation and use deps from S3 (#9406) 2022-01-21 09:33:24 -08:00
.github/workflows Add (& fix) some simple source code checks (#8821) 2021-09-07 21:19:27 -07:00
buckifier Update TARGETS and related scripts (#9310) 2021-12-17 11:51:51 -08:00
build_tools Use fcntl(F_FULLFSYNC) on OS X (#9356) 2022-01-18 20:23:11 -08:00
cache Fix unity build with SUPPORT_CLOCK_CACHE (#9309) 2021-12-17 14:15:07 -08:00
cmake gcc-11 and cmake related cleanup (#9286) 2021-12-17 17:04:35 -08:00
coverage Remove asan_symbolize.py for internal asan build (#8737) 2021-09-07 15:39:11 -07:00
db Fix a silent data loss for write-committed txn (#9571) 2022-02-17 15:37:59 -08:00
db_stress_tool Test correctness with WAL disabled in non-txn blackbox crash tests (#9338) 2022-01-05 16:23:37 -08:00
docs New blog post for Ribbon filter (#8992) 2021-12-28 21:54:39 -08:00
env Add to HISTORY and minor loose ends from #9294, #9254 (#9386) 2022-01-21 13:04:19 -08:00
examples Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
file Fix a bug causing duplicate trailing entries in WritableFile (buffered IO) (#9236) 2021-12-13 09:00:36 -08:00
fuzz Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
hdfs Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
include/rocksdb Update version to 6.29.1 2022-02-15 19:52:00 -08:00
java Add support for Apple Silicon to RocksJava (#9254) 2022-01-12 17:20:58 -08:00
logging Use system-wide thread ID in info log lines (#9164) 2021-11-12 19:46:06 -08:00
memory Fix compilation error when building static_lib (#9377) 2022-01-12 09:04:01 -08:00
memtable Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
microbench Skip directory fsync for filesystem btrfs (#8903) 2021-11-03 12:21:27 -07:00
monitoring Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362) 2022-01-11 06:33:48 -08:00
options Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363) 2022-01-18 17:31:03 -08:00
plugin Add initial CMake support to plugin (#9214) 2021-11-30 17:16:53 -08:00
port Add to HISTORY and minor loose ends from #9294, #9254 (#9386) 2022-01-21 13:04:19 -08:00
table Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453) 2022-01-31 11:32:04 -08:00
test_util Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362) 2022-01-11 06:33:48 -08:00
third-party Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
tools Fix^2 prefix extractor testing in crash test (#9463) 2022-01-31 11:32:04 -08:00
trace_replay Added TraceOptions::preserve_write_order (#9334) 2021-12-28 15:04:26 -08:00
util Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453) 2022-01-31 11:32:04 -08:00
utilities Fix a silent data loss for write-committed txn (#9571) 2022-02-17 15:37:59 -08:00
.clang-format A script that automatically reformat affected lines 2014-01-14 12:21:24 -08:00
.gitignore gitignore cmake-build-* for CLion integration (#7933) 2021-02-19 13:43:15 -08:00
.lgtm.yml Create lgtm.yml for LGTM.com C/C++ analysis (#4058) 2018-06-26 12:43:04 -07:00
.travis.yml Re-enable 390x+cmake* Travis jobs (#9110) 2021-11-03 20:30:15 -07:00
.watchmanconfig Added .watchmanconfig file to rocksdb repo (#5593) 2019-07-19 15:00:33 -07:00
appveyor.yml Remove VS2017 from Appveyor CI (#9417) 2022-01-21 16:16:00 -08:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Use fcntl(F_FULLFSYNC) on OS X (#9356) 2022-01-18 20:23:11 -08:00
CODE_OF_CONDUCT.md Adopt Contributor Covenant 2019-08-29 23:21:01 -07:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363) 2022-01-18 17:31:03 -08:00
defs.bzl Make testpilot recognize that these tests have coverage instrumentation 2020-03-20 11:23:23 -07:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Fix a silent data loss for write-committed txn (#9571) 2022-02-17 15:37:59 -08:00
INSTALL.md Update installation instructions (#8158) 2021-04-06 16:02:04 -07:00
issue_template.md Add Google Group to Issue Template 2020-01-28 14:40:37 -08:00
LANGUAGE-BINDINGS.md Update branch name to "main" in README/LANGUAGE_BINDINGS (#8727) 2021-09-01 15:26:34 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Add support for Apple Silicon to RocksJava (#9254) 2022-01-12 17:20:58 -08:00
PLUGINS.md Add ZenFS to plugin list (#8218) 2021-04-22 11:12:40 -07:00
README.md README: De-list slack channel, list Google group (#9387) 2022-01-18 08:19:48 -08:00
ROCKSDB_LITE.md Fix some typos in comments and docs. 2018-03-08 10:27:25 -08:00
src.mk Make MemoryAllocator into a Customizable class (#8980) 2021-12-17 04:20:47 -08:00
TARGETS Update TARGETS and related scripts (#9310) 2021-12-17 11:51:51 -08:00
thirdparty.inc Fix build jemalloc api (#5470) 2019-06-24 17:40:32 -07:00
USERS.md Update USERS.md (#8923) 2021-10-01 16:10:35 -07:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md Update branch name in WINDOWS_PORT.md (#8745) 2021-09-01 19:26:39 -07:00

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage


RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.