A library that provides an embeddable, persistent key-value store for fast storage.
Go to file
cheng-chang bdb7e544bd Skip WALs according to MinLogNumberToKeep when creating checkpoint (#7789)
Summary:
In a stress test failure, we observe that a WAL is skipped when creating checkpoint, although its log number >= MinLogNumberToKeep(). This might happen in the following case:

1. when creating the checkpoint, there are 2 column families: CF0 and CF1, and there are 2 WALs: 1, 2;
2. CF0's log number is 1, CF0's active memtable is empty, CF1's log number is 2, CF1's active memtable is not empty, WAL 2 is not empty, the sequence number points to WAL 2;
2. the checkpoint process flushes CF0, since CF0' active memtable is empty, there is no need to SwitchMemtable, thus no new WAL will be created, so CF0's log number is now 2, concurrently, some data is written to CF0 and WAL 2;
3. the checkpoint process flushes CF1, WAL 3 is created and CF1's log number is now 3, CF0's log number is still 2 because CF0 is not empty and WAL 2 contains its unflushed data concurrently written in step 2;
4.  the checkpoint process determines that WAL 1 and 2 are no longer needed according to [live_wal_files[i]->StartSequence() >= *sequence_number](https://github.com/facebook/rocksdb/blob/master/utilities/checkpoint/checkpoint_impl.cc#L388), so it skips linking them to the checkpoint directory;
5. but according to `MinLogNumberToKeep()`, WAL 2 still needs to be kept because CF0's log number is 2.

If the checkpoint is reopened in read-only mode, and only read from the snapshot with the initial sequence number, then there will be no data loss or data inconsistency.

But if the checkpoint is reopened and read from the most recent sequence number, suppose in step 3, there are also data concurrently written to CF1 and WAL 3, then the most recent sequence number refers to the latest entry in WAL 3, so the data written in step 2 should also be visible, but since WAL 2 is discarded, those data are lost.

When tracking WAL in MANIFEST is enabled, when reopening the checkpoint, since WAL 2 is still tracked in MANIFEST as alive, but it's missing from the checkpoint directory, a corruption will be reported.

This PR makes the checkpoint process to only skip a WAL if its log number < `MinLogNumberToKeep`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7789

Test Plan: watch existing tests to pass.

Reviewed By: ajkr

Differential Revision: D25662346

Pulled By: cheng-chang

fbshipit-source-id: 136471095baa01886cf44809455cf855f24857a0
2020-12-23 11:33:26 -08:00
.circleci Migrate away from Travis+Linux+amd64 (#7791) 2020-12-22 00:20:57 -08:00
.github/workflows Update clang-format-diff.py (#7609) 2020-11-04 16:09:01 -08:00
buckifier Fix use of positional args in BUCK rules (#7760) 2020-12-09 19:25:31 -08:00
build_tools Migrate away from Travis+Linux+amd64 (#7791) 2020-12-22 00:20:57 -08:00
cache Fix typos in comments (#7687) 2020-11-19 13:32:50 -08:00
cmake Add find_dependency() in cmake config file. (#6791) 2020-05-12 21:18:29 -07:00
coverage Find the correct gcov (#6904) 2020-06-01 16:33:05 -07:00
db Remove flaky, redundant, and dubious DBTest.SparseMerge (#7800) 2020-12-23 11:08:12 -08:00
db_stress_tool Inject the random write error to stress test (#7653) 2020-12-17 11:52:28 -08:00
docs Update github-pages to v207 (#7235) 2020-08-12 09:26:24 -07:00
env Eliminate possible race between LockFile() vs UnlockFile() (#7721) 2020-12-10 09:35:11 -08:00
examples Bring the Configurable options together (#5753) 2020-09-14 17:01:01 -07:00
file Add more tests for assert status checked (#7524) 2020-12-22 23:45:58 -08:00
fuzz Update SstFileWriter fuzzer to iterate and check all key-value pairs (#7761) 2020-12-11 16:09:10 -08:00
hdfs fix build with 'USE_HDFS' on windows (#6950) 2020-06-12 16:21:50 -07:00
include/rocksdb Range Locking: Implementation of range locking (#7506) 2020-12-22 19:12:36 -08:00
java Fix various small build issues, Java API naming (#7776) 2020-12-18 16:12:26 -08:00
logging Remove unused includes (#7604) 2020-10-28 23:22:27 -07:00
memory slightly improve jemalloc allocator API header (#7592) 2020-10-28 13:47:12 -07:00
memtable Test for LoadLatestOptions (#7554) 2020-10-14 22:28:55 -07:00
monitoring Remove unused includes (#7604) 2020-10-28 23:22:27 -07:00
options Hack to load OPTIONS file for read_amp_bytes_per_bit (#7659) 2020-11-13 11:52:50 -08:00
port Fix various small build issues, Java API naming (#7776) 2020-12-18 16:12:26 -08:00
table Add more tests for assert status checked (#7524) 2020-12-22 23:45:58 -08:00
test_util Add further tests to ASSERT_STATUS_CHECKED (2) (#7698) 2020-12-09 21:21:16 -08:00
third-party Fix Compilation on ppc64le using Clang 11 (#7713) 2020-12-01 11:21:44 -08:00
tools Update regression_test.sh to run multireadrandom benchmark (#7802) 2020-12-23 11:26:12 -08:00
trace_replay Add tests in ASSERT_STATUS_CHECKED (#7793) 2020-12-22 10:31:13 -08:00
util Add more tests for assert status checked (#7524) 2020-12-22 23:45:58 -08:00
utilities Skip WALs according to MinLogNumberToKeep when creating checkpoint (#7789) 2020-12-23 11:33:26 -08:00
.clang-format A script that automatically reformat affected lines 2014-01-14 12:21:24 -08:00
.gitignore Fuzzing RocksDB (#7685) 2020-11-17 12:56:48 -08:00
.lgtm.yml Create lgtm.yml for LGTM.com C/C++ analysis (#4058) 2018-06-26 12:43:04 -07:00
.travis.yml Migrate away from Travis+Linux+amd64 (#7791) 2020-12-22 00:20:57 -08:00
.watchmanconfig Added .watchmanconfig file to rocksdb repo (#5593) 2019-07-19 15:00:33 -07:00
appveyor.yml Remove 2019 from appveyor (#7038) 2020-06-29 14:31:41 -07:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Range Locking: Implementation of range locking (#7506) 2020-12-22 19:12:36 -08:00
CODE_OF_CONDUCT.md Adopt Contributor Covenant 2019-08-29 23:21:01 -07:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md options.delayed_write_rate use the rate of rate_limiter by default. 2017-05-24 09:58:24 -07:00
defs.bzl Make testpilot recognize that these tests have coverage instrumentation 2020-03-20 11:23:23 -07:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Update release version to 6.16 (#7782) 2020-12-19 12:39:21 -08:00
INSTALL.md Update the version of the dependencies used by the RocksJava static build (#4761) 2018-12-18 20:25:43 -08:00
issue_template.md Add Google Group to Issue Template 2020-01-28 14:40:37 -08:00
LANGUAGE-BINDINGS.md Add RestoreDBFromLatestBackup to C API, add new C# package (#7092) 2020-07-08 11:56:41 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Add more tests for assert status checked (#7524) 2020-12-22 23:45:58 -08:00
README.md Fix the CI badge for ppc64le Jenkins (#7561) 2020-10-16 09:00:56 -07:00
ROCKSDB_LITE.md Fix some typos in comments and docs. 2018-03-08 10:27:25 -08:00
src.mk Range Locking: Implementation of range locking (#7506) 2020-12-22 19:12:36 -08:00
TARGETS Range Locking: Implementation of range locking (#7506) 2020-12-22 19:12:36 -08:00
thirdparty.inc Fix build jemalloc api (#5470) 2019-06-24 17:40:32 -07:00
USERS.md add ArangoDB to USERS.md, and fix typos in that file (#7675) 2020-11-16 18:29:51 -08:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md #5145 , rename port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output (#5152) 2019-04-04 11:38:19 -07:00

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

CircleCI Status TravisCI Status Appveyor Build status PPC64le Build Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/ and https://rocksdb.slack.com/

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.