A library that provides an embeddable, persistent key-value store for fast storage.
Mikhail Antonov 7fe3b32896 Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, that is, a snapshot of the DB changes between two points in time. One can think of it as the diff between two sequence numbers: a diff D, which can be thought of as an SST file or just a set of KVs, that can be applied to the database at sequence number S1 to bring it to its state at sequence number S2.

This feature would be useful for various distributed storage layers built on top of RocksDB, as it should help reduce the resources (time and network bandwidth) needed to recover and rebuild DB instances as replicas in the context of distributed storage.

From the API standpoint, that would look like a client app requesting an iterator between (start seqnum) and the current DB state, and reading the "diff".

This is a very early draft PR for initial review and discussion of the approach; I'm going to rework some parts and keep updating the PR.

For now, what's done here according to initial discussions:

Preserving deletes:
 - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot, it wouldn't get dropped by a compaction. This is done by adding a new param to Options (preserve deletes flag) and a new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
 - I also added a new API call for clients to be able to advance this cutoff seqnum, at or below which deletes may be dropped; I assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp <-> seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
 - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
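
The two halves of the scheme above can be sketched as follows. This is a minimal illustration, not the code in this PR: `MayDropTombstone`, `CutoffTracker`, and `Observe` are hypothetical names, `otherwise_droppable` stands in for the checks compaction already performs, and the retention window in seconds is an assumed client-side policy.

```cpp
#include <cstdint>
#include <deque>

using SequenceNumber = std::uint64_t;

// Compaction side (hypothetical helper): decide whether a tombstone may be
// dropped. `otherwise_droppable` summarizes the existing eligibility checks,
// which are not modeled here.
bool MayDropTombstone(SequenceNumber tombstone_seq, bool otherwise_droppable,
                      bool preserve_deletes, SequenceNumber cutoff_seq) {
  if (!otherwise_droppable) return false;
  // With the flag on, tombstones newer than the cutoff are kept.
  if (preserve_deletes && tombstone_seq > cutoff_seq) return false;
  return true;
}

// One (timestamp, seqnum) observation taken by the client.
struct Sample {
  std::uint64_t time_sec;
  SequenceNumber seq;
};

// Client side (hypothetical helper): remembers periodic observations of the
// latest sequence number and derives a cutoff that keeps deletes for at
// least `retention_sec` seconds.
class CutoffTracker {
 public:
  explicit CutoffTracker(std::uint64_t retention_sec)
      : retention_sec_(retention_sec) {}

  // Records the latest seqnum seen at `now_sec` and returns the newest
  // recorded seqnum that is at least `retention_sec` old (0 if none yet).
  SequenceNumber Observe(std::uint64_t now_sec, SequenceNumber latest_seq) {
    samples_.push_back({now_sec, latest_seq});
    while (!samples_.empty() &&
           samples_.front().time_sec + retention_sec_ <= now_sec) {
      cutoff_ = samples_.front().seq;
      samples_.pop_front();
    }
    return cutoff_;
  }

 private:
  std::uint64_t retention_sec_;
  SequenceNumber cutoff_ = 0;
  std::deque<Sample> samples_;
};
```

A client would call `Observe` on a timer with the value returned by GetLatestSequenceNumber() and pass the result to the new cutoff-advancing API call. Keeping the timestamp-to-seqnum bookkeeping in the client is what lets the DB store only a single cutoff value.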

Iterator changes:
 - A couple of params were added to ReadOptions to optionally allow the client to request internal keys instead of user keys (so that the client can get the latest value of a key, be it a delete marker or a put), as well as a min timestamp and min seqnum.
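
To make the intended result concrete, here is a small sketch of what reading the diff between two sequence numbers would yield. It models internal keys as plain structs and is not the DBIter code; `InternalEntry` and `DiffBetween` are hypothetical names, and treating the lower bound S1 as exclusive is an assumption taken from the S1/S2 description in the summary.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for an internal key plus its value.
struct InternalEntry {
  std::string user_key;
  std::uint64_t seq;
  bool is_delete;     // true for a delete marker (tombstone)
  std::string value;  // empty for delete markers
};

// Returns, for each user key changed in (s1, s2], its newest entry in that
// range. Delete markers are returned too, since the reader needs them to
// replay the diff on top of the state at s1.
std::vector<InternalEntry> DiffBetween(std::vector<InternalEntry> entries,
                                       std::uint64_t s1, std::uint64_t s2) {
  // Order by user key ascending, then by sequence number descending, so the
  // newest version of each key comes first.
  std::sort(entries.begin(), entries.end(),
            [](const InternalEntry& a, const InternalEntry& b) {
              if (a.user_key != b.user_key) return a.user_key < b.user_key;
              return a.seq > b.seq;
            });
  std::vector<InternalEntry> diff;
  for (const InternalEntry& e : entries) {
    if (e.seq <= s1 || e.seq > s2) continue;  // outside the requested range
    if (!diff.empty() && diff.back().user_key == e.user_key) continue;
    diff.push_back(e);
  }
  return diff;
}
```

Returning the entry type alongside the key is the reason the client asks for internal keys: with user keys only, a key deleted after S1 would simply be absent from the iteration and the diff could not express the delete.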

TableCache changes:
 - I modified the table_cache code to be able to quickly exclude SST files from the iterator's heap if creation_time on the file is less than iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading only very recent data or using FIFO compaction), but not so much for universal compaction with a more or less long iterator time span.
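
The file-level filter described above can be sketched like this. It is an illustration only, not the table_cache code: `SstFileInfo` and `FilesForIterator` are hypothetical names, and treating a creation_time of 0 as "unknown" (and therefore keeping the file) is an assumption of this sketch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for per-file metadata.
struct SstFileInfo {
  std::string name;
  std::uint64_t creation_time;  // seconds; 0 is treated as "unknown" here
};

// Returns the files that still have to be added to the iterator's heap.
// A file created before `iter_start_ts` is skipped as a whole, without
// opening it or reading any of its keys.
std::vector<SstFileInfo> FilesForIterator(const std::vector<SstFileInfo>& files,
                                          std::uint64_t iter_start_ts) {
  std::vector<SstFileInfo> kept;
  for (const SstFileInfo& f : files) {
    const bool known = f.creation_time != 0;
    if (known && f.creation_time < iter_start_ts) continue;
    kept.push_back(f);
  }
  return kept;
}
```

The saving comes from rejecting a file on metadata alone, which is why it pays off most when many files are entirely older than the requested start time.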

What's left:

 - Still looking at how best to plug that into the DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and the iter->key() call generally returns the user key. Can we add a new API to DBIter, internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store the actual seqnum there, but I do need to store the type.
Closes https://github.com/facebook/rocksdb/pull/2999

Differential Revision: D6175602

Pulled By: mikhail-antonov

fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:56:43 -07:00
buckifier rocksdb: make buildable on aarch64 2017-08-13 17:13:54 -07:00
build_tools fix lite build 2017-10-17 08:57:09 -07:00
cache Fix coverity uninitialized fields warnings in lru_cache 2017-10-27 11:26:43 -07:00
cmake CMake: Add support for CMake packages 2017-08-28 17:14:37 -07:00
coverage Fix /bin/bash shebangs 2017-08-03 15:56:46 -07:00
db Added support for differential snapshots 2017-11-01 18:56:43 -07:00
docs Blog post for 5.8 release 2017-09-28 10:14:09 -07:00
env Make writable_file_max_buffer_size dynamic 2017-10-31 13:56:35 -07:00
examples Pinnableslice examples and blog post 2017-08-24 12:26:07 -07:00
hdfs Revert "comment out unused parameters" 2017-07-21 18:26:26 -07:00
include/rocksdb Added support for differential snapshots 2017-11-01 18:56:43 -07:00
java added missing subcodes and improved error message for missing enum values 2017-10-23 16:42:07 -07:00
memtable Enable MSVC W4 with a few exceptions. Fix warnings and bugs 2017-10-19 10:57:12 -07:00
monitoring Fix unused var warnings in Release mode 2017-10-23 14:27:04 -07:00
options Added support for differential snapshots 2017-11-01 18:56:43 -07:00
port Make writable_file_max_buffer_size dynamic 2017-10-31 13:56:35 -07:00
table TableProperty::oldest_key_time defaults to 0 2017-10-27 15:00:05 -07:00
third-party Enable MSVC W4 with a few exceptions. Fix warnings and bugs 2017-10-19 10:57:12 -07:00
tools db_stress snapshot compatibility with reopens 2017-10-31 01:26:08 -07:00
util Make writable_file_max_buffer_size dynamic 2017-10-31 13:56:35 -07:00
utilities WritePrepared Txn: Optimize for recoverable state 2017-11-01 17:26:46 -07:00
.clang-format A script that automatically reformat affected lines 2014-01-14 12:21:24 -08:00
.gitignore Remove leftover references to phutil_module_cache 2017-08-23 12:12:21 -07:00
.travis.yml fix lite build 2017-10-17 08:57:09 -07:00
appveyor.yml Add -DPORTABLE=1 to MSVC CI build 2017-08-31 16:42:48 -07:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Enable cacheline_aligned_alloc() to allocate from jemalloc if enabled. 2017-10-27 13:27:12 -07:00
CONTRIBUTING.md Remove the licensing description in CONTRIBUTING.md 2017-07-16 15:57:18 -07:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md options.delayed_write_rate use the rate of rate_limiter by default. 2017-05-24 09:58:24 -07:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Added support for differential snapshots 2017-11-01 18:56:43 -07:00
INSTALL.md Default one to rocksdb:x64-windows 2017-09-28 16:12:24 -07:00
issue_template.md Add a template for issues 2017-09-29 11:41:28 -07:00
LANGUAGE-BINDINGS.md add Erlang to the list of language bindings 2017-08-28 16:43:16 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile PinnableSlice move assignment 2017-10-12 18:28:24 -07:00
README.md Appveyor badge to show master branch 2016-07-26 13:54:08 -07:00
ROCKSDB_LITE.md Optimistic Transactions 2015-05-29 14:36:35 -07:00
src.mk PinnableSlice move assignment 2017-10-12 18:28:24 -07:00
TARGETS PinnableSlice move assignment 2017-10-12 18:28:24 -07:00
thirdparty.inc Enable cacheline_aligned_alloc() to allocate from jemalloc if enabled. 2017-10-27 13:27:12 -07:00
USERS.md Add LogDevice to USERS.md 2017-09-25 15:56:40 -07:00
Vagrantfile Update Vagrant file (test internal phabricator workflow) 2016-10-28 15:39:19 -07:00
WINDOWS_PORT.md Commit both PR and internal code review changes 2015-07-07 16:58:20 -07:00

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage


RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/