Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
Summary:
That way we can see when this graph goes up and be happy.
Couple of changes:
1. title
2. fix db_bench to delete column families before deleting the DB. this was asserting when compiled in debug mode
3. don't sync manifest when disableDataSync. We discussed this offline. I can move it to separate diff if you'd like
Test Plan: ran it
Reviewers: sdong, yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22815
Summary: as title
Test Plan: no
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22761
Summary: also fix HISTORY.md
Test Plan: make all check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22437
Summary:
I will move compression related options in a separate diff since this
diff is already pretty lengthy.
I guess I will also need to change JNI accordingly :(
Test Plan: make all check
Reviewers: yhchiang, igor, sdong
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21915
Summary:
Add WriteBatchWithIndex so that a user can query data out of a WriteBatch, to support MongoDB's read-its-own-write.
WriteBatchWithIndex uses a skiplist to store the binary index. The index stores the offset of the entry in the write batch. When searching for a key, the key for the entry is read by read the entry from the write batch from the offset.
Define a new iterator class for querying data out of WriteBatchWithIndex. A user can create an iterator of the write batch for one column family, seek to a key and keep calling Next() to see next entries.
I will add more unit tests if people are OK about this API.
Test Plan:
make all check
Add unit tests.
Reviewers: yhchiang, igor, MarkCallaghan, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb, xjin
Differential Revision: https://reviews.facebook.net/D21381
Summary: I was bit by this when developing SpatialDB. In case all files are at level 0, CompactRange() will output the compacted files to level 0. This is not ideal, since read amp. is much better at level 1 and higher.
Test Plan: Compacted data in SpatialDB, read manifest using ldb, verified that files are now at level 1 instead of 0.
Reviewers: sdong, ljin, yhchiang, dhruba
Reviewed By: dhruba
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D20901
Summary: We have quite some properties that are integers and we are adding more. Add a function to directly return them as an integer, instead of a string
Test Plan: Add several unit test checks
Reviewers: yhchiang, igor, dhruba, haobo, ljin
Reviewed By: ljin
Subscribers: yoshinorim, leveldb
Differential Revision: https://reviews.facebook.net/D20637
Summary: Add a DB property of estimated number of live keys, by adding number of entries of all mem tables and all files, subtracted by all deletions in all files.
Test Plan: Add the case in unit tests
Reviewers: hobbymanyp, ljin
Reviewed By: ljin
Subscribers: MarkCallaghan, yoshinorim, leveldb, igor, dhruba
Differential Revision: https://reviews.facebook.net/D20631
Summary:
Make StatisticsImpl being able to forward stats to provided statistics
implementation. The main purpose is to allow us to collect internal
stats in the future even when user supplies custom statistics
implementation. It avoids intrumenting 2 sets of stats collection code.
One immediate use case is tuning advisor, which needs to collect some
internal stats, users may not be interested.
Test Plan:
ran db_bench and see stats show up at the end of run
Will run make all check since some tests rely on statistics
Reviewers: yhchiang, sdong, igor
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D20145
Summary:
All public headers need to be under `include/rocksdb` directory. Otherwise, clients include our header files like this:
#include <rocksdb/db.h>
#include <utilities/backupable_db.h> // still our public header!
Also, internally, we include:
#include "utilities/backupable/backupable_db.h" // internal header
#include "utilities/backupable_db.h" // public header
which is confusing.
This way, when we install rocksdb as a system library, we can just copy `include/rocksdb` directory to system's header files. We can't really copy `utilities` directory to system's header files.
Test Plan: compiles
Reviewers: dhruba, ljin, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20409
Summary: Updating HISTORY.md to reflect changes in PlainTable index
Test Plan: no need
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20229
Summary:
Since we have a lot of options for PlainTable, add a struct PlainTableOptions
to manage them
Test Plan: make all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20175
Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.
Test Plan: Add new unit tests
Reviewers: ljin, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D19869
Summary:
Seems like NewTotalOrderPlainTableFactory is useless and is semantically incorrect.
Total order mode indicator is prefix_extractor == nullptr,
but NewTotalOrderPlainTableFactory doesn't set it to be nullptr. That's why some tests
in plain_table_db_tests is incorrect.
Test Plan: make all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19587
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.
Test Plan: db_test
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19425
Summary: Update HISTORY.md to include TimeOut write API and compaction update.
Test Plan: n/a
Reviewers: ljin, sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19449
Summary:
In this patch, we enhance HashLinkList memtable to reduce performance outliers when a bucket contains too many entries. We switch to skip list for this case to enable binary search.
Add threshold_use_skiplist parameter to determine when a bucket needs to switch to skip list.
The new data structure is documented in comments in the codes.
Test Plan:
make all check
set threshold_use_skiplist in several tests
Reviewers: yhchiang, haobo, ljin
Reviewed By: yhchiang, ljin
Subscribers: nkg-, xjin, dhruba, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D19299
Summary:
Add one entry to HISTORY.md that is failed to be added.
Move two other ones to the right location.
Test Plan: Only document
Reviewers: haobo, ljin
Subscribers: dhruba, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D19233
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.
This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.
There is one test case that relied on didIO information. I'll try to find another way to implement it.
Test Plan: make check
Reviewers: sdong, haobo, yhchiang, ljin, dhruba
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19161
Summary: Add two parameters of hash linked list to log distribution of number of entries across all buckets, and a sample row when there are too many entries in one single bucket.
Test Plan: Turn it on in plain_table_db_test and see the logs.
Reviewers: haobo, ljin
Reviewed By: ljin
Subscribers: leveldb, nkg-, dhruba, yhchiang
Differential Revision: https://reviews.facebook.net/D19095
Summary: The new table factory is used if users want to convert a DB from one table format to the other. A user can use this table to open a DB written using one table format and write new files to another table format.
Test Plan: add a unit test
Reviewers: haobo, igor
Reviewed By: igor
Subscribers: dhruba, ljin, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D19017
Summary:
Introducing new compaction style -- FIFO.
FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.
FIFO compaction style is suited for storing high-frequency event logs.
Test Plan: Added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
Subscribers: alberts, leveldb
Differential Revision: https://reviews.facebook.net/D18765
Summary:
Materialize the hash index to avoid the soaring cpu/flash usage
when initializing the database.
Test Plan: existing unit tests passed
Reviewers: sdong, haobo
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18339
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.
Here's description of task #4296714:
I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.
For example, this table has 3155 entries and 1391828 deleted keys.
The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.
Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.
In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.
Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests
Reviewers: sdong, dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18579
Summary:
TLB page allocation errors are now logged to info logs, instead of stderr.
In order to do that, mem table rep's factory functions take a info logger now.
Test Plan: make all check
Reviewers: haobo, igor, yhchiang
Reviewed By: yhchiang
CC: leveldb, yhchiang, dhruba
Differential Revision: https://reviews.facebook.net/D18471
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D17805
Summary: XCode for some reason injects `#define DEBUG 1` into our code, which makes compile fail because we use `DEBUG` keyword for other stuff. This diff fixes the issue by renaming `DEBUG` to `DEBUG_LEVEL`.
Test Plan: compiles
Reviewers: dhruba, haobo, sdong, yhchiang, ljin
Reviewed By: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17709
Summary: Update HISTORY.md to make existing items to 2.8 release and add something I think is missing.
Test Plan: N/A
Reviewers: haobo, igor, ljin, dhruba, yhchiang, xjin
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17517
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.
Test Plan: make all check
Reviewers: haobo, yhchiang, igor
CC: danguo, dhruba, ljin, igor, leveldb
Differential Revision: https://reviews.facebook.net/D17223
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.
Test Plan: make check
Reviewers: haobo, ljin
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17343
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.
typedef std::vector<Slice> SliceVector;
virtual std::vector<bool> Filter(int level,
const SliceVector& keys,
const SliceVector& existing_values,
std::vector<std::string>* new_values,
std::vector<bool>* values_changed
) const = 0;
Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.
Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.
Reviewers: haobo
CCs: leveldb
Differential Revision: https://reviews.facebook.net/D15087
Summary:
Everytime a client opens a DB, we do a sanity check that:
* checks the existance of all the necessary files
* verifies that file sizes are correct
Some of the code was stolen from https://reviews.facebook.net/D16935
Test Plan: added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17097
Summary: Added a function/command to check the consistency of live files' meta data
Test Plan:
Manual test (size mismatch, file not exist).
Command test script.
Reviewers: haobo
Reviewed By: haobo
CC: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D16935
Summary:
I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup.
Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch.
This diff also fixes race condition when dropping column family.
Test Plan: Ran db_stress with variety of parameters
Reviewers: dhruba, haobo
Differential Revision: https://reviews.facebook.net/D16833
Summary: Add a function to Env so that users can query the waiting queue length of each thread pool
Test Plan: add a test in env_test
Reviewers: haobo
Reviewed By: haobo
CC: dhruba, igor, yhchiang, ljin, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D16755
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.
Test Plan: db_bench && make asan_check
Reviewers: haobo, igor, sdong
Reviewed By: igor
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D16587
Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.
Test Plan: corruption_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16695
Summary:
(1) Report corruption if backup meta file has tailing data that was not read. This should fix: https://github.com/facebook/rocksdb/issues/81 (also, @sdong reported similar issue)
(2) Don't use OS buffer when copying file to backup directory. We don't need the file in cache since we won't be reading it twice
(3) Don't delete newer backups when somebody tries to backup the diverged DB (restore from older backup, add new data, try to backup). Rather, just fail the new backup.
Test Plan: backupable_db_test
Reviewers: ljin, dhruba, sdong
Reviewed By: ljin
CC: leveldb, sdong
Differential Revision: https://reviews.facebook.net/D16287
Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files
(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.
Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16119
Summary:
* Add ColumnFamilyHandle::GetID() function. Client needs to know column family's ID to be able to construct WriteBatch
* Handle WriteBatch::Handler failure gracefully. Since WriteBatch is not a very smart function (it takes raw CF id), client can add data to WriteBatch for column family that doesn't exist. In that case, we need to gracefully return failure status from DB::Write(). To do that, I added a return Status to WriteBatch functions PutCF, DeleteCF and MergeCF.
Test Plan: Added test to column_family_test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16323
Summary:
This is not a generic thread local implementation in the sense that it
only takes pointer. But it does support multiple instances per thread
and lets user plugin function to perform cleanup when thread exits or an
instance gets destroyed.
Test Plan: unit test for now
Reviewers: haobo, igor, sdong, dhruba
Reviewed By: igor
CC: leveldb, kailiu
Differential Revision: https://reviews.facebook.net/D16131
Summary: Added a bit more information to compaction context, requested by internal team at FB.
Test Plan: Modified CompactionFilter test to make sure is_manual_compaction is properly set.
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16095
Summary:
This is not column-family related diff. It is in columnfamily branch because the change is significant and we want to push it with next major release (3.0).
It removes the leveldb notion of one thread pool and expands it to two thread pools by default (HIGH and LOW). Flush process is removed from compaction process and all flush threads are executed on HIGH thread pool, since we don't want long-running compactions to influence flush latency.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15987
Summary:
Easy thing goes first. This patch moves arena to internal dir; based
on which, the coming patch will deal with memtable_rep.
Test Plan: make check
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15615
Summary:
* Add a change log for rocksdb releases.
* Remove the hacky parts of make_new_version.sh, which are either
no longer useful or will be done in our dedicated 3rd-party release
tool.
Test Plan: N/A
Reviewers: igor, haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15543