Summary:
We have addded new stats and perf_context for measuring the merge and filter operation time consumption.
We have bounded all the merge operations within the GUARD statment and collected the total time for these operations in the DB.
Test Plan: WIP
Reviewers: rven, yhchiang, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D34377
Summary:
Report elapsed time of a thread operation in micros in ThreadStatus
instead of start time of a thread operation in seconds since the
Epoch, 1970-01-01 00:00:00 (UTC).
Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
--max_background_compactions=10 --max_background_flushes=3 \
--thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
--num_column_families=10
Sample Output:
ThreadID ThreadType cfName Operation ElapsedTime Stage State
140667724562496 High Pri column_family_name_000002 Flush 772.419 ms FlushJob::WriteLevel0Table
140667728756800 High Pri default Flush 617.845 ms FlushJob::WriteLevel0Table
140667732951104 High Pri column_family_name_000005 Flush 772.078 ms FlushJob::WriteLevel0Table
140667875557440 Low Pri column_family_name_000008 Compaction 1409.216 ms CompactionJob::Install
140667737145408 Low Pri
140667749728320 Low Pri
140667816837184 Low Pri column_family_name_000007 Compaction 1071.815 ms CompactionJob::ProcessKeyValueCompaction
140667787477056 Low Pri column_family_name_000009 Compaction 772.516 ms CompactionJob::ProcessKeyValueCompaction
140667741339712 Low Pri
140667758116928 Low Pri column_family_name_000004 Compaction 620.739 ms CompactionJob::ProcessKeyValueCompaction
140667753922624 Low Pri
140667842003008 Low Pri column_family_name_000006 Compaction 1260.079 ms CompactionJob::ProcessKeyValueCompaction
140667745534016 Low Pri
Reviewers: sdong, igor, rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35769
Summary: There is no alternative to GetLiveFiles() function
Test Plan: none
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35805
Summary: Add a DB property for number of deletions in memtables. It can sometimes help people debug slowness because of too many deletes.
Test Plan: Add test cases.
Reviewers: rven, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D35247
Summary: This feature is going to be useful for mongodb+rocksdb. I'll expose it through mongo's API.
Test Plan: added new unit test. also will run TSAN on the new unit test
Reviewers: meyering, sdong
Reviewed By: meyering, sdong
Subscribers: meyering, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35307
Summary: WriteBatch and WriteBatchWithIndex now both inherit from a common abstract base class. This makes it easier to write code that is agnostic toward the implementation of the particular write batch. In particular, I plan on utilizing this abstraction to allow transactions to support using either implementation of a write batch.
Test Plan: modified existing WriteBatchWithIndex tests to test new functions. Running all tests.
Reviewers: igor, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34017
Summary: It is no longer used by the implementation, so we should also remove it from the public API.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34971
Summary: Mark function as override.
Test Plan: USE_CLANG=1 make -j32 check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35163
Summary:
On RocksDB, when there are multiple instances doing
flushes/compactions in the background, the close call takes a long time
because the flushes/compactions need to complete before the database can
shut down. If another instance is using the background threads and the compaction for this instance is in the queue since it has been scheduled, we still cannot shutdown. We now remove the scheduled background tasks which have not yet started running, so that shutdown is speeded up.
Test Plan: DB Test added.
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33741
Summary:
Change the way options.compression_per_level is used when options.level_compaction_dynamic_level_bytes=true so that options.compression_per_level[1] determines compression for the level L0 is merged to, options.compression_per_level[2] to the level after that, etc.
Test Plan: run all tests
Reviewers: rven, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D34431
Summary:
Provide an API which enables users to infor Rocksdb that it is
shutting down.
Test Plan: db_test
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34617
Summary: Allow GetThreadList() to report the start time of the current operation.
Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
--max_background_compactions=10 --max_background_flushes=3 \
--thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
--num_column_families=10
Sample output:
ThreadID ThreadType cfName Operation OP_StartTime State
140338840797248 High Pri column_family_name_000003 Flush 2015/03/09-17:49:59
140338844991552 High Pri column_family_name_000004 Flush 2015/03/09-17:49:59
140338849185856 Low Pri
140338983403584 Low Pri
140339008569408 Low Pri
140338861768768 Low Pri
140338924683328 Low Pri
140338899517504 Low Pri
140338853380160 Low Pri
140338882740288 Low Pri
140338865963072 High Pri column_family_name_000006 Flush 2015/03/09-17:49:59
140338954043456 Low Pri
140338857574464 Low Pri
Reviewers: igor, rven, sdong
Reviewed By: sdong
Subscribers: lgalanis, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34689
Summary:
Add --thread_status_per_interval to db_bench, which allows
db_bench to optionally enable print the current thread status
periodically.
Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 --max_background_compactions=10 --max_background_flushes=3 --thread_status_per_interval=1000 --key_size=16 --value_size=1000 --num_column_families=10
Sample output:
ThreadID ThreadType dbName cfName Operation State
140281571770432 Low Pri
140281575964736 High Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000001 Flush
140281710182464 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000008 Compaction
140281638879296 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000007 Compaction
140281592741952 Low Pri
140281580159040 High Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000002 Flush
140281676628032 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000006 Compaction
140281584353344 Low Pri
140281622102080 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000009 Compaction
140281605324864 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000004 Compaction
140281601130560 High Pri /tmp/rocksdbtest-5297/dbbench default Flush
140281596936256 Low Pri
140281588547648 Low Pri
Reviewers: igor, rven, sdong
Reviewed By: rven
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D34515
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
Summary:
This diff fixes a bug introduced by D28521. Read-only backup engine can delete a backup that is later than the latest -- we never check the condition.
I also added a bunch of logging that will help with debugging cases like this in the future.
See more discussion at t6218248.
Test Plan: Added a unit test that was failing before the change. Also, see new LOG file contents: https://phabricator.fb.com/P19738984
Reviewers: benj, sanketh, sumeet, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33897
Summary:
Summary:
Added a new option to ColumnFamllyOptions - optimize_filters_for_hits. This option can be used in the case where most
accesses to the store are key hits and we dont need to optimize performance for key misses.
This is useful when you have a very large database and most of your lookups succeed. The option allows the store to
not store and use filters in the last level (the largest level which contains data). These filters can take a large amount of
space for large databases (in memory and on-disk). For the last level, these filters are only useful for key misses and not
for key hits. If we are not optimizing for key misses, we can choose to not store these filters for that level.
This option is only provided for BlockBasedTable. We skip the filters when we are compacting
Test Plan:
1. Modified db_test toalso run tests with an additonal option (skip_filters_on_last_level)
2. Added another unit test to db_test which specifically tests that filters are being skipped
Reviewers: rven, igor, sdong
Reviewed By: sdong
Subscribers: lgalanis, yoshinorim, MarkCallaghan, rven, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33717
Summary: mac compile is fixed in fbson, so it can be returned back from 7ce1b2c
Test Plan:
make all check
make valgrind_check
Reviewers: golovachalexander, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D33855
Summary:
When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes.
Prerequisites: bear and clang 3.5 build with extra tools
```lang=bash
% USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html
% clang-modernize -p . -include . -add-override
% make format
```
Test Plan:
Make sure all tests are passing.
```lang=bash
% #Use default fb code clang.
% make check
```
Verify less error and no missing override errors.
```lang=bash
% # Have trunk clang present in path.
% ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make
```
Reviewers: igor, kradhakrishnan, rven, meyering, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34077
Summary:
The "const" attribute does not make sense on a return type,
and provokes a warning/error from gcc -W -Wextra.
Test Plan:
Run "make EXTRA_CXXFLAGS='-W -Wextra'" and see fewer errors.
Reviewers: ljin, sdong, igor.sugak, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D33759
Summary:
In mongo, we currently have a single column family and I'd like to support setting rocksdb::Options from string. This diff provides an option to GetOptionsFromString()
There's one more problem. Currently GetColumnFamilyOptionsFromString() overwrites block_based options. In mongo I set default values for block_cache and some other values of BlockBasedTableOptions and I don't want them reset to default with GetOptionsFromString().
Test Plan: added unit test
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33729
Summary: Add a DB property about live versions. It can be helpful to figure out whether there are files not live but not yet deleted, in some use cases.
Test Plan: make all check
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33327
Summary: Replaced rapidjson with fbson
Test Plan:
make all check
make valgrind_check
Reviewers: golovachalexander, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D32733
Summary:
This is a diff for managed iterator. A managed iterator
is a wrapper around an iterator which saves the options for that
iterator as well as the current key/value so that the underlying iterator
and its associated memory can be released when it is aged out
automatically or on the request of the user. Will provide the automatic release as a follow-up diff.
Test Plan: Managed* tests in db_test and XF tests for managed iterator
Reviewers: igor, yhchiang, anthony, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31401
Summary: Remember whole key or prefix filtering on/off in SST files. If user opens the DB with a different setting that cannot be satisfied while reading the SST file, ignore the bloom filter.
Test Plan: Add a unit test for it
Reviewers: yhchiang, igor, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D32889
Summary: Add counters in perf context to allow users to figure out how time spent on waiting for DB mutex
Test Plan: Add a test and run it.
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D33177
Summary:
- In statistics.h , added tickers.
- In version_set.cc,
-- Added a getter method for hit_file_level_ in the class FilePicker
-- Added a line in the Get() method in case of a found, increment the corresponding counters based on the level of the file respectively.
Corresponding task: https://our.intern.facebook.com/intern/tasks/?s=506100481&t=5952818
Personal fork: 0c3f2e3600
Test Plan:
In terminal,
```
make -j32 db_test
ROCKSDB_TESTS=L0L1L2AndUpHitCounter ./db_test
```
Or to use debugger,
```
make -j32 db_test
export ROCKSDB_TESTS=L0L1L2AndUpHitCounter
gdb db_test
```
Reviewers: rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D32205
Summary:
This change adds LogHeader provision to the logger. For the rolling logger
implementation, the headers are copied over to the new log file every time
there is a log roll over.
Test Plan: Added a unit test to test the rolling log case.
Reviewers: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D32817
Summary:
Expose GetID to ColumnFamilyHandle interface so that we can save column
family data by id instead of name.
Test Plan: Testing in MySQL on Rocks.
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D32943
Summary:
Add a counter for collecting the wait time on db mutex.
Also add MutexWrapper and CondVarWrapper for measuring wait time.
Test Plan:
./db_test
export ROCKSDB_TESTS=MutexWaitStats
./db_test
verify stats output using db_bench
make clean
make release
./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10
Sample output:
rocksdb.db.mutex.wait.micros COUNT : 7546866
Reviewers: MarkCallaghan, rven, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32787
Summary:
Add ThreadStatus::GetOperationName() and ThreadStatus::GetStateName(),
two utility functions that help interpreting ThreadStatus.
Test Plan: ./thread_list_test
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32793
Summary: I'm moving mongo to a single column family, so I need DeltaBase iterator with default column family.
Test Plan: Added unit test
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32589
Summary:
This Diff provides the implementation of the cross functional
test infrastructure. This provides the ability to test a single feature
with every existing regression test in order to identify issues with
interoperability between features.
Test Plan:
Reference implementation of inplace update support cross
functional test. Able to find interoperability issues with inplace
support and ran all of db_test. Will add separate diff for those changes.
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32247
Summary: Add CappedFixTransform, which is the same as fixed length prefix extractor, except that when slice is shorter than the fixed length, it will use the full key.
Test Plan:
Add a test case for
db_test
options_test
and a new test
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D31887
Summary: This adds a listener for compactions, and gives some useful statistics on each compaction pass.
Test Plan: Unit tests.
Reviewers: sdong, igor, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D31641
Summary: We added a new format version. Reflect that in the comments.
Test Plan: none
Reviewers: sdong, rven, yhchiang, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31629
Summary:
This diff adds BlockBasedTable format_version = 2. New format version brings better compressed block format for these compressions:
1) Zlib -- encode decompressed size in compressed block header
2) BZip2 -- encode decompressed size in compressed block header
3) LZ4 and LZ4HC -- instead of doing memcpy of size_t encode size as varint32. memcpy is very bad because the DB is not portable accross big/little endian machines or even platforms where size_t might be 8 or 4 bytes.
It does not affect format for snappy.
If you write a new database with format_version = 2, it will not be readable by RocksDB versions before 3.10. DB::Open() will return corruption in that case.
Test Plan:
Added a new test in db_test.
I will also run db_bench and verify VSIZE when block_cache == 1GB
Reviewers: yhchiang, rven, MarkCallaghan, dhruba, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31461
Summary:
In this diff I add another parameter to BlockBasedTableOptions that will let users specify block based table's format. This will greatly simplify block based table's format changes in the future.
First format change that this will support is encoding decompressed size in Zlib and BZip2 blocks. This diff is blocking https://reviews.facebook.net/D31311.
Test Plan: Added a unit tests. More tests to come as part of https://reviews.facebook.net/D31311.
Reviewers: dhruba, MarkCallaghan, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31383
Summary: Add comments in db.h to help users discover their options.
Test Plan: Compile
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31077
Summary:
Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery.
The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not only just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I have digged up old task t2416297 which confirms my findings.
Test Plan: There was actually no tests that verified correct behavior of skip_log_error_on_recovery.
Reviewers: yhchiang, rven, dhruba, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30603
Summary:
Add structures for exposing events and operations. Event describes
high-level action about a thread such as doing compaciton or
doing flush, while an operation describes lower-level action
of a thread such as reading / writing a SST table, waiting for
mutex. Events and operations are designed to be independent.
One thread would typically involve in one event and one operation.
Code instrument will be in a separate diff.
Test Plan:
Add unit-tests in thread_list_test
make dbg -j32
./thread_list_test
export ROCKSDB_TESTS=ThreadList
./db_test
Reviewers: ljin, igor, sdong
Reviewed By: sdong
Subscribers: rven, jonahcohen, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29781
Summary:
Add support to allow nested config for block-based table factory. The format looks like this:
"write_buffer_size=1024;block_based_table_factory={block_size=4k};max_write_buffer_num=2"
Test Plan: unit test
Reviewers: yhchiang, rven, igor, ljin, jonahcohen
Reviewed By: jonahcohen
Subscribers: jonahcohen, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29223
Summary:
GetThreadList() feature depends on the thread creation and destruction, which is currently handled under Env.
This patch moves GetThreadList() feature under Env to better manage the dependency of GetThreadList() feature
on thread creation and destruction.
Renamed ThreadStatusImpl to ThreadStatusUpdater. Add ThreadStatusUtil, which is a static class contains
utility functions for ThreadStatusUpdater.
Test Plan: run db_test, thread_list_test and db_bench and verify the life cycle of Env and ThreadStatusUpdater is properly managed.
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: ljin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30057
Summary: This is a regression bug introduced by https://reviews.facebook.net/D24729 . max_total_wal_size would be off the target it should be more and more in the case that the a user holds the current super version after flush or compaction. This patch fixes it
Test Plan: make all check
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: ljin, yoshinorim, MarkCallaghan, hermanlee4, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29961
Summary:
Remove the use of exception in WriteBatch::Handler. Now the default
implementations of Put, Merge, and Delete in WriteBatch::Handler are no-op.
Test Plan:
Add three test cases in write_batch_test
./write_batch_test
Reviewers: sdong, igor
Reviewed By: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29835
Summary:
Introduces a new class for managing write buffer memory across column
families. We supplement ColumnFamilyOptions::write_buffer_size with
ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
instance that enforces memory limits before flushing out to disk.
Test Plan: Added SharedWriteBuffer unit test to db_test.cc
Reviewers: sdong, rven, ljin, igor
Reviewed By: igor
Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim
Differential Revision: https://reviews.facebook.net/D22581
Summary: Block Universal and FIFO compactions in ROCKSDB_LITE
Test Plan:
make shared_lib -j32
make OPT=-DROCKSDB_LITE shared_lib
Reviewers: ljin, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29589
Summary:
db_imp_readonly.o is one of the big obj file. If it's not a necessary
feature, we should probably block it in ROCKSDB_LITE.
1322704 Nov 24 16:55 db/db_impl_readonly.o
Test Plan:
make shared_lib -j32
make ROCKSDB_LITE shared_lib -j32
Reviewers: ljin, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29583
Summary: By providing default implementation of LinkFile, we don't break other implementations of Env.
Test Plan: none
Reviewers: rven, dhruba
Reviewed By: dhruba
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29355
Summary:
Add enable_thread_tracking to DBOptions to allow
tracking thread status related to the DB. Default is off.
Test Plan:
export ROCKSDB_TESTS=ThreadList
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29289
Summary: Created a CompatibleOptions object that can be used as a LevelDB Options object and then converted to a RocksDB Options object using the ConvertOptions() method.
Test Plan: Unit test included in diff.
Reviewers: ljin
Reviewed By: ljin
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28893
Summary:
Add GetThreadList API, which allows developer to track the
status of each process. Currently, calling GetThreadList will
only get the list of background threads in RocksDB with their
thread-id and thread-type (priority) set. Will add more support
on this in the later diffs.
ThreadStatus currently has the following properties:
// An unique ID for the thread.
const uint64_t thread_id;
// The type of the thread, it could be ROCKSDB_HIGH_PRIORITY,
// ROCKSDB_LOW_PRIORITY, and USER_THREAD
const ThreadType thread_type;
// The name of the DB instance where the thread is currently
// involved with. It would be set to empty string if the thread
// does not involve in any DB operation.
const std::string db_name;
// The name of the column family where the thread is currently
// It would be set to empty string if the thread does not involve
// in any column family.
const std::string cf_name;
// The event that the current thread is involved.
// It would be set to empty string if the information about event
// is not currently available.
Test Plan:
./thread_list_test
export ROCKSDB_TESTS=GetThreadList
./db_test
Reviewers: rven, igor, sdong, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D25047
Summary:
The very last reference happens in DBImpl::GetOptions()
I built with both DBImpl::GetOptions() and ColumnFamilyData::options() commented out
Test Plan: make all check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29073
Summary: Store links to live files in directory on same disk
Test Plan:
Take snapshot and open it. Added a test GetSnapshotLink in
db_test.
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28713
Summary: This will speed up our import times
Test Plan: Added simple unit test just to get code coverage
Reviewers: sdong, ljin, yhchiang, rven, mohaps
Reviewed By: mohaps
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28869
Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :(
Test Plan: TARGET_OS=IOS make static_lib
Reviewers: dhruba, ljin, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28743
Summary:
Improve the backup engine by not deleting the corrupted
backup when it is detected; instead leaving it to the client
to delete the corrupted backup.
Also add a BackupEngine::Open() call.
Test Plan:
Add check to CorruptionTest inside backupable_db_test
to check that the corrupt backups are not deleted. The previous
version of the code failed this test as backups were deleted,
but after the changes in this commit, this test passes.
Run make check to ensure that no other tests fail.
Reviewers: sdong, benj, sanketh, sumeet, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28521
Summary:
We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details.
This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables.
Test Plan: compiles
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: yhchiang
Subscribers: bobbaldwin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28689
Summary: I mistakenly changed the behavior to ++next_file_number_ instead of next_file_number_++, as it should have been: 344edbb044/db/version_set.h (L539)
Test Plan: none. not sure if this would break anything. It's just different behavior, so I'd rather not risk
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28557
Summary:
Refactor sst_dump to follow the same structure as ldb. Introduce a
SSTDump interface.
Test Plan: built sst_dump and tried it manually.
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28485
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
Summary: It turns out that -Wshadow has different rules for gcc than clang. Previous commit fixed clang. This commits fixes the rest of the warnings for gcc.
Test Plan: compiles
Reviewers: ljin, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28131
Summary: This patch makes it a contract that if an empty write batch is passed to DB::Write() and WriteOptions.sync = true, fsync is called to WAL.
Test Plan: A new unit test
Reviewers: ljin, rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D28365
Summary: as title
Test Plan: ./db_test
Reviewers: sdong, yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28269
Summary: As tittle
Test Plan: N/A
Reviewers: ljin, yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D28299
Summary:
With the patch, thread pool size will be automatically increased if DB's options ask for more parallelism of compactions or flushes.
Too many users have been confused by the API. Change it to make it harder for users to make mistakes
Test Plan: Add two unit tests to cover the function.
Reviewers: yhchiang, rven, igor, MarkCallaghan, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D27555
Summary: Move header file so it can be referenced externally.
Test Plan: Rebuild.
Reviewers: ljin
Reviewed By: ljin
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D28095
Summary:
Make inplace_update_support and inplace_update_num_locks dynamic.
inplace_callback becomes immutable
We are almost free of references to cfd->options() in db_impl
Test Plan: unit test
Reviewers: igor, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D25293
Summary: Add an iterator that combines base_iterator of type Iterator* with delta iterator of type WBWIIterator*.
Test Plan: nothing yet. work in progress
Reviewers: ljin, igor
Reviewed By: igor
Subscribers: rven, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D24741
Summary: Support SeekToFirst(), SeekToLast() and Prev() in WBWIIterator, returned by WriteBatchWithIndex::NewIterator().
Test Plan: Write unit test cases to cover the case.
Reviewers: ljin, igor
Reviewed By: igor
Subscribers: rven, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D24765
Summary: With a new option, when accepting a new key, WriteBatchWithIndex will find an existing index of the same key, and replace the content of it.
Test Plan: Add a unit test case.
Reviewers: ljin, yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D24753
Summary: Allow accepting Options as a string of key/value pairs
Test Plan: unit test
Reviewers: yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24597
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:
Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...
If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.
Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.
$ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
-key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
-seekseq_next 2 -skip_list_lookahead=0
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.389 micros/op 2569047 ops/sec;
real 0m21.806s
user 0m12.106s
sys 0m9.672s
$ time ./db_bench [...] -skip_list_lookahead=2
[...]
DB path: [/dev/shm/rocksdbtest/dbbench]
fillseekseq : 0.153 micros/op 6540684 ops/sec;
real 0m19.469s
user 0m10.192s
sys 0m9.252s
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, march, lovro
Differential Revision: https://reviews.facebook.net/D23997
Summary:
Before this diff, there are two places with rocksdb versions. After the diff:
1. we only have one source of truth for rocksdb version
2. we have a script that we can use to get the version that we can use in other compilations (java, go, etc).
Test Plan: make
Reviewers: yhchiang, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D24333
Summary:
make compaction related options changeable. Most of changes are tedious,
following the same convention: grabs MutableCFOptions at the beginning
of compaction under mutex, then pass it throughout the job and register
it in SuperVersion at the end.
Test Plan: make all check
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23349
Extended Built-in comparators with ReverseBytewiseComparator.
Reverse key handling is under certain conditions essential. E.g. while
using timestamp versioned data.
As native-comparators were not available using JAVA-API. Both built-in comparators
were exposed via JNI to be set upon database creation time.
Summary:
When creating a new iterator, instead of storing mapping from key to
bucket id for sorting, store only bucket id and read key from mmap file
based on the id. This reduces from 20 bytes per entry to only 4 bytes.
Test Plan: db_bench
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23757
Summary:
Using module to calculate hash makes lookup ~8% slower. But it has its
benefit: file size is more predictable, more space enffient
Test Plan: db_bench
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23691
Summary:
It was commented out in D22545 by accident. Keep the option in
ImmutableOptions for now. I can make it dynamic in
https://reviews.facebook.net/D23349
Test Plan: make release
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23865
Summary:
Previously, one single column family is given to WriteBatchWithIndex to index keys for all column families. An extra map from column family ID to comparator is maintained which can override the default comparator given in the constructor. A WriteBatchWithIndex::SetComparatorForCF() is added for user to add comparators per column family.
Also move more codes into anonymous namespace.
Test Plan: Add a unit test
Reviewers: ljin, igor
Reviewed By: igor
Subscribers: dhruba, leveldb, yhchiang
Differential Revision: https://reviews.facebook.net/D23355
Summary: It contrains the file size to be 4G max with int
Test Plan:
tried to grep instance and made sure other related variables are also
uint64
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23697
Summary:
MurmurHash becomes expensive when we do millions Get() a second in one
thread. Add this option to allow the first hash function to use identity
function as hash function. It results in QPS increase from 3.7M/s to
~4.3M/s. I did not observe improvement for end to end RocksDB
performance. This may be caused by other bottlenecks that I will address
in a separate diff.
Test Plan:
```
[ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=0
==== Test CuckooReaderTest.WhenKeyExists
==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator
==== Test CuckooReaderTest.CheckIterator
==== Test CuckooReaderTest.CheckIteratorUint64
==== Test CuckooReaderTest.WhenKeyNotFound
==== Test CuckooReaderTest.TestReadPerformance
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.272us (3.7 Mqps) with batch size of 0, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.138us (7.2 Mqps) with batch size of 10, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.142us (7.1 Mqps) with batch size of 25, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.142us (7.0 Mqps) with batch size of 50, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.144us (6.9 Mqps) with batch size of 100, # of found keys 125829120
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.201us (5.0 Mqps) with batch size of 0, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.121us (8.3 Mqps) with batch size of 10, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.123us (8.1 Mqps) with batch size of 25, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.121us (8.3 Mqps) with batch size of 50, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.112us (8.9 Mqps) with batch size of 100, # of found keys 104857600
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.251us (4.0 Mqps) with batch size of 0, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.107us (9.4 Mqps) with batch size of 10, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.099us (10.1 Mqps) with batch size of 25, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.100us (10.0 Mqps) with batch size of 50, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.116us (8.6 Mqps) with batch size of 100, # of found keys 83886080
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.189us (5.3 Mqps) with batch size of 0, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.095us (10.5 Mqps) with batch size of 10, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.096us (10.4 Mqps) with batch size of 25, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.098us (10.2 Mqps) with batch size of 50, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.105us (9.5 Mqps) with batch size of 100, # of found keys 73400320
[ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=1
==== Test CuckooReaderTest.WhenKeyExists
==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator
==== Test CuckooReaderTest.CheckIterator
==== Test CuckooReaderTest.CheckIteratorUint64
==== Test CuckooReaderTest.WhenKeyNotFound
==== Test CuckooReaderTest.TestReadPerformance
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.230us (4.3 Mqps) with batch size of 0, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.086us (11.7 Mqps) with batch size of 10, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.088us (11.3 Mqps) with batch size of 25, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.083us (12.1 Mqps) with batch size of 50, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.083us (12.1 Mqps) with batch size of 100, # of found keys 125829120
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.159us (6.3 Mqps) with batch size of 0, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 10, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.080us (12.6 Mqps) with batch size of 25, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.080us (12.5 Mqps) with batch size of 50, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.082us (12.2 Mqps) with batch size of 100, # of found keys 104857600
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.154us (6.5 Mqps) with batch size of 0, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.077us (13.0 Mqps) with batch size of 10, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.077us (12.9 Mqps) with batch size of 25, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 50, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.079us (12.6 Mqps) with batch size of 100, # of found keys 83886080
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.218us (4.6 Mqps) with batch size of 0, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.083us (12.0 Mqps) with batch size of 10, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.085us (11.7 Mqps) with batch size of 25, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.086us (11.6 Mqps) with batch size of 50, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 100, # of found keys 73400320
```
Reviewers: sdong, igor, yhchiang
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23451
Summary: as title
Test Plan:
make all check
I will think a way to set up stress test for this
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23055
Summary: as title
Test Plan: options_test
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23283
Summary: Correct some comments and typos in RocksDB.
Test Plan: Inspection
Reviewers: sdong, igor
Reviewed By: igor
Differential Revision: https://reviews.facebook.net/D23133
Summary:
Added the following statistics to BackupableDB:
1. Number of successful and failed backups in class BackupStatistics
2. Time taken to do a backup
3. Number of files in a backup
1 is implemented in the BackupStatistics class
2 and 3 are added in the BackupMeta and BackupInfo class
Test Plan:
1 can be tested using BackupStatistics::ToString(),
2 and 3 can be tested in the BackupInfo class
Reviewers: sdong, igor2, ljin, igor
Reviewed By: igor
Differential Revision: https://reviews.facebook.net/D22785
Summary: removed reference to options in WriteBatch and DBImpl::Get()
Test Plan: make all check
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23049
Summary:
all shared_ptrs are in immutable_options now. This will also make
options assignment a little cheaper
Test Plan: make release
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23001
Summary:
I found it is almost impossible to get rid of this function in a single
batch. I will take a step by step approach
Test Plan: make release
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22995
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
Summary:
1. Make filter_block.h a base class. Derive block_based_filter_block and full_filter_block. The previous one is the traditional filter block. The full_filter_block is newly added. It would generate a filter block that contain all the keys in SST file.
2. When querying a key, table would first check if full_filter is available. If not, it would go to the exact data block and check using block_based filter.
3. User could choose to use full_filter or tradional(block_based_filter). They would be stored in SST file with different meta index name. "filter.filter_policy" or "full_filter.filter_policy". Then, Table reader is able to know the fllter block type.
4. Some optimizations have been done for full_filter_block, thus it requires a different interface compared to the original one in filter_policy.h.
5. Actual implementation of filter bits coding/decoding is placed in util/bloom_impl.cc
Benchmark: base commit 1d23b5c470
Command:
db_bench --db=/dev/shm/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --write_buffer_size=134217728 --max_write_buffer_number=2 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --verify_checksum=false --max_background_compactions=4 --use_plain_table=0 --memtablerep=prefix_hash --open_files=-1 --mmap_read=1 --mmap_write=0 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_bits=500000 --compression_type=lz4 --num=393216000 --use_hash_search=1 --block_size=1024 --block_restart_interval=16 --use_existing_db=1 --threads=1 --benchmarks=readrandom —disable_auto_compactions=1
Read QPS increase for about 30% from 2230002 to 2991411.
Test Plan:
make all check
valgrind db_test
db_stress --use_block_based_filter = 0
./auto_sanity_test.sh
Reviewers: igor, yhchiang, ljin, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D20979
Summary:
As a preparation to support updating some options dynamically, I'd like
to first introduce ImmutableOptions, which is a subset of Options that
cannot be changed during the course of a DB lifetime without restart.
ColumnFamily will keep both Options and ImmutableOptions. Any component
below ColumnFamily should only take ImmutableOptions in their
constructor. Other options should be taken from APIs, which will be
allowed to adjust dynamically.
I am yet to make changes to memtable and other related classes to take
ImmutableOptions in their ctor. That can be done in a seprate diff as
this one is already pretty big.
Test Plan: make all check
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D22545
Summary: as title
Test Plan: none
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22737
Summary: It only covers Open() with default column family right now
Test Plan: make release
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22467
Summary:
Before this diff, whenever we Write to non-existing column family, Write() would fail.
This diff adds an option to not fail a Write() when WriteBatch points to non-existing column family. MongoDB said this would be useful for them, since they might have a transaction updating an index that was dropped by another thread. This way, they don't have to worry about checking if all indexes are alive on every write. They don't care if they lose writes to dropped index.
Test Plan: added a small unit test
Reviewers: sdong, yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22143
Summary: 1. db/db_impl.cc:2324 (DBImpl::BackgroundCompaction) should not raise bg_error_ when column family is dropped during compaction.
Test Plan: 1. db_stress
Reviewers: ljin, yhchiang, dhruba, igor, sdong
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22653
Summary: This implements a cache friendly version of Cuckoo Hash in which, in case of collission, we try to insert in next few locations. The size of the neighborhood to check is taken as an input parameter in builder and stored in the table.
Test Plan:
make check all
cuckoo_table_{db,reader,builder}_test
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22455
Summary: No __thread for ios.
Test Plan: compile works for ios now
Reviewers: ljin, dhruba
Reviewed By: dhruba
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22491
Summary: also fix HISTORY.md
Test Plan: make all check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22437
Summary: Add a virtual function in table factory that will print table options
Test Plan: make release
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22149
Summary:
I will move compression related options in a separate diff since this
diff is already pretty lengthy.
I guess I will also need to change JNI accordingly :(
Test Plan: make all check
Reviewers: yhchiang, igor, sdong
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21915
Summary:
Currently, PlainTable must use mmap_reads. When PlainTable is used but
allow_mmap_reads is not set, rocksdb will fail in flush.
This diff improve Options sanitization and add MmapReadRequired() to
TableFactory.
Test Plan:
export ROCKSDB_TESTS=PlainTableOptionsSanitizeTest
make db_test -j32
./db_test
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: you, leveldb
Differential Revision: https://reviews.facebook.net/D21939
Summary:
Add WriteBatchWithIndex so that a user can query data out of a WriteBatch, to support MongoDB's read-its-own-write.
WriteBatchWithIndex uses a skiplist to store the binary index. The index stores the offset of the entry in the write batch. When searching for a key, the key for the entry is read by read the entry from the write batch from the offset.
Define a new iterator class for querying data out of WriteBatchWithIndex. A user can create an iterator of the write batch for one column family, seek to a key and keep calling Next() to see next entries.
I will add more unit tests if people are OK about this API.
Test Plan:
make all check
Add unit tests.
Reviewers: yhchiang, igor, MarkCallaghan, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb, xjin
Differential Revision: https://reviews.facebook.net/D21381
Summary: Bump up version after we've cut 3.4
Test Plan: N/A
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba, igor
Differential Revision: https://reviews.facebook.net/D22047
Summary: This is a linux-specific system call.
Test Plan: ran db_bench
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: haobo, leveldb
Differential Revision: https://reviews.facebook.net/D21183
Summary: as title
Test Plan: make release
Reviewers: sdong, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21717
Summary: as title
Test Plan: make all check
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21201
Summary:
Contains the following changes:
- Implementation of cuckoo_table_factory
- Adding cuckoo table into AdaptiveTableFactory
- Adding cuckoo_table_db_test, similar to lines of plain_table_db_test
- Minor fixes to Reader: When a key is found in the table, return the key found instead of the search key.
- Minor fixes to Builder: Add table properties that are required by Version::UpdateTemporaryStats() during Get operation. Don't define curr_node as a reference variable as the memory locations may get reassigned during tree.push_back operation, leading to invalid memory access.
Test Plan:
cuckoo_table_reader_test --enable_perf
cuckoo_table_builder_test
cuckoo_table_db_test
make check all
make valgrind_check
make asan_check
Reviewers: sdong, igor, yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21219
Summary: as title
Test Plan: make release
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D21549
Summary: This will help debugging
Test Plan: ran, observed output
Reviewers: yinwang
Reviewed By: yinwang
Differential Revision: https://reviews.facebook.net/D20937
Summary: We have quite some properties that are integers and we are adding more. Add a function to directly return them as an integer, instead of a string
Test Plan: Add several unit test checks
Reviewers: yhchiang, igor, dhruba, haobo, ljin
Reviewed By: ljin
Subscribers: yoshinorim, leveldb
Differential Revision: https://reviews.facebook.net/D20637
Summary:
Make StatisticsImpl being able to forward stats to provided statistics
implementation. The main purpose is to allow us to collect internal
stats in the future even when user supplies custom statistics
implementation. It avoids intrumenting 2 sets of stats collection code.
One immediate use case is tuning advisor, which needs to collect some
internal stats, users may not be interested.
Test Plan:
ran db_bench and see stats show up at the end of run
Will run make all check since some tests rely on statistics
Reviewers: yhchiang, sdong, igor
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D20145
Summary:
Contains:
- Implementation of TableReader based on Cuckoo Hashing
- Unittests for CuckooTableReader
- Performance test for TableReader
Test Plan:
make cuckoo_table_reader_test
./cuckoo_table_reader_test
make valgrind_check
make asan_check
Reviewers: yhchiang, sdong, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20511
Summary:
User gets undefinied error since the definition is not exposed.
Also re-enable the db test with only upper bound check
Test Plan: db_test, rate_limit_test
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20403
Summary: I changed SpatialDB API so that we only specify list of indexes when we create the database. That way, whoever is querying the DB doesn't need to know the full list of indexes and their options.
Test Plan: spatial_db_test
Reviewers: yinwang
Reviewed By: yinwang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20571
Summary:
Contains the following changes in CuckooTableBuilder:
- Take an extra parameter in constructor to identify last level file.
- Implement a better way to identify if a bucket has been inserted into the tree already during BFS search.
- Minor typos
Test Plan:
make cuckoo_table_builder
./cuckoo_table_builder
make valgrind_check
Reviewers: sdong, igor, yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20445
Summary:
This diff is adding spatial index support to RocksDB.
When creating the DB user specifies a list of spatial indexes. Spatial indexes can cover different areas and have different resolution (i.e. number of tiles). This is useful for supporting different zoom levels.
Each element inserted into SpatialDB has:
* a bounding box, which determines how will the element be indexed
* string blob, which will usually be WKB representation of the polygon (http://en.wikipedia.org/wiki/Well-known_text)
* feature set, which is a map of key-value pairs, where value can be int, double, bool, null or a string. FeatureSet will be a set of tags associated with geo elements (for example, 'road': 'highway' and similar)
* a list of indexes to insert the element in. For example, small river element will be inserted in index for high zoom level, while country border will be inserted in all indexes (including the index for low zoom level).
Each query is executed on single spatial index. Query guarantees that it will return all elements intersecting the specified bounding box, but it might also return some extra non-intersecting elements.
Test Plan: Added bunch of unit tests in spatial_db_test
Reviewers: dhruba, yinwang
Reviewed By: yinwang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20361
Summary:
All public headers need to be under `include/rocksdb` directory. Otherwise, clients include our header files like this:
#include <rocksdb/db.h>
#include <utilities/backupable_db.h> // still our public header!
Also, internally, we include:
#include "utilities/backupable/backupable_db.h" // internal header
#include "utilities/backupable_db.h" // public header
which is confusing.
This way, when we install rocksdb as a system library, we can just copy `include/rocksdb` directory to system's header files. We can't really copy `utilities` directory to system's header files.
Test Plan: compiles
Reviewers: dhruba, ljin, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20409
Summary:
Add a function GetOptions(), where based on four parameters users give: read/write amplification threshold, memory budget for mem tables and target DB size, it picks up a compaction style and parameters for them. Background threads are not touched yet.
One limit of this algorithm: since compression rate and key/value size are hard to predict, it's hard to predict level 0 file size from write buffer size. Simply make 1:1 ratio here.
Sample results: https://reviews.facebook.net/P477
Test Plan: Will add some a unit test where some sample scenarios are given and see they pick the results that make sense
Reviewers: yhchiang, dhruba, haobo, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18741
Summary: Add a parameter path_id to DB::CompactRange(), to indicate where the output file should be placed to.
Test Plan: add a unit test
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D20085
Summary:
Suggested by @dhruba from the other diff, here is the improved
comments for parameters of the function
Test Plan: none
Reviewers: dhruba, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19623
Summary:
Adding option to save PlainTable index and bloom filter in SST file.
If there is no bloom block and/or index block, PlainTableReader builds
new ones. Otherwise PlainTableReader just use these blocks.
Test Plan: make all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19527
Summary:
Since we have a lot of options for PlainTable, add a struct PlainTableOptions
to manage them
Test Plan: make all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20175
Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.
Test Plan: Add new unit tests
Reviewers: ljin, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D19869
Summary: Browsing through the code, looks like StatsLogger is not used at all!
Test Plan: compiles
Reviewers: ljin, sdong, yhchiang, dhruba
Reviewed By: dhruba
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19827
Summary: As title
Test Plan: Not needed for comment only.
Reviewers: yhchiang, ljin, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: xjin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19887
Summary: Add a function to return the perf level. It is to allow a wrapper of DB to increase the perf level and restore the original perf level after finishing the function call.
Test Plan: Add a verification in db_test
Reviewers: yhchiang, igor, ljin
Reviewed By: ljin
Subscribers: xjin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19551
Summary:
Seems like NewTotalOrderPlainTableFactory is useless and is semantically incorrect.
Total order mode indicator is prefix_extractor == nullptr,
but NewTotalOrderPlainTableFactory doesn't set it to be nullptr. That's why some tests
in plain_table_db_tests is incorrect.
Test Plan: make all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19587
Summary:
This is a rough sketch of our new document API. Would like to get some thoughts and comments about the high-level architecture and API.
I didn't optimize for performance at all. Leaving some low-hanging fruit so that we can be happy when we fix them! :)
Currently, bunch of features are not supported at all. Indexes can be only specified when creating database. There is no query planner whatsoever. This will all be added in due time.
Test Plan: Added a simple unit test
Reviewers: haobo, yhchiang, dhruba, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18747
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.
Test Plan: db_test
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19425
Summary:
A generic rate limiter that can be shared by threads and rocksdb
instances. Will use this to smooth out write traffic generated by
compaction and flush. This will help us get better p99 behavior on flash
storage.
Test Plan:
unit test output
==== Test RateLimiterTest.Rate
request size [1 - 1023], limit 10 KB/sec, actual rate: 10.374969 KB/sec, elapsed 2002265
request size [1 - 2047], limit 20 KB/sec, actual rate: 20.771242 KB/sec, elapsed 2002139
request size [1 - 4095], limit 40 KB/sec, actual rate: 41.285299 KB/sec, elapsed 2202424
request size [1 - 8191], limit 80 KB/sec, actual rate: 81.371605 KB/sec, elapsed 2402558
request size [1 - 16383], limit 160 KB/sec, actual rate: 162.541268 KB/sec, elapsed 3303500
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19359
Summary: stackable_db doesn't have GetLiveFilesMetaData() implemented. Add it
Test Plan: make all check
Reviewers: yhchiang, ljin, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19491
Summary:
This diff allows the I/O stats about Flush and Compaction to be reported
in a more accurate way. Instead of measuring the size of a file, it
measure I/O cost in per read / write basis.
Test Plan: make all check
Reviewers: sdong, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19383
Summary:
This diff adds timeout_hint_us to WriteOptions. If it's non-zero, then
1) writes associated with this options MAY be aborted when it has been
waiting for longer than the specified time. If an abortion happens,
associated writes will return Status::TimeOut.
2) the stall time of the associated write caused by flush or compaction
will be limited by timeout_hint_us.
The default value of timeout_hint_us is 0 (i.e., OFF.)
The statistics of timeout writes will be recorded in WRITE_TIMEDOUT.
Test Plan:
export ROCKSDB_TESTS=WriteTimeoutAndDelayTest
make db_test
./db_test
Reviewers: igor, ljin, haobo, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18837
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.
Test Plan: make all check
Reviewers: igor, haobo, ljin, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18921
Summary:
In this patch, we enhance HashLinkList memtable to reduce performance outliers when a bucket contains too many entries. We switch to skip list for this case to enable binary search.
Add threshold_use_skiplist parameter to determine when a bucket needs to switch to skip list.
The new data structure is documented in comments in the codes.
Test Plan:
make all check
set threshold_use_skiplist in several tests
Reviewers: yhchiang, haobo, ljin
Reviewed By: yhchiang, ljin
Subscribers: nkg-, xjin, dhruba, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D19299
Summary: It seems to me that when ever function MemTableRep::GetIterator(const Slice& slice) is used, we can use MemTableRep::GetDynamicPrefixIterator() instead. Just delete it to simplify the codes.
Test Plan: make all check
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: xjin, dhruba, haobo, leveldb
Differential Revision: https://reviews.facebook.net/D19281
Summary: Currently ldb tool dump keys either in ascii format or hex format - neither is ideal if the key has a binary structure and is not readable in ascii. This diff also allows LDB tool to be customized in ways beyond DB options.
Test Plan: verify that key formatter works with some simple db with binary key.
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19209
Summary:
After evaluating options for JSON storage, I decided to implement our own. The reason is that we'll be able to optimize it better and we get to reduce unnecessary dependencies (which is what we'd get with folly).
I also plan to write a serializer/deserializer for JSONDocument with our own binary format similar to BSON. That way we'll store binary JSON format in RocksDB instead of the plain-text JSON. This means less storage and faster deserialization.
There are still some inefficiencies left here. I plan to optimize them after we develop a functioning DocumentDB. That way we can move and iterate faster.
Test Plan: added a unit test
Reviewers: dhruba, haobo, sdong, ljin, yhchiang
Reviewed By: haobo
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18831
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.
This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.
There is one test case that relied on didIO information. I'll try to find another way to implement it.
Test Plan: make check
Reviewers: sdong, haobo, yhchiang, ljin, dhruba
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19161
Summary:
Add a encoding feature of PlainTable to encode PlainTable's keys to save some bytes for the same prefixes.
The data format is documented in table/plain_table_factory.h
Test Plan: Add unit test coverage in plain_table_db_test
Reviewers: yhchiang, igor, dhruba, ljin, haobo
Reviewed By: haobo
Subscribers: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D18735
Summary:
Currently, the in-memory hash index of blockbased table uses a precise hash map to track the prefix to block range mapping. In some use cases, especially when prefix itself is big, the memory overhead becomes a problem. This diff introduces a fixed hash bucket array that does not store the prefix and allows prefix collision, which is similar to the plaintable hash index, in order to reduce the memory consumption.
Just a quick draft, still testing and refining.
Test Plan: unit test and shadow testing
Reviewers: dhruba, kailiu, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19047
Summary:
This is minor, but if we put the writing talbe factory as the third parameter, when we add a new table format, we'll have a situation:
1) block based factory
2) plain table factory
3) output factory
4) new format factory
I think it makes more sense to have output as the first parameter.
Also, fixed a NewAdaptiveTableFactory() call in unit test
Test Plan: unit test
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19119