Summary:
This diff allows a Writer to join the next write batch group
without acquiring any locks. Waiting is performed via a per-Writer mutex,
so all of the non-leader writers never need to acquire the db mutex.
It is now possible to join a write batch group after the leader has been
chosen but before the batch has been constructed. This diff doesn't
increase parallelism, but reduces synchronization overheads.
For some CPU-bound workloads (no WAL, RAM-sized working set) this can
substantially reduce contention on the db mutex in a multi-threaded
environment. With T=8 N=500000 in a CPU-bound scenario (see the test
plan) this is good for a 33% perf win. Not all scenarios see such a
win, but none show a loss. This code is slightly faster even for the
single-threaded case (about 2% for the CPU-bound scenario below).
Test Plan:
1. unit tests
2. COMPILE_WITH_TSAN=1 make check
3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000
Reviewers: sdong, igor, rven, ljin, yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43887
Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.
Test Plan: make check
Reviewers: anthony, kradhakrishnan, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41937
Summary:
This fixes the following scenario we've hit:
- we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one,
- before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_,
- hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in,
- a few more hours pass and we run out disk space because of one huge .log file.
Test Plan: `make check`; backported the new test, checked that it fails without this diff
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40893
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
Summary:
Cleaning up log files can do heavy IO, since we call ftruncate() in the destructor. We don't want to call ftruncate() in user threads.
This diff moves cleaning to background threads (flush and compaction)
Test Plan: make check, will also run valgrind
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36177
Summary:
Add a counter for collecting the wait time on db mutex.
Also add MutexWrapper and CondVarWrapper for measuring wait time.
Test Plan:
./db_test
export ROCKSDB_TESTS=MutexWaitStats
./db_test
verify stats output using db_bench
make clean
make release
./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10
Sample output:
rocksdb.db.mutex.wait.micros COUNT : 7546866
Reviewers: MarkCallaghan, rven, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32787
Summary:
The very last reference happens in DBImpl::GetOptions()
I built with both DBImpl::GetOptions() and ColumnFamilyData::options() commented out
Test Plan: make all check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29073
Summary: Enforce the accessier naming convention in functions in version_set.h
Test Plan: make all check
Reviewers: ljin, yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D28143
Summary:
Rename Version::Builder to VersionBuilder and expose its definition to a header.
Make VerisonBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo.
Add version_builder_test which has a simple test.
Test Plan: make all check
Reviewers: rven, yhchiang, igor, ljin
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D27969
Summary: Decoupling code that deals with archived log files outside of DBImpl. That will make this code easier to reason about and test. It will also make the code easier to improve, because an improver doesn't have to understand DBImpl code in entirety.
Test Plan: added test
Reviewers: ljin, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D27873
Summary:
Make compaction picker easier to test.
The basic idea is to separate a minimum subcomponent of Version to VersionStorageInfo, which just responsible to LSM tree. A stub VersionStorageInfo can then be easily created and passed into compaction picker so that we can check the outputs.
It now passes most tests. Still two things need to be done:
(1) deal with the FIFO compaction's file size.
(2) write an example test to make sure the interface can do the job.
Add a compaction_picker_test to make sure compaction picker codes can be easily unit tested.
Test Plan:
Pass all unit tests and compaction_picker_test
Reviewers: yhchiang, rven, igor, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D27639
Summary: This diff just moves the write thread control out of the DBImpl. I will need this as I will control column family data concurrency by only accessing some data in the write thread. That way, we won't have to lock our accesses to column family hash table (mappings from IDs to CFDs).
Test Plan: make check
Reviewers: sdong, yhchiang, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23301
Summary: If we drop column family only from (single) write thread, we can be sure that nobody will drop the column family while we're writing (and our mutex is released). This greatly simplifies my patch that's getting rid of MakeRoomForWrite().
Test Plan: make check, but also running stress test
Reviewers: ljin, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22965
Summary:
Simply code by removing code path which does not use Arena
from NewInternalIterator
Test Plan:
make all check
make valgrind_check
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22395
Summary: Add a parameter path_id to DB::CompactRange(), to indicate where the output file should be placed to.
Test Plan: add a unit test
Reviewers: yhchiang, ljin
Reviewed By: ljin
Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D20085
Summary:
Introducing new compaction style -- FIFO.
FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.
FIFO compaction style is suited for storing high-frequency event logs.
Test Plan: Added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
Subscribers: alberts, leveldb
Differential Revision: https://reviews.facebook.net/D18765
Summary:
ReadFirstRecord() reads the actual log file from disk on every call. This diff introduces a cache layer on top of ReadFirstRecord(), which should significantly speed up repeated calls to GetUpdatesSince().
I also cleaned up some stuff, but the whole TransactionLogIterator could use some refactoring, especially if we see increased usage.
Test Plan: make check
Reviewers: haobo, sdong, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18387
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D17805
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.
Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)
Test Plan: compiles :)
Reviewers: dhruba, haobo, ljin, sdong, yhchiang
Reviewed By: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17835