Summary:
This turns out to be pretty bad because if we prioritize L0->L1 then L1 can grow artificially large, which makes L0->L1 more and more expensive. For example:
256MB @ L0 + 256MB @ L1 --> 512MB @ L1
256MB @ L0 + 512MB @ L1 --> 768MB @ L1
256MB @ L0 + 768MB @ L1 --> 1GB @ L1
....
256MB @ L0 + 10GB @ L1 --> 10.2GB @ L1
At some point we need to start compacting L1->L2 to speed up L0->L1.
Test Plan:
The performance improvement is massive for heavy write workload. This is the benchmark I ran: https://phabricator.fb.com/P19842671. Before this change, the benchmark took 47 minutes to complete. After, the benchmark finished in 2minutes. You can see full results here: https://phabricator.fb.com/P19842674
Also, we ran this diff on MongoDB on RocksDB on one replicaset. Before the change, our initial sync was so slow that it couldn't keep up with primary writes. After the change, the import finished without any issues
Reviewers: dynamike, MarkCallaghan, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38637
Summary: Dump db stats in WARN level
Test Plan: run db_bench and verify the LOG
Reviewers: igor, MarkCallaghan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38691
Summary:
See https://gist.github.com/mdcallag/89ebb2b8cbd331854865 for the IO stats.
I added "Cumulative compaction:" and "Interval compaction:" lines. The IO rates
can be confusing. Rates fro per-level stats lines, Wr(MB/s) & Rd(MB/s), are computed
using the duration of the compaction job. If the job reads 10MB, writes 9MB and the job
(IO & merging) takes 1 second then the rates are 10MB/s for read and 9MB/s for writes.
The IO rates in the Cumulative compaction line uses the total uptime. The IO rates in the
Interval compaction line uses the interval uptime. So these Cumalative & Interval
compaction IO rates cannot be compared to the per-level IO rates. But both forms of
the rates are useful for debugging perf.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D38667
Summary: Not sure why this fails on some compilers and doesn't on others.
Test Plan: none
Reviewers: meyering, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38673
Summary:
sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html
Some users don't want us to call sync_file_range on WALs. Some other do.
Thus, I'm adding a separate option wal_bytes_per_sync to control calling
sync_file_range on WAL files. bytes_per_sync will apply only to table
files now.
Test Plan: no more sync_file_range for WAL as evidenced by strace
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38253
Summary: As title. I spent some time thinking about it and I don't think there should be any issue with running manual compaction and flushes in parallel
Test Plan: make check works
Reviewers: rven, yhchiang, sdong
Reviewed By: yhchiang, sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38355
Summary: DBTest.DynamicLevelMaxBytesCompactRange needs to make sure L0 is not empty to properly cover the code paths we want to cover. However, current codes have a bug that might leave the condition not held. Improve the test to ensure it.
Test Plan: Run the test in an environment that is used to fail. Also run it many times.
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38631
Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level.
Test Plan: Run all unit tests
Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37125
Summary:
Allow GetThreadList to report Flush properties, which includes:
* job id
* number of bytes that has been written since flush started.
* total size of input mem-tables
Test Plan:
./db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=100 --value_size=1000
Sample output from db_bench which tracks same flush job
ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties
140213879898240 High Pri default Flush 5789 us FlushJob::WriteLevel0Table BytesMemtables 4112835 | BytesWritten 577104 | JobID 8 |
ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties
140213879898240 High Pri default Flush 30.634 ms FlushJob::WriteLevel0Table BytesMemtables 4112835 | BytesWritten 1734865 | JobID 8 |
Reviewers: rven, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38505
Summary:
When trying to compact entire database with SuggestCompactRange(), we'll first try the left-most files. This is pretty bad, because:
1) the left part of LSM tree will be overly compacted, but right part will not be touched
2) First compaction will pick up the left-most file. Second compaction will try to pick up next left-most, but this will not be possible, because there's a big chance that second's file range on N+1 level is already being compacted.
I observe both of those problems when running Mongo+RocksDB and trying to compact the DB to clean up tombstones. I'm unable to clean them up :(
This diff adds a bit of randomness into choosing a file. First, it chooses a file at random and tries to compact that one. This should solve both problems specified here.
Test Plan: make check
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38379
Summary: DBTest.DynamicLevelMaxBytesBase2 has a check that is not necessary and may fail. Remove it, and add two unrelated check.
Test Plan: Run the test
Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38457
Summary: Universal compactions with multiple levels should use file preallocation size based on file size if output level is not level 0
Test Plan: Run all tests.
Reviewers: igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38439
Summary: Add --rate_limiter_bytes_per_sec to db_bench to allow rater limit to disk
Test Plan:
Run
./db_bench --benchmarks=fillseq --num=30000000 --rate_limiter_bytes_per_sec=3000000 --num_multi_db=8 -disable_wal
And see io_stats to have the rate limited.
Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38385
Summary:
Fixed the following compile error in db/column_family.cc
db/column_family.cc:633:33: error: ‘ASSERT_GT’ was not declared in this scope
16:14:45 ASSERT_GT(listeners.size(), 0U);
Test Plan: make db_test
Reviewers: igor, sdong, rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38367
Summary:
Fixed a bug in EventListener::OnCompactionCompleted() that returns
incorrect list of input / output file names.
Test Plan: Extend existing test in listener_test.cc
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38349
Summary: This caused a crash of our MongoDB + RocksDB instance. PickCompactionBySize() sets its own parent_index. We never reset this parent_index when picking PickFilesMarkedForCompactionExperimental(). So we might end up doing SetupOtherInputs() with parent_index that was set by PickCompactionBySize, although we're using compaction calculated using PickFilesMarkedForCompactionExperimental.
Test Plan: Added a unit test that fails with assertion on master.
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38337
Summary:
Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch. (This is needed for transactions).
I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep).
Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc. WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)).
Test Plan: Added new unit tests.
Reviewers: rven, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38037
Summary: In new clang we need to add override to every overriden function
Test Plan: none
Reviewers: rven
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D38259
Summary: When reporting compaction that was started because of SuggestCompactRange() we should treat it as manual compaction.
Test Plan: none
Reviewers: yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38139
Summary:
Couple changes:
1. instead of SnapshotList, just take a vector of snapshots
2. don't take a separate parameter is_snapshots_supported. If there are snapshots in the list, that means they are supported. I actually think we should get rid of this notion of snapshots not being supported.
3. don't pass in mutable_cf_options as a parameter. Lifetime of mutable_cf_options is a bit tricky to maintain, so it's better to not pass it in for the whole compaction job. We only really need it when we install the compaction results.
Test Plan: make check
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36627
Summary:
Fixes#6840824, running "make check" on centos6 hits
a deadlock in column_family_test
Test Plan:
seq 10000 | parallel --gnu --eta 't=/dev/shm/rdb-{}; rm -rf
$t; mkdir $t && export TEST_TMPDIR=$t; ./column_family_test > $t/log-{}'
Made the test deterministic by narrrowing the window for the flush.
Reviewers: igor, meyering
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38079
Summary: Optimize GetRange Function by checking the level of the files
Test Plan: pass make all check
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37977
Summary:
[noticed a new warning when building with the very latest gcc]
* db/memtablerep_bench.cc (FLAGS_env): Remove declaration
of unused varaible, to avoid this warning/error:
db/memtablerep_bench.cc:135:22: error: ‘FLAGS_env’ defined but not\
used [-Werror=unused-variable]
static rocksdb::Env* FLAGS_env = rocksdb::Env::Default();
^
Test Plan: compile
Reviewers: ljin, rven, igor.sugak, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37983
Summary:
The default, use one iter for the whole test, isn't good. This cost me
a few hours of debugging and a few days of tessting. For readonly
that isn't realistic and for read-write that keeps a lot of old sst files around.
I remove the option because nothing uses it and not calling gettimeofday per
loop iteration adds about 3% to QPS at 20 threads.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37965
Summary:
CPU profiling reveals GetApproximateSizes as a bottleneck for performance. The current implementation is sub-optimal, it scans every file in every level to compute the result.
We can take advantage of the fact that all levels above 0 are sorted in the increasing order of key ranges and use binary search to locate the starting index. This can reduce the number of comparisons required to compute the result.
Test Plan: We have good test coverage. Run the tests.
Reviewers: sdong, igor, rven, dynamike
Subscribers: dynamike, maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37755
Summary: Before the fix we also marked the bottommost level for compaction. This is wrong because then RocksDB has N+1 levels instead of N as before the compaction.
Test Plan: SuggestCompactRangeTest in db_test
Reviewers: yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37869
Summary: Remove duplicate code. If this diff looks good, I will cleanup other call sites as well.
Test Plan: unit tests
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37761
Summary:
Added these events:
* Recovery start, finish and also when recovery creates a file
* Trivial move
* Compaction start, finish and when compaction creates a file
* Flush start, finish
Also includes small fix to EventLogger
Also added option ROCKSDB_PRINT_EVENTS_TO_STDOUT which is useful when we debug things. I've spent far too much time chasing LOG files.
Still didn't get sst table properties in JSON. They are written very deeply into the stack. I'll address in separate diff.
TODO:
* Write specification. Let's first use this for a while and figure out what's good data to put here, too. After that we'll write spec
* Write tools that parse and analyze LOGs. This can be in python or go. Good intern task.
Test Plan: Ran db_bench with ROCKSDB_PRINT_EVENTS_TO_STDOUT. Here's the output: https://phabricator.fb.com/P19811976
Reviewers: sdong, yhchiang, rven, MarkCallaghan, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37521
Summary:
Make it build for CYGWIN.
Need to define "-std=gnu++11" instead of "-std=c++11" and use some replacement functions.
Test Plan: Build it and run some unit tests in CYGWIN
Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37605
Summary:
CompactRange for universal compaction with num_levels > 1 seems to have a bug. The unit test also has a bug so it doesn't capture the problem.
Fix it. Revert the compact range to the logic equivalent to num_levels=1. Always compact all files together.
It should also fix DBTest.IncreaseUniversalCompactionNumLevels. The issue was that options.write_buffer_size = 100 << 10 and options.write_buffer_size = 100 << 10 are not used in later test scenarios. So write_buffer_size of 4MB was used. The compaction trigger condition is not anymore obvious as expected.
Test Plan: Run the new test and all test suites
Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37551
Summary:
Based on feedback from D37083.
Are all of these correct? In some spaces it seems like we're doing SetMaxPossibleForUserKey() although we want the smallest possible internal key for user key.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37341
Summary: Reading CompactionPicker I noticed this dangerous substraction of two unsigned integers. We should assert to mark this as safe.
Test Plan: make check
Reviewers: anthony, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37041
Summary:
This diff implements a new `DB` method `PromoteL0` which moves all files in L0
to a given level skipping compaction, provided that the files have disjoint
ranges and all levels up to the target level are empty.
This method provides finer-grain control for trivial compactions, and it is
useful for bulk-loading pre-sorted keys. Compared to D34797, it does not change
the semantics of an existing operation, which can impact existing code.
PromoteL0 is designed to work well in combination with the proposed
`GetSstFileWriter`/`AddFile` interface, enabling to "design" the level structure
by populating one level at a time. Such fine-grained control can be very useful
for static or mostly-static databases.
Test Plan: `make check`
Reviewers: IslamAbdelRahman, philipp, MarkCallaghan, yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37107
Summary: Add more logging to help debugging issues.
Test Plan: Run test suites
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37401
Summary: To further distinguish the corruption cases were caused by storage media or in memory states when writing it, add a paranoid check after writing the file to iterate all the rows.
Test Plan: Add a new unit test for it
Reviewers: rven, igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37335
Summary:
A couple of times on Travis, we have had the thread status say that there were no compactions done and since we assert for it, the test failed.
We now fix this by waiting till compaction started.
Test Plan:
run DBTEST::*PreShutdown*
d=/tmp/j; rm -rf $d; seq 200 | parallel --gnu --eta 'd=/tmp/j/d-{}; mkdir -p $d; TEST_TMPDIR=$d ./db_test --gtest_filter=DBTest.PreShutdown* >& '$d'/log-{}'
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37545
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).
All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.
MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.
Test Plan: added a new unit test
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37083
Summary:
The usage I'm fixing here caused trouble on Fedora 21 when
compiling with the current gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC):
db/write_controller_test.cc: In member function ‘virtual void rocksdb::WriteControllerTest_SanityTest_Test::TestBody()’:
db/write_controller_test.cc:23:165: error: converting ‘false’ to pointer type for argument 1 of ‘char testing::internal::IsNullLiteralHelper(testing::internal::Secret*)’ [-Werror=conversion-null]
ASSERT_EQ(false, controller.IsStopped());
^
This change was induced mechanically via:
git grep -l -E 'ASSERT_EQ\(false'|xargs perl -pi -e 's/ASSERT_EQ\(false, /ASSERT_FALSE(/'
git grep -l -E 'ASSERT_EQ\(true'|xargs perl -pi -e 's/ASSERT_EQ\(true, /ASSERT_TRUE(/'
Except for the three in utilities/backupable/backupable_db_test.cc for which
I ended up reformatting (joining lines) in the result.
As for why this problem is exhibited with that version of gcc, and none
of the others I've used (from 4.8.1 through gcc-5.0.0 and newer), I suspect
it's a bug in F21's gcc that has been fixed in gcc-5.0.0.
Test Plan:
"make" now succeed on Fedora 21
Reviewers: ljin, rven, igor.sugak, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37329
Summary: this is not used anywhere
Test Plan: compiles
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37053
Summary: If ExpandWhileOverlapping() we don't clear inputs. That's a bug introduced by my recent patch https://reviews.facebook.net/D36687. However, we have no tests covering ExpandWhileOverlapping(). I created a task t6771252 to add ExpandWhileOverlapping() tests.
Test Plan: make check
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37077
Summary: D36669 introduces a bug that trivial moved data is not going to specific level but the next level, which will incorrectly be level 1 for level 0 compaciton if base level is not level 1. Fixing it by appreciating the output level
Test Plan: Run all tests
Reviewers: MarkCallaghan, rven, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37119