Summary:
The new flag --cache_index_and_filter_blocks sets
BlockBasedTableOptions.cache_index_and_filter_blocks
Test Plan: make db_bench. Working on benchmarks with the new flag.
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D41481
1) Crash in env_win.cc that prevented db_test run to completion and some new tests
2) Fix new corruption tests in DBTest by allowing a shared trunction of files. Note that this is generally needed ONLY for tests.
3) Close database so WAL is closed prior to inducing corruption similar to what we did within Corruption tests.
Summary: Change the naming style of getter and setters according to Google C++ style in compaction.h file
Test Plan: Compilation success
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41265
Summary: Currently there is no test in the suite to test the case where
there are multiple WAL files and there is a corruption in one of them. We have
tests for single WAL file corruption scenarios. Added tests to mock
the scenarios for all combinations of recovery modes and corruption in
specified file locations.
Test Plan: Run make check
Reviewers: sdong igor
CC: leveldb@
Task ID: #7501229
Blame Rev:
Summary: Coverage test has been occasionally failing due to this timing check.
Test Plan: run test
Reviewers: yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41367
Summary: This change enables trivial move if all the input files are non onverlapping while doing Universal Compaction.
Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases.
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40875
Summary:
Fixed a bug in test ThreadStatusSingleCompaction where
SyncPoint traces are not cleared before the test begins
its second iteration.
Test Plan: db_test
Reviewers: sdong, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41337
Summary:
This diff reverts the following two previous diffs related to
DBIter::FindPrevUserKey(), which makes db_stress unstable.
We should bake a better fix for this.
* "Fix a comparison in DBIter::FindPrevUserKey()"
ec70fea4c4.
* "Fixed endless loop in DBIter::FindPrevUserKey()"
acee2b08a2.
Test Plan: db_stress
Reviewers: anthony, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41301
Summary:
This patch adds three test cases for ExpandWhileOverlapping()
to the compaction_picker_test test suite.
ExpandWhileOverlapping() only has an effect if the comparison
function for the internal keys allows for overlapping user
keys in different SST files on the same level. Thus, this
patch adds a comparator based on sequence numbers to
compaction_picker_test for the new test cases.
Test Plan:
- make compaction_picker_test && ./compaction_picker_test
-> All tests pass
- Replace body of ExpandWhileOverlapping() with `return true`
-> Compile and run ./compaction_picker_test as before
-> New tests fail
Reviewers: sdong, yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41277
Summary:
Two issues:
* the input keys to the compaction don't include sequence number.
* sequence number is set to max(seq_num), but it should be set to max(seq_num)+1, because the condition here is strictly-larger (i.e. we will only zero-out sequence number if the DB's sequence number is strictly greater than the key's sequence number): https://github.com/facebook/rocksdb/blob/master/db/compaction_job.cc#L830
Test Plan: make compaction_job_test && ./compaction_job_test
Reviewers: sdong, lovro
Reviewed By: lovro
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41247
Summary:
This fixes the following scenario we've hit:
- we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one,
- before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_,
- hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in,
- a few more hours pass and we run out disk space because of one huge .log file.
Test Plan: `make check`; backported the new test, checked that it fails without this diff
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40893
Summary: Make RocksDb build and run on Windows to be functionally
complete and performant. All existing test cases run with no
regressions. Performance numbers are in the pull-request.
Test plan: make all of the existing unit tests pass, obtain perf numbers.
Co-authored-by: Praveen Rao praveensinghrao@outlook.com
Co-authored-by: Sherlock Huang baihan.huang@gmail.com
Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
Summary: Try to allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena, instead of calling malloc and free.
Test Plan: valgrind check
Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40929
Summary: We have a race in the way test works. We avoided the race by adding the
wait to the counter. I thought 1s was eternity, but that is not true in some
scenarios. Increasing the timeout to 10s and adding warnings.
Also, adding nosleep to avoid the case where the wakeup thread is waiting behind
the sleeping thread for scheduling.
Test Plan: Run make check
Reviewers: siying igorcanadi
CC: leveldb@
Task ID: #7312624
Blame Rev:
Summary:
When seek target is a merge key (`kTypeMerge`), `DBIter::FindNextUserEntry()`
advances the underlying iterator _past_ the current key (`saved_key_`); see
`MergeValuesNewToOld()`. However, `FindPrevUserKey()` assumes that `iter_`
points to an entry with the same user key as `saved_key_`. As a result,
`it->Seek(key) && it->Prev()` can cause the iterator to be positioned at the
_next_, instead of the previous, entry (new test, written by @lovro, reproduces
the bug).
This diff changes `FindPrevUserKey()` to also skip keys that are _greater_ than
`saved_key_`.
Test Plan: db_test
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba, lovro
Differential Revision: https://reviews.facebook.net/D40791
Summary: Make column_family_test runnable in ROCKSDB_LITE.
Test Plan: column_family_test
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40251
Summary: It's not really nice to call user's API with garbage data in new_value. This diff makes sure that new_value is empty before calling the merge operator.
Test Plan: Added assert to Merge operator in merge_test
Reviewers: sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40773
Summary:
Fixes task 7156865 where a compaction causes a hang in flush
memtable if CancelAllBackgroundWork was called prior to it.
Stack trace is in : https://phabricator.fb.com/P19848829
We end up waiting for a flush which will never happen because there are no background threads.
Test Plan: PreShutdownFlush
Reviewers: sdong, igor
Reviewed By: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40617
Summary: #7124486: RocksDB's Iterator.SeekToLast should seek to the last key before iterate_upper_bound if presents
Test Plan: ./db_iter_test run successfully with the new testcase
Reviewers: rven, yhchiang, igor, anthony, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40425
Summary: Make table_properties_collector_test runnable in ROCKSDB_LITE
Test Plan: table_properties_collector_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40581
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.
Test Plan: make check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40527
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.
Supports snapshots and merge operations.
Test Plan: Ran `make valgrind_check commit-prereq`
Reviewers: igor, philipp, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39849
Summary:
The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.
1. RecoverAfterRestart (kTolerateCorruptedTailRecords)
This mocks the current recovery mode.
2. RecoverAfterCleanShutdown (kAbsoluteConsistency)
This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.
3. RecoverPointInTime (kPointInTimeRecovery)
This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.
4. RecoverAfterDisaster (kSkipAnyCorruptRecord)
This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.
Test Plan:
(1) Run added unit test to cover all levels.
(2) Run make check.
Reviewers: leveldb, sdong, igor
Subscribers: yoshinorim, dhruba
Differential Revision: https://reviews.facebook.net/D38487
Summary: Fixing bad merge
Test Plan: make -j64 check (this is not enough to verify the fix)
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40521
Summary:
When there are multiple column families, the flush in
GetLiveFiles is not atomic, so that there are entries in the wal files
which are needed to get a consisten RocksDB. We now add the log files to
the checkpoint.
Test Plan:
CheckpointCF - This test forces more data to be written to
the other column families after the flush of the first column family but
before the second.
Reviewers: igor, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40323
Summary: CompressLevelCompaction() depends on Zlib. We should skip it when zlib is not present.
Test Plan: `make check` without zlib
Reviewers: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40401
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.
This patch fails DB::Open() if we don't support the compression that is specified in the options.
Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39687
Summary:
This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level
Changes in this patch
- Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction
- Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set
Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way
Test Plan:
make check
adding new tests
Reviewers: igor, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40059
Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.
Test Plan: Check out the LOG
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39729
Summary:
Universal compaction can involves in multiple levels. However,
the current implementation of bytes_readn and bytes_readnp1
(and some other stats with postfix `n` and `np1`) assumes compaction
can only have two levels.
This patch fixes this bug and redefines bytes_readn and bytes_readnp1:
* bytes_readnp1: the number of bytes read in the compaction output level.
* bytes_readn: the total number of bytes read minus bytes_readnp1
Test Plan: Add a test in compaction_job_stats_test
Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40239
Summary:
So far, we benchmarked RocksDB by writing as fast as possible. With this change, we're able to limit our write throughput, which should help us better understand how RocksDB performes under varying write workloads.
Specifically, I'm currently interested in the shape of the graph that has write throughput on one axis and write rate on another. This should help us with designing our stall system, as we have started to do with D36351.
Test Plan:
$ ./db_bench --benchmarks=fillrandom --benchmark_write_rate_limit=1000000
fillrandom : 118.523 micros/op 8437 ops/sec; 0.9 MB/s
$ ./db_bench --benchmarks=fillrandom --benchmark_write_rate_limit=2000000
fillrandom : 59.136 micros/op 16910 ops/sec; 1.9 MB/s
Reviewers: MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39759
Summary:
This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters
Old CompactRange is still available but deprecated
Test Plan:
make all check
make rocksdbjava
USE_CLANG=1 make all
OPT=-DROCKSDB_LITE make release
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40209
Summary:
We go to great lengths to make sure MaybeScheduleFlushOrCompaction() is called outside of write thread. But anyway, it's still called in the mutex, so it's not that much cheaper.
This diff removes the "optimization" and cleans up the code a bit.
Test Plan: make check
Reviewers: rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40113
Summary: Block c_test in ROCKSDB_LITE as it's not supported in ROCKSDB_LITE.
Test Plan: c_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40257
Summary:
Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables.
To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>.
Test Plan: Add a test case
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40119
Summary:
This is part of an effort to better understand and optimize RocksDB stalls under high load. I added a feature to db_bench to periodically write QPS to CSV files. That way we can nicely see how our QPS changes in time (especially when DB is stalled) and can do a better job of evaluating our stall system (i.e. we want the QPS to be as constant as possible, as opposed to having bunch of stalls)
Cool part of CSV files is that we can easily graph them -- there are a bunch of tools available.
Test Plan:
Ran ./db_bench --report_interval_seconds=10 --benchmarks=fillrandom --num=10000000
and observed this in report.csv:
secs_elapsed,interval_qps
10,2725860
20,1980480
30,1863456
40,1454359
50,1460389
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40047
Summary: Fixed false alarm of size comparison in compaction_job_stats_test
Test Plan: compaction_job_stats_test
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39921
Summary:
Reverting this diff https://reviews.facebook.net/D39999
Will add an option to force bottom most level compaction and then re submit it
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40041
Summary: If we don't have a compaction filter then we can skip compacting the bottom most level
Test Plan:
make check
added unit tests
Reviewers: yhchiang, sdong, igor
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39999
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.
The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work
hard_rate_limit is deprecated.
options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.
Test Plan: Add new unit tests in db_test
Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor
Reviewed By: igor
Subscribers: ikabiljo, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D36351
Summary: With experimental feature SuggestCompactRange() we don't restrict running two L0->L1 compactions in parallel. This diff fixes this.
Test Plan: added a unit test to reproduce the failure. fixed the unit test
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39981
Summary:
Adding largest sequence number to FlushJobInfo
and passing flushed file metadata to NotifyOnFlushCompleted which include alot of other values that we may want to expose in FlushJobInfo
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39927
Summary:
Add Env::GetThreadID(), which returns the ID of the current thread.
In addition, make GetThreadList() and InfoLog use same unique ID for the same thread.
Test Plan:
db_test
listener_test
Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39735
Summary:
Right now the level we pass to ReFitLevel is the maximum level with files (before compaction), there are multiple cases where this maximum level have changed after compaction
- all files where in L0 (now maximum level is L1)
- using kCompactionStyleUniversal (now maximum level in the last level)
- level_compaction_dynamic_level_bytes ??
We can handle each of these cases individually, but I felt it's safer to calculate max_level_with_files again if we want to do a ReFitLevel
Test Plan:
adding some tests
make -j64 check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: ott, dhruba
Differential Revision: https://reviews.facebook.net/D39663
Summary: Some test and benchmark codes don't build for CYGWIN. Fix it.
Test Plan: Build "make all" with TARGET_OS=Cygwin on cygwin and make sure it passes.
Reviewers: rven, yhchiang, anthony, igor, kradhakrishnan
Reviewed By: igor, kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39711
Summary:
When there are files marked for compaction after compactions, print extra messages to help debugging. Example:
2015/06/08-23:12:55.212855 7ff5013ff700 [default] [JOB 121] Generated table #75: 54 keys, 4807 bytes (need compaction)
2015/06/08-23:12:55.556194 7ff5013ff700 (Original Log Time 2015/06/08-23:12:55.556160) [default] compacted to: base level 1 max bytes base
10240 files[0 1 9 32 12 0 0 0] max score 0.96 (2 files need compaction), MB/sec: 0.0 rd, 0.1 wr, level 2, files in(1, 3) out(5) MB in(0.0,
0.0) out(0.0), read-write-amplify(11.3) write-amplify(5.7) OK, records in: 40, records dropped: 0
Test Plan:
Run test and see LOG files.
valgrind test DBTest.TablePropertiesNeedCompactTest
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39771
Summary:
There is a hang during DB close in the following scenario:
a) a load with WAL disabled was done,
b) CancelAllBackgroundWork was called,
c) DB Close was called
This was because in that we will wait for a flush but we cannot do a
background flush because we have called CancelAllBackgroundWork which
marks the DB as shutting downn.
Test Plan: Added DBTest FlushOnDestroy
Reviewers: sdong
Reviewed By: sdong
Subscribers: yoshinorim, hermanlee4, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39747
Summary: We currently issue malloc and free inside DB mutex in GetSnapshot() and ReleaseSnapshot(). Move them out.
Test Plan:
Go through all tests
make valgrind_check
Reviewers: yhchiang, rven, IslamAbdelRahman, anthony, igor
Reviewed By: igor
Subscribers: maykov, hermanlee4, MarkCallaghan, yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39753
Summary:
Replacing the default value for compaction_filter_factory and compaction_filter_factory_v2 to be nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2
The reason for this is to be able to determine easily if we have compaction filter factory or not without depending on RTTI
Test Plan: make check
Reviewers: yoshinorim, ott, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39693
Summary: key_sizes claims that 3rd key is of length 8, but it's really only 3. This diff makes it length 8.
Test Plan: asan c_test works again.
Reviewers: sdong, yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39699
Summary:
It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file.
It can be used to allow users to suggest DB to clear up delete tombstones faster.
Test Plan: Add a unit test.
Reviewers: igor, yhchiang, kradhakrishnan, rven
Reviewed By: rven
Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39585
Summary:
EventListener::OnFlushCompleted() now passes a structure instead
of a list of parameters. This minimizes the API change in the
future.
Test Plan:
listener_test
compact_files_test
example/compact_files_example
Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor
Reviewed By: rven, igor
Subscribers: IslamAbdelRahman, rven, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39543
Summary: I encountered an issue where the database hang, it looks like the mutex is not unlocked on return in ReFitLevel function
Test Plan: make -j64 check
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39609
Summary:
The type of smallest_output_key_prefix and largest_output_key_prefix
have been changed to std::string in https://reviews.facebook.net/D39537.
As a result, we shouldn't do smallest_output_key_prefix[0] = 0 in the
initialization.
Test Plan: compile db_test with tsan enabled and repeat DBTest.CompactionDeletionTrigger test to verify the tsan issue has been gone.
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39645
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
Summary:
Keys in RocksDB can be arbitrary byte strings. However, in the current
CompactionJobStats, smallest_output_key_prefix and largest_output_key_prefix
are of type char[] without having a length, which is insufficient to handle
non-null terminated strings.
This patch change their type to std::string.
Test Plan: compaction_job_stats_test
Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39537
Summary:
Add EventListener::OnTableFileDeletion(), which will be
called when a table file is deleted.
Test Plan: Extend three existing tests in db_test to verify the deleted files.
Reviewers: rven, anthony, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38931
Summary: We need to start doing some CI on Macs.
Test Plan: works now
Reviewers: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39489
Summary:
DBTest.MigrateToDynamicLevelMaxBytesBase with valgrind test is
extremely slow. Work it around by not having both threads running
everything non-stop.
Test Plan: Run the test with valgrind which used to take too long to finish and see it finish in reasonable time.
Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39477
Summary: Remove a TODO that has been done
Test Plan: make
Reviewers: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39429
Summary:
Allow EventListener::OnCompactionCompleted to return CompactionJobStats,
which contains useful information about a compaction.
Example CompactionJobStats returned by OnCompactionCompleted():
smallest_output_key_prefix 05000000
largest_output_key_prefix 06990000
elapsed_time 42419
num_input_records 300
num_input_files 3
num_input_files_at_output_level 2
num_output_records 200
num_output_files 1
actual_bytes_input 167200
actual_bytes_output 110688
total_input_raw_key_bytes 5400
total_input_raw_value_bytes 300000
num_records_replaced 100
is_manual_compaction 1
Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats.
Reviewers: rven, igor, anthony, sdong
Reviewed By: sdong
Subscribers: tnovak, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38463
Summary:
Fixed the following compile warning in listener_test.cc:
db/listener_test.cc:214:8: error: 'OnTableFileCreated' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
14:16:46 void OnTableFileCreated(
Test Plan:
make listener_test
Reviewers: sdong, igor
Subscribers: leveldb
Summary:
Add EventListener::OnTableFileCreated(), which will be called
when a table file is created. This patch is part of the
EventLogger and EventListener integration.
Test Plan: Augment existing test in db/listener_test.cc
Reviewers: anthony, kradhakrishnan, rven, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38865
Summary: Add a stats counter for DB_WRITE back which was mistakenly removed.
Test Plan: augment GroupCommitTest
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39399
Summary:
We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls:
- perf_context.db_mutex_lock_nanos
- perf_context.db_condition_wait_nanos
- iostats_context.open_time
- iostats_context.allocate_time
- iostats_context.write_time
- iostats_context.range_sync_time
- iostats_context.logger_time
In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long.
Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took.
Reviewers: rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: march, dhruba, tnovak
Differential Revision: https://reviews.facebook.net/D39177
Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true.
Test Plan: Add a unit test for it.
Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39099
Summary:
DBImpl::notifying_events_ is a internal counter in DBImpl which is
used to prevent DB close when DB is notifying events. However, as
the current events all rely on either compaction or flush which
already have similar counters to prevent DB close, it is safe to
remove notifying_events_.
Test Plan:
listener_test
examples/compact_files_example
Reviewers: igor, anthony, kradhakrishnan, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39315
Summary: Broken by optimistic transaction diff. (I only built 'release' not 'static_lib' when testing).
Test Plan: build
Reviewers: yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39219
Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported.
Test Plan: Added a new test, but still need to look into stress testing.
Reviewers: yhchiang, igor, rven, sdong
Reviewed By: sdong
Subscribers: adamretter, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D33435
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to someone keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.
After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.
This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).
However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.
Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here.
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37443
Summary:
Rename EventLoggerHelpers EventHelpers, as it's going to include
all event-related helper functions instead of EventLogger only stuffs.
Test Plan: make
Reviewers: sdong, rven, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39093
Summary:
Compaction now boosts the size of deletion entries of a file only when
the number of deletion entries is greater than the number of non-deletion
entries in the file. The motivation here is that in a stable workload,
the number of deletion entries should be roughly equal to the number of
non-deletion entries. If we compensate the size of deletion entries in a
stable workload, the deletion compensation logic might introduce unwanted
effet which changes the shape of LSM tree.
Test Plan: db_test --gtest_filter="*Deletion*"
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38703
Summary:
Fixed a missing "}" at the end of the generated JSON Log
in EventLoggerHelpers::LogTableFileCreation.
Test Plan: db_bench
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38919
Summary: Removed an unused private variable in db_impl.h
Test Plan: make db_test
Reviewers: sdong, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38925
Summary: We have a bug where we don't report the last level's files as being compacted. This fixes it.
Test Plan: See the fix in action here: https://phabricator.fb.com/P19845738
Reviewers: MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38727
Summary:
This patch fixes the following two bugs on logging file deletion.
1. Previously, file deletion failure was only logged in INFO_LEVEL.
This patch changes it to ERROR_LEVEL and does some code clean.
2. EventLogger previously will always generate the same log on
table file deletion even when file deletion is not successful.
Now the resulting status of file deletion will also be logged.
Test Plan: make all check
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38817
Summary: Ensure ColumnFamilyOptions.num_levels >= 2 when level compaction is used.
Test Plan: Extend SanitizeOptions test in column_family_test
Reviewers: sdong, rven, anthony, krishnanm86, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38829
Summary: Avoid logging under mutex in DBImpl::WriteLevel0TableForRecovery().
Test Plan: make all check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38823
Summary:
Allow EventLogger to directly log from a JSONWriter. This allows
the JSONWriter to be shared by EventLogger and potentially EventListener,
which is an important step to integrate EventLogger and EventListener.
This patch also rewrites EventLoggerHelpers::LogTableFileCreation(),
which uses the new API to generate identical log.
Test Plan:
Run db_bench in debug mode and make sure the log is correct and no
assertions fail.
Reviewers: sdong, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38709
Summary:
This turns out to be pretty bad because if we prioritize L0->L1 then L1 can grow artificially large, which makes L0->L1 more and more expensive. For example:
256MB @ L0 + 256MB @ L1 --> 512MB @ L1
256MB @ L0 + 512MB @ L1 --> 768MB @ L1
256MB @ L0 + 768MB @ L1 --> 1GB @ L1
....
256MB @ L0 + 10GB @ L1 --> 10.2GB @ L1
At some point we need to start compacting L1->L2 to speed up L0->L1.
Test Plan:
The performance improvement is massive for heavy write workload. This is the benchmark I ran: https://phabricator.fb.com/P19842671. Before this change, the benchmark took 47 minutes to complete. After, the benchmark finished in 2minutes. You can see full results here: https://phabricator.fb.com/P19842674
Also, we ran this diff on MongoDB on RocksDB on one replicaset. Before the change, our initial sync was so slow that it couldn't keep up with primary writes. After the change, the import finished without any issues
Reviewers: dynamike, MarkCallaghan, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38637
Summary: Dump db stats in WARN level
Test Plan: run db_bench and verify the LOG
Reviewers: igor, MarkCallaghan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38691
Summary:
See https://gist.github.com/mdcallag/89ebb2b8cbd331854865 for the IO stats.
I added "Cumulative compaction:" and "Interval compaction:" lines. The IO rates
can be confusing. Rates fro per-level stats lines, Wr(MB/s) & Rd(MB/s), are computed
using the duration of the compaction job. If the job reads 10MB, writes 9MB and the job
(IO & merging) takes 1 second then the rates are 10MB/s for read and 9MB/s for writes.
The IO rates in the Cumulative compaction line uses the total uptime. The IO rates in the
Interval compaction line uses the interval uptime. So these Cumalative & Interval
compaction IO rates cannot be compared to the per-level IO rates. But both forms of
the rates are useful for debugging perf.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D38667
Summary: Not sure why this fails on some compilers and doesn't on others.
Test Plan: none
Reviewers: meyering, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38673
Summary:
sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html
Some users don't want us to call sync_file_range on WALs. Some other do.
Thus, I'm adding a separate option wal_bytes_per_sync to control calling
sync_file_range on WAL files. bytes_per_sync will apply only to table
files now.
Test Plan: no more sync_file_range for WAL as evidenced by strace
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38253
Summary: As title. I spent some time thinking about it and I don't think there should be any issue with running manual compaction and flushes in parallel
Test Plan: make check works
Reviewers: rven, yhchiang, sdong
Reviewed By: yhchiang, sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38355
Summary: DBTest.DynamicLevelMaxBytesCompactRange needs to make sure L0 is not empty to properly cover the code paths we want to cover. However, current codes have a bug that might leave the condition not held. Improve the test to ensure it.
Test Plan: Run the test in an environment that is used to fail. Also run it many times.
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38631
Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level.
Test Plan: Run all unit tests
Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37125
Summary:
Allow GetThreadList to report Flush properties, which includes:
* job id
* number of bytes that has been written since flush started.
* total size of input mem-tables
Test Plan:
./db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=100 --value_size=1000
Sample output from db_bench which tracks same flush job
ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties
140213879898240 High Pri default Flush 5789 us FlushJob::WriteLevel0Table BytesMemtables 4112835 | BytesWritten 577104 | JobID 8 |
ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties
140213879898240 High Pri default Flush 30.634 ms FlushJob::WriteLevel0Table BytesMemtables 4112835 | BytesWritten 1734865 | JobID 8 |
Reviewers: rven, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38505
Summary:
When trying to compact entire database with SuggestCompactRange(), we'll first try the left-most files. This is pretty bad, because:
1) the left part of LSM tree will be overly compacted, but right part will not be touched
2) First compaction will pick up the left-most file. Second compaction will try to pick up next left-most, but this will not be possible, because there's a big chance that second's file range on N+1 level is already being compacted.
I observe both of those problems when running Mongo+RocksDB and trying to compact the DB to clean up tombstones. I'm unable to clean them up :(
This diff adds a bit of randomness into choosing a file. First, it chooses a file at random and tries to compact that one. This should solve both problems specified here.
Test Plan: make check
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38379
Summary: DBTest.DynamicLevelMaxBytesBase2 has a check that is not necessary and may fail. Remove it, and add two unrelated check.
Test Plan: Run the test
Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38457
Summary: Universal compactions with multiple levels should use file preallocation size based on file size if output level is not level 0
Test Plan: Run all tests.
Reviewers: igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38439
Summary: Add --rate_limiter_bytes_per_sec to db_bench to allow rater limit to disk
Test Plan:
Run
./db_bench --benchmarks=fillseq --num=30000000 --rate_limiter_bytes_per_sec=3000000 --num_multi_db=8 -disable_wal
And see io_stats to have the rate limited.
Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38385
Summary:
Fixed the following compile error in db/column_family.cc
db/column_family.cc:633:33: error: ‘ASSERT_GT’ was not declared in this scope
16:14:45 ASSERT_GT(listeners.size(), 0U);
Test Plan: make db_test
Reviewers: igor, sdong, rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38367
Summary:
Fixed a bug in EventListener::OnCompactionCompleted() that returns
incorrect list of input / output file names.
Test Plan: Extend existing test in listener_test.cc
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38349
Summary: This caused a crash of our MongoDB + RocksDB instance. PickCompactionBySize() sets its own parent_index. We never reset this parent_index when picking PickFilesMarkedForCompactionExperimental(). So we might end up doing SetupOtherInputs() with parent_index that was set by PickCompactionBySize, although we're using compaction calculated using PickFilesMarkedForCompactionExperimental.
Test Plan: Added a unit test that fails with assertion on master.
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38337
Summary:
Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch. (This is needed for transactions).
I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep).
Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc. WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)).
Test Plan: Added new unit tests.
Reviewers: rven, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38037
Summary: In new clang we need to add override to every overriden function
Test Plan: none
Reviewers: rven
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D38259
Summary: When reporting compaction that was started because of SuggestCompactRange() we should treat it as manual compaction.
Test Plan: none
Reviewers: yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38139
Summary:
Couple changes:
1. instead of SnapshotList, just take a vector of snapshots
2. don't take a separate parameter is_snapshots_supported. If there are snapshots in the list, that means they are supported. I actually think we should get rid of this notion of snapshots not being supported.
3. don't pass in mutable_cf_options as a parameter. Lifetime of mutable_cf_options is a bit tricky to maintain, so it's better to not pass it in for the whole compaction job. We only really need it when we install the compaction results.
Test Plan: make check
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36627
Summary:
Fixes#6840824, running "make check" on centos6 hits
a deadlock in column_family_test
Test Plan:
seq 10000 | parallel --gnu --eta 't=/dev/shm/rdb-{}; rm -rf
$t; mkdir $t && export TEST_TMPDIR=$t; ./column_family_test > $t/log-{}'
Made the test deterministic by narrrowing the window for the flush.
Reviewers: igor, meyering
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38079
Summary: Optimize GetRange Function by checking the level of the files
Test Plan: pass make all check
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37977
Summary:
[noticed a new warning when building with the very latest gcc]
* db/memtablerep_bench.cc (FLAGS_env): Remove declaration
of unused varaible, to avoid this warning/error:
db/memtablerep_bench.cc:135:22: error: ‘FLAGS_env’ defined but not\
used [-Werror=unused-variable]
static rocksdb::Env* FLAGS_env = rocksdb::Env::Default();
^
Test Plan: compile
Reviewers: ljin, rven, igor.sugak, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37983
Summary:
The default, use one iter for the whole test, isn't good. This cost me
a few hours of debugging and a few days of tessting. For readonly
that isn't realistic and for read-write that keeps a lot of old sst files around.
I remove the option because nothing uses it and not calling gettimeofday per
loop iteration adds about 3% to QPS at 20 threads.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37965
Summary:
CPU profiling reveals GetApproximateSizes as a bottleneck for performance. The current implementation is sub-optimal, it scans every file in every level to compute the result.
We can take advantage of the fact that all levels above 0 are sorted in the increasing order of key ranges and use binary search to locate the starting index. This can reduce the number of comparisons required to compute the result.
Test Plan: We have good test coverage. Run the tests.
Reviewers: sdong, igor, rven, dynamike
Subscribers: dynamike, maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37755
Summary: Before the fix we also marked the bottommost level for compaction. This is wrong because then RocksDB has N+1 levels instead of N as before the compaction.
Test Plan: SuggestCompactRangeTest in db_test
Reviewers: yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37869
Summary: Remove duplicate code. If this diff looks good, I will cleanup other call sites as well.
Test Plan: unit tests
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37761
Summary:
Added these events:
* Recovery start, finish and also when recovery creates a file
* Trivial move
* Compaction start, finish and when compaction creates a file
* Flush start, finish
Also includes small fix to EventLogger
Also added option ROCKSDB_PRINT_EVENTS_TO_STDOUT which is useful when we debug things. I've spent far too much time chasing LOG files.
Still didn't get sst table properties in JSON. They are written very deeply into the stack. I'll address in separate diff.
TODO:
* Write specification. Let's first use this for a while and figure out what's good data to put here, too. After that we'll write spec
* Write tools that parse and analyze LOGs. This can be in python or go. Good intern task.
Test Plan: Ran db_bench with ROCKSDB_PRINT_EVENTS_TO_STDOUT. Here's the output: https://phabricator.fb.com/P19811976
Reviewers: sdong, yhchiang, rven, MarkCallaghan, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37521
Summary:
Make it build for CYGWIN.
Need to define "-std=gnu++11" instead of "-std=c++11" and use some replacement functions.
Test Plan: Build it and run some unit tests in CYGWIN
Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37605
Summary:
CompactRange for universal compaction with num_levels > 1 seems to have a bug. The unit test also has a bug so it doesn't capture the problem.
Fix it. Revert the compact range to the logic equivalent to num_levels=1. Always compact all files together.
It should also fix DBTest.IncreaseUniversalCompactionNumLevels. The issue was that options.write_buffer_size = 100 << 10 and options.write_buffer_size = 100 << 10 are not used in later test scenarios. So write_buffer_size of 4MB was used. The compaction trigger condition is not anymore obvious as expected.
Test Plan: Run the new test and all test suites
Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37551
Summary:
Based on feedback from D37083.
Are all of these correct? In some spaces it seems like we're doing SetMaxPossibleForUserKey() although we want the smallest possible internal key for user key.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37341
Summary: Reading CompactionPicker I noticed this dangerous substraction of two unsigned integers. We should assert to mark this as safe.
Test Plan: make check
Reviewers: anthony, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37041