Summary: #7124486: RocksDB's Iterator.SeekToLast should seek to the last key before iterate_upper_bound if presents
Test Plan: ./db_iter_test run successfully with the new testcase
Reviewers: rven, yhchiang, igor, anthony, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40425
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.
Test Plan: make check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40527
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.
Supports snapshots and merge operations.
Test Plan: Ran `make valgrind_check commit-prereq`
Reviewers: igor, philipp, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39849
Summary:
The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.
1. RecoverAfterRestart (kTolerateCorruptedTailRecords)
This mocks the current recovery mode.
2. RecoverAfterCleanShutdown (kAbsoluteConsistency)
This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.
3. RecoverPointInTime (kPointInTimeRecovery)
This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.
4. RecoverAfterDisaster (kSkipAnyCorruptRecord)
This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.
Test Plan:
(1) Run added unit test to cover all levels.
(2) Run make check.
Reviewers: leveldb, sdong, igor
Subscribers: yoshinorim, dhruba
Differential Revision: https://reviews.facebook.net/D38487
Summary:
When there are multiple column families, the flush in
GetLiveFiles is not atomic, so that there are entries in the wal files
which are needed to get a consisten RocksDB. We now add the log files to
the checkpoint.
Test Plan:
CheckpointCF - This test forces more data to be written to
the other column families after the flush of the first column family but
before the second.
Reviewers: igor, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40323
Summary: CompressLevelCompaction() depends on Zlib. We should skip it when zlib is not present.
Test Plan: `make check` without zlib
Reviewers: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40401
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.
This patch fails DB::Open() if we don't support the compression that is specified in the options.
Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39687
Summary:
This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level
Changes in this patch
- Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction
- Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set
Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way
Test Plan:
make check
adding new tests
Reviewers: igor, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40059
Summary:
This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters
Old CompactRange is still available but deprecated
Test Plan:
make all check
make rocksdbjava
USE_CLANG=1 make all
OPT=-DROCKSDB_LITE make release
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40209
Summary:
Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables.
To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>.
Test Plan: Add a test case
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40119
Summary:
Reverting this diff https://reviews.facebook.net/D39999
Will add an option to force bottom most level compaction and then re submit it
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40041
Summary: If we don't have a compaction filter then we can skip compacting the bottom most level
Test Plan:
make check
added unit tests
Reviewers: yhchiang, sdong, igor
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39999
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.
The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work
hard_rate_limit is deprecated.
options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.
Test Plan: Add new unit tests in db_test
Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor
Reviewed By: igor
Subscribers: ikabiljo, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D36351
Summary: With experimental feature SuggestCompactRange() we don't restrict running two L0->L1 compactions in parallel. This diff fixes this.
Test Plan: added a unit test to reproduce the failure. fixed the unit test
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39981
Summary:
Right now the level we pass to ReFitLevel is the maximum level with files (before compaction), there are multiple cases where this maximum level have changed after compaction
- all files where in L0 (now maximum level is L1)
- using kCompactionStyleUniversal (now maximum level in the last level)
- level_compaction_dynamic_level_bytes ??
We can handle each of these cases individually, but I felt it's safer to calculate max_level_with_files again if we want to do a ReFitLevel
Test Plan:
adding some tests
make -j64 check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: ott, dhruba
Differential Revision: https://reviews.facebook.net/D39663
Summary: Some test and benchmark codes don't build for CYGWIN. Fix it.
Test Plan: Build "make all" with TARGET_OS=Cygwin on cygwin and make sure it passes.
Reviewers: rven, yhchiang, anthony, igor, kradhakrishnan
Reviewed By: igor, kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39711
Summary:
There is a hang during DB close in the following scenario:
a) a load with WAL disabled was done,
b) CancelAllBackgroundWork was called,
c) DB Close was called
This was because in that we will wait for a flush but we cannot do a
background flush because we have called CancelAllBackgroundWork which
marks the DB as shutting downn.
Test Plan: Added DBTest FlushOnDestroy
Reviewers: sdong
Reviewed By: sdong
Subscribers: yoshinorim, hermanlee4, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39747
Summary:
It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file.
It can be used to allow users to suggest DB to clear up delete tombstones faster.
Test Plan: Add a unit test.
Reviewers: igor, yhchiang, kradhakrishnan, rven
Reviewed By: rven
Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39585
Summary:
The type of smallest_output_key_prefix and largest_output_key_prefix
have been changed to std::string in https://reviews.facebook.net/D39537.
As a result, we shouldn't do smallest_output_key_prefix[0] = 0 in the
initialization.
Test Plan: compile db_test with tsan enabled and repeat DBTest.CompactionDeletionTrigger test to verify the tsan issue has been gone.
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39645
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
Summary:
Add EventListener::OnTableFileDeletion(), which will be
called when a table file is deleted.
Test Plan: Extend three existing tests in db_test to verify the deleted files.
Reviewers: rven, anthony, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38931
Summary:
DBTest.MigrateToDynamicLevelMaxBytesBase with valgrind test is
extremely slow. Work it around by not having both threads running
everything non-stop.
Test Plan: Run the test with valgrind which used to take too long to finish and see it finish in reasonable time.
Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39477
Summary: Add a stats counter for DB_WRITE back which was mistakenly removed.
Test Plan: augment GroupCommitTest
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39399
Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true.
Test Plan: Add a unit test for it.
Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39099
Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported.
Test Plan: Added a new test, but still need to look into stress testing.
Reviewers: yhchiang, igor, rven, sdong
Reviewed By: sdong
Subscribers: adamretter, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D33435
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to someone keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.
After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.
This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).
However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.
Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here.
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37443
Summary: As title. I spent some time thinking about it and I don't think there should be any issue with running manual compaction and flushes in parallel
Test Plan: make check works
Reviewers: rven, yhchiang, sdong
Reviewed By: yhchiang, sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38355
Summary: DBTest.DynamicLevelMaxBytesCompactRange needs to make sure L0 is not empty to properly cover the code paths we want to cover. However, current codes have a bug that might leave the condition not held. Improve the test to ensure it.
Test Plan: Run the test in an environment that is used to fail. Also run it many times.
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38631
Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level.
Test Plan: Run all unit tests
Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37125
Summary: DBTest.DynamicLevelMaxBytesBase2 has a check that is not necessary and may fail. Remove it, and add two unrelated check.
Test Plan: Run the test
Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38457
Summary: Universal compactions with multiple levels should use file preallocation size based on file size if output level is not level 0
Test Plan: Run all tests.
Reviewers: igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38439
Summary:
Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch. (This is needed for transactions).
I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep).
Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc. WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)).
Test Plan: Added new unit tests.
Reviewers: rven, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38037
Summary: In new clang we need to add override to every overriden function
Test Plan: none
Reviewers: rven
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D38259
Summary: When reporting compaction that was started because of SuggestCompactRange() we should treat it as manual compaction.
Test Plan: none
Reviewers: yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38139
Summary: Before the fix we also marked the bottommost level for compaction. This is wrong because then RocksDB has N+1 levels instead of N as before the compaction.
Test Plan: SuggestCompactRangeTest in db_test
Reviewers: yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37869
Summary:
CompactRange for universal compaction with num_levels > 1 seems to have a bug. The unit test also has a bug so it doesn't capture the problem.
Fix it. Revert the compact range to the logic equivalent to num_levels=1. Always compact all files together.
It should also fix DBTest.IncreaseUniversalCompactionNumLevels. The issue was that options.write_buffer_size = 100 << 10 and options.write_buffer_size = 100 << 10 are not used in later test scenarios. So write_buffer_size of 4MB was used. The compaction trigger condition is not anymore obvious as expected.
Test Plan: Run the new test and all test suites
Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37551
Summary:
This diff implements a new `DB` method `PromoteL0` which moves all files in L0
to a given level skipping compaction, provided that the files have disjoint
ranges and all levels up to the target level are empty.
This method provides finer-grain control for trivial compactions, and it is
useful for bulk-loading pre-sorted keys. Compared to D34797, it does not change
the semantics of an existing operation, which can impact existing code.
PromoteL0 is designed to work well in combination with the proposed
`GetSstFileWriter`/`AddFile` interface, enabling to "design" the level structure
by populating one level at a time. Such fine-grained control can be very useful
for static or mostly-static databases.
Test Plan: `make check`
Reviewers: IslamAbdelRahman, philipp, MarkCallaghan, yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37107
Summary: To further distinguish the corruption cases were caused by storage media or in memory states when writing it, add a paranoid check after writing the file to iterate all the rows.
Test Plan: Add a new unit test for it
Reviewers: rven, igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37335
Summary:
A couple of times on Travis, we have had the thread status say that there were no compactions done and since we assert for it, the test failed.
We now fix this by waiting till compaction started.
Test Plan:
run DBTEST::*PreShutdown*
d=/tmp/j; rm -rf $d; seq 200 | parallel --gnu --eta 'd=/tmp/j/d-{}; mkdir -p $d; TEST_TMPDIR=$d ./db_test --gtest_filter=DBTest.PreShutdown* >& '$d'/log-{}'
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37545
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).
All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.
MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.
Test Plan: added a new unit test
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37083
Summary: D36669 introduces a bug that trivial moved data is not going to specific level but the next level, which will incorrectly be level 1 for level 0 compaciton if base level is not level 1. Fixing it by appreciating the output level
Test Plan: Run all tests
Reviewers: MarkCallaghan, rven, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37119
Summary:
Recent change of DBTest.DynamicLevelCompressionPerLevel2 has a bug that the second sync point is not enabled. Fix it. Also add an assert for that.
Also, flush compression is not tracked in the test. Add it.
Test Plan: Build everything
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37101
Summary: When commiting the sync point interface change, didn't resolve the new occurance of the old interface in rebase. Fix it.
Test Plan: Build and see it pass
Reviewers: igor, yhchiang, rven, anthony, kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37095
Summary:
Allow users to give a callback function with parameter using sync point, so more complicated verification can be done in tests.
Use it in DBTest.DynamicLevelCompressionPerLevel2 so that failures will be more easy to debug.
Test Plan: Run all tests. Run DBTest.DynamicLevelCompressionPerLevel2 with valgrind check.
Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D36999
Summary: https://reviews.facebook.net/D36963 made the debug build much faster and that triggered failures of CompactFilesOnLevelCompaction test. 3 out of 4 last tests on Jenkins failed. I'm disabling this test temporarily, since we likely know the reason why it's failing and there's already work in progress to address it -- https://reviews.facebook.net/D36225
Test Plan: none
Reviewers: sdong, rven, yhchiang, meyering
Reviewed By: meyering
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36993
Summary:
The problem is that sometimes two memtables will be compacted together into a single file. In that case, our assertion
ASSERT_EQ(NumTableFilesAtLevel(0), 5);
fails because same amount of data is in 4 files instead of 5. We should wait for flush so that we prevent two memtables merging into a single file.
Test Plan: `for i in `seq 20`; do mrtest FIFOCompactionTest; done` -- fails at least once before. fails zero times after.
Reviewers: rven
Reviewed By: rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36939
Summary:
The goal of this diff is to make Compaction class easier to use. This should also make new compaction algorithms easier to write (like CompactFiles from @yhchiang and dynamic leveled and multi-leveled universal from @sdong).
Here are couple of things demonstrating that Compaction class is hard to use:
1. we have two constructors of Compaction class
2. there's this thing called grandparents_, but it appears to only be setup for leveled compaction and not compactfiles
3. it's easy to introduce a subtle and dangerous bug like this: D36225
4. SetupBottomMostLevel() is hard to understand and it shouldn't be. See this comment: afbafeaeae/db/compaction.cc (L236-L241). It also made it harder for @yhchiang to write CompactFiles, as evidenced by this: afbafeaeae/db/compaction_picker.cc (L204-L210)
The problem is that we create Compaction object, which holds a lot of state, and then pass it around to some functions. After those functions are done mutating, then we call couple of functions on Compaction object, like SetupBottommostLevel() and MarkFilesBeingCompacted(). It is very hard to see what's happening with all that Compaction's state while it's travelling across different functions. If you're writing a new PickCompaction() function you need to try really hard to understand what are all the functions you need to run on Compaction object and what state you need to setup.
My proposed solution is to make important parts of Compaction immutable after construction. PickCompaction() should calculate compaction inputs and then pass them onto Compaction object once they are finalized. That makes it easy to create a new compaction -- just provide all the parameters to the constructor and you're done. No need to call confusing functions after you created your object.
This diff doesn't fully achieve that goal, but it comes pretty close. Here are some of the changes:
* have one Compaction constructor instead of two.
* inputs_ is constant after construction
* MarkFilesBeingCompacted() is now private to Compaction class and automatically called on construction/destruction.
* SetupBottommostLevel() is gone. Compaction figures it out on its own based on the input.
* CompactionPicker's functions are not passing around Compaction object anymore. They are only passing around the state that they need.
Test Plan:
make check
make asan_check
make valgrind_check
Reviewers: rven, anthony, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: sdong, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36687
Summary: Now trivial move is only triggered when moving from level n to n+1. With dynamic level base, it is possible that file is moved from level 0 to level n, while levels from 1 to n-1 are empty. Extend trivial move to this case.
Test Plan: Add a more unit test of sequential loading. Non-trivial compaction happened without the patch and now doesn't happen.
Reviewers: rven, yhchiang, MarkCallaghan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba, IslamAbdelRahman
Differential Revision: https://reviews.facebook.net/D36669
Summary: Now we add warnings when user configures compression and the compression is not supported.
Test Plan:
Configured compression to non-supported values. Observed messages in my log:
2015/03/26-12:17:57.586341 7ffb8a496840 [WARN] Compression type chosen for level 2 is not supported: LZ4. RocksDB will not compress data on level 2.
2015/03/26-12:19:10.768045 7f36f15c5840 [WARN] Compression type chosen is not supported: LZ4. RocksDB will not compress data.
Reviewers: rven, sdong, yhchiang
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35979
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
Summary: In some db_test tests sync points are not cleared which will cause unexpected results in the next tests. Clean them up in test cleaning up.
Test Plan:
Run the same tests that used to fail:
build using USE_CLANG=1 and run
./db_test --gtest_filter="DBTest.CompressLevelCompaction:*DBTestUniversalCompactionParallel*"
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D36429
Summary:
This diff fixes a crash found when an empty database is opened in readonly mode.
We now check the number of levels before we open the DB as a compacted DB.
Test Plan: DBTest.EmptyCompactedDB
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36327
Summary:
After recent change of DBTest.DynamicCompactionOptions, occasionally hit another non-deterministic case where L0 showdown is triggered while timeout should not triggered for hard limit.
Fix it by increasing L0 slowdown trigger at the same time.
Test Plan: Run the failed test.
Reviewers: igor, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D36219
Summary:
With this change, we use L1 and up to store compaction outputs in universal compaction.
The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible.
If options.num_levels=1, it behaves all the same as now.
Test Plan:
1) convert most of existing unit tests for universal comapaction to include the option of one level and multiple levels.
2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels
3) add unit test to migrate from multiple level setting back to one level setting
4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results.
Reviewers: rven, kradhakrishnan, yhchiang, igor
Reviewed By: igor
Subscribers: meyering, leveldb, MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D34539
Summary:
Cleaning up log files can do heavy IO, since we call ftruncate() in the destructor. We don't want to call ftruncate() in user threads.
This diff moves cleaning to background threads (flush and compaction)
Test Plan: make check, will also run valgrind
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36177
Summary:
We have addded new stats and perf_context for measuring the merge and filter operation time consumption.
We have bounded all the merge operations within the GUARD statment and collected the total time for these operations in the DB.
Test Plan: WIP
Reviewers: rven, yhchiang, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D34377
Summary:
Improve ThreadStatusSingleCompaction in two ways:
1. Use SYNC_POINT to ensure compaction won't happen
before the test finishes its "Put Phase" instead of
using sleep.
2. In Put Phase, it continues until we have sufficient
number of L0 files. Note that during the put phase,
there won't be any compaction that consumes L0 files
because of item 1.
Test Plan: ./db_test --gtest_filter="*ThreadStatusSingleCompaction*"
Reviewers: sdong, igor, rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35727
Summary: DBTest doesn't clean up wal directory. It might cause failure after a failure test run. Fix it.
Test Plan:
Run unit tests
Try open DB with non-empty db_path/wal.
Reviewers: rven, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35559
Summary: This is a simple change to make db_test::MultiThreadedDBTest as value parameterized test. There is a value of creating a separate set of such tests later.
Test Plan:
```lang=bash
% make db_test
% ./make db_test
```
Also with the following command I can execute all db_test in 2:37.87 on my box
```
% ./db_test --gtest_list_tests | sed 's/\# GetParam.*//' | tr -d ' ' | env time parallel --gnu --eta --joblog=LOG -- 'TEST_TMPDIR=/dev/shm/rocksdb-{} ./db_test --gtest_filter="*{}"'
```
Reviewers: igor, rven, meyering, sdong
Reviewed By: meyering
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35361
Summary: Add a DB property for number of deletions in memtables. It can sometimes help people debug slowness because of too many deletes.
Test Plan: Add test cases.
Reviewers: rven, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D35247
Summary:
Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different.
In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest.
There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides.
```lang=bash
% cat ~/transform
#!/bin/sh
files=$(git ls-files '*test\.cc')
for file in $files
do
if grep -q "rocksdb::test::RunAllTests()" $file
then
if grep -Eq '^class \w+Test {' $file
then
perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file
perl -pi -e 's/^(TEST)/${1}_F/g' $file
fi
perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file
perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file
fi
done
% sh ~/transform
% make format
```
Second iteration of this diff contains only scripted changes.
Third iteration contains manual changes to fix last errors and make it compilable.
Test Plan:
Build and notice no errors.
```lang=bash
% USE_CLANG=1 make check -j55
```
Tests are still testing.
Reviewers: meyering, sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35157
Summary:
gtest does not use exceptions to fail a unit test by design, and `ASSERT*`s are implemented using `return`. As a consequence we cannot use `ASSERT*` in a function that does not return `void` value ([[ https://code.google.com/p/googletest/wiki/AdvancedGuide#Assertion_Placement | 1]]), and have to fix our existing code. This diff does this in a generic way, with no manual changes.
In order to detect all existing `ASSERT*` that are used in functions that doesn't return void value, I change the code to generate compile errors for such cases.
In `util/testharness.h` I defined `EXPECT*` assertions, the same way as `ASSERT*`, and redefined `ASSERT*` to return `void`. Then executed:
```lang=bash
% USE_CLANG=1 make all -j55 -k 2> build.log
% perl -naF: -e 'print "-- -number=".$F[1]." ".$F[0]."\n" if /: error:/' \
build.log | xargs -L 1 perl -spi -e 's/ASSERT/EXPECT/g if $. == $number'
% make format
```
After that I reverted back change to `ASSERT*` in `util/testharness.h`. But preserved introduced `EXPECT*`, which is the same as `ASSERT*`. This will be deleted once switched to gtest.
This diff is independent and contains manual changes only in `util/testharness.h`.
Test Plan:
Make sure all tests are passing.
```lang=bash
% USE_CLANG=1 make check
```
Reviewers: igor, lgalanis, sdong, yufei.zhu, rven, meyering
Reviewed By: meyering
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33333
Summary:
On RocksDB, when there are multiple instances doing
flushes/compactions in the background, the close call takes a long time
because the flushes/compactions need to complete before the database can
shut down. If another instance is using the background threads and the compaction for this instance is in the queue since it has been scheduled, we still cannot shutdown. We now remove the scheduled background tasks which have not yet started running, so that shutdown is speeded up.
Test Plan: DB Test added.
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33741
Summary:
The preshutdown tests check for stopped compactions/flushes.
Removing stalls on the write path.
Test Plan: DBTests.PreShutdown*
Reviewers: yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35037
Summary:
Improve the robustness of ThreadStatusSingleCompaction
by ensuring the number of files flushed in the test.
Test Plan:
export ROCKSDB_TESTS=ThreadStatus
./db_test
Reviewers: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35019
Summary:
Fix the deadlock issue in ThreadStatusSingleCompaction.
In the previous version of ThreadStatusSingleCompaction, the compaction
thread will wait for a SYNC_POINT while its db_mutex is held. However,
if the test hasn't finished its Put cycle while a compaction is running,
a deadlock will happen in the test.
Test Plan:
export ROCKSDB_TESTS=ThreadStatus
./db_test
Reviewers: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35001
Summary: The test depends on snappy to be used. Skip the test if it is not supported.
Test Plan: Run the test
Reviewers: meyering, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D34995
Summary: Currently, we have `ifdef SNAPPY` around bunch of db_test code. Some tests that don't even use compression are also blocked when running system doesn't have snappy. This also causes hard-to-catch bugs, like D34983. We should dynamically figure out if compression is supported or not.
Test Plan: compiles
Reviewers: sdong, meyering
Reviewed By: meyering
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34989
Summary:
The tests using sync_point for intent to shutdown stop
compaction and this results in stalls if too many rows are written. We
now limit the number of rows written to prevent stalls, since the focus
of the test is to cancel background work, which is being correctly
tested. This fixes a Jenkins issue.
Test Plan: DBTest.PreShutdown*
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34893
Summary:
Change the way options.compression_per_level is used when options.level_compaction_dynamic_level_bytes=true so that options.compression_per_level[1] determines compression for the level L0 is merged to, options.compression_per_level[2] to the level after that, etc.
Test Plan: run all tests
Reviewers: rven, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D34431
Summary:
Provide an API which enables users to infor Rocksdb that it is
shutting down.
Test Plan: db_test
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34617
Summary: Got it working by some voodoo programming
Test Plan: works!
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34611
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
Summary:
Summary:
Added a new option to ColumnFamllyOptions - optimize_filters_for_hits. This option can be used in the case where most
accesses to the store are key hits and we dont need to optimize performance for key misses.
This is useful when you have a very large database and most of your lookups succeed. The option allows the store to
not store and use filters in the last level (the largest level which contains data). These filters can take a large amount of
space for large databases (in memory and on-disk). For the last level, these filters are only useful for key misses and not
for key hits. If we are not optimizing for key misses, we can choose to not store these filters for that level.
This option is only provided for BlockBasedTable. We skip the filters when we are compacting
Test Plan:
1. Modified db_test toalso run tests with an additonal option (skip_filters_on_last_level)
2. Added another unit test to db_test which specifically tests that filters are being skipped
Reviewers: rven, igor, sdong
Reviewed By: sdong
Subscribers: lgalanis, yoshinorim, MarkCallaghan, rven, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33717
Summary:
When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes.
Prerequisites: bear and clang 3.5 build with extra tools
```lang=bash
% USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html
% clang-modernize -p . -include . -add-override
% make format
```
Test Plan:
Make sure all tests are passing.
```lang=bash
% #Use default fb code clang.
% make check
```
Verify less error and no missing override errors.
```lang=bash
% # Have trunk clang present in path.
% ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make
```
Reviewers: igor, kradhakrishnan, rven, meyering, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34077
Summary: Add a DB property about live versions. It can be helpful to figure out whether there are files not live but not yet deleted, in some use cases.
Test Plan: make all check
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33327
Summary:
This is a diff for managed iterator. A managed iterator
is a wrapper around an iterator which saves the options for that
iterator as well as the current key/value so that the underlying iterator
and its associated memory can be released when it is aged out
automatically or on the request of the user. Will provide the automatic release as a follow-up diff.
Test Plan: Managed* tests in db_test and XF tests for managed iterator
Reviewers: igor, yhchiang, anthony, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31401
Summary:
Remove ThreadStatusMultiCompaction test as it's currently written
in a way that depends on some randomness, while the flush / compaction
status of a single thread is also covered in ThreadStatusFlush
and ThreadStatusSingleCompaction tests.
Test Plan: ./db_test
Reviewers: igor, sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33537
Summary: Allow GetThreadList to reflect flush activity.
Test Plan:
Developed ThreadStatusFlush test and updated ThreadStatusMultiCompaction test.
./db_test ./thread_list_test
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D32871
Summary:
It would be good to assing background job their IDs. Two benefits:
1) makes LOGs more readable
2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize
Test Plan: ran rocksdb, read the LOG
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31617
Summary: DBTest.DestroyDBMetaDatabase occasionally fails on my dev host, for file not existing. Always create directories to avoid that.
Test Plan: Run the test
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33321
Summary: Remember whole key or prefix filtering on/off in SST files. If user opens the DB with a different setting that cannot be satisfied while reading the SST file, ignore the bloom filter.
Test Plan: Add a unit test for it
Reviewers: yhchiang, igor, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D32889
Summary: For description of the bug, see comment in db_test. The fix is pretty straight forward.
Test Plan: added unit test. eventually we need better testing of FOF/POF process.
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33081
Summary:
- In statistics.h , added tickers.
- In version_set.cc,
-- Added a getter method for hit_file_level_ in the class FilePicker
-- Added a line in the Get() method in case of a found, increment the corresponding counters based on the level of the file respectively.
Corresponding task: https://our.intern.facebook.com/intern/tasks/?s=506100481&t=5952818
Personal fork: 0c3f2e3600
Test Plan:
In terminal,
```
make -j32 db_test
ROCKSDB_TESTS=L0L1L2AndUpHitCounter ./db_test
```
Or to use debugger,
```
make -j32 db_test
export ROCKSDB_TESTS=L0L1L2AndUpHitCounter
gdb db_test
```
Reviewers: rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D32205
Summary:
This diff basically reverts D30249 and also adds a unit test that was failing before this patch.
I have no idea how I didn't catch this terrible bug when writing a diff, sorry about that :(
I think we should redesign our system of keeping track of and deleting files. This is already a second bug in this critical piece of code. I'll think of few ideas.
BTW this diff is also a regression when running lots of column families. I plan to revisit this separately.
Test Plan: added a unit test
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33045
Summary:
When DestroyDB() finds a wal file in the DB directory, it assumes it is actually in WAL directory. This can lead to confusion, since it reports IO error when it tries to delete wal file from DB directory. For example: https://ci-builds.fb.com/job/rocksdb_clang_build/296/console
This change will fix our unit tests.
Test Plan: unit tests work
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32907
Summary:
Add a counter for collecting the wait time on db mutex.
Also add MutexWrapper and CondVarWrapper for measuring wait time.
Test Plan:
./db_test
export ROCKSDB_TESTS=MutexWaitStats
./db_test
verify stats output using db_bench
make clean
make release
./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10
Sample output:
rocksdb.db.mutex.wait.micros COUNT : 7546866
Reviewers: MarkCallaghan, rven, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32787
Summary: db_test's test class SpecialEnv has a thread unsafe variable rnd_ but it can be accessed by multiple threads. It is complained by TSAN. Protect it by a mutex.
Test Plan: Run the test
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D32895
Summary:
Get() now doesn't make use of bloom filter if it is prefix based. Add the check.
Didn't touch block based bloom filter. I can't fully reason whether it is correct to do that. But it's straight-forward to for full bloom filter.
Test Plan:
make all check
Add a test case in DBTest
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D31941
Summary:
This diff containes the changes to the code and db_test
for supporting cross functional tests for inplace_update
Test Plan: Run XF with inplace_test and also without
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32367
Summary:
This Diff provides the implementation of the cross functional
test infrastructure. This provides the ability to test a single feature
with every existing regression test in order to identify issues with
interoperability between features.
Test Plan:
Reference implementation of inplace update support cross
functional test. Able to find interoperability issues with inplace
support and ran all of db_test. Will add separate diff for those changes.
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32247
Summary: Add CappedFixTransform, which is the same as fixed length prefix extractor, except that when slice is shorter than the fixed length, it will use the full key.
Test Plan:
Add a test case for
db_test
options_test
and a new test
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D31887
Summary: DBTest.SharedWriteBuffer uses an Options that doesn't pass CurrentOptions(), so that it doesn't use MockEnv. However, DBTest's constructor uses MockEnv to call DestoryDB() to clean up, causing uncleaned state before it runs.
Test Plan: Run the test modified to make sure they pass default Env and SharedWriteBuffer now passes MockEnv.
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D32475
Summary:
There's a bug in TSAN (or libstdc++?) with std::shared_ptr<> for some reason. In db_test, only FlushSchedule is affected.
See more: https://groups.google.com/forum/#!topic/thread-sanitizer/vz_s-t226Vg
With this change and all other @sdong's and mine diffs, our db_test should be TSAN-clean. I'll move to other tests.
Test Plan: no more flush schedule when running TSAN
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32469
Summary: Now we don't sync manifest file when initializing it, so DB cannot be safely reopened before the first mem table flush. Fix it by syncing it. This fixes fault_injection_test.
Test Plan: make all check
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32001
Summary:
This diff adds BlockBasedTable format_version = 2. New format version brings better compressed block format for these compressions:
1) Zlib -- encode decompressed size in compressed block header
2) BZip2 -- encode decompressed size in compressed block header
3) LZ4 and LZ4HC -- instead of doing memcpy of size_t encode size as varint32. memcpy is very bad because the DB is not portable accross big/little endian machines or even platforms where size_t might be 8 or 4 bytes.
It does not affect format for snappy.
If you write a new database with format_version = 2, it will not be readable by RocksDB versions before 3.10. DB::Open() will return corruption in that case.
Test Plan:
Added a new test in db_test.
I will also run db_bench and verify VSIZE when block_cache == 1GB
Reviewers: yhchiang, rven, MarkCallaghan, dhruba, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31461
Summary:
Add structures for exposing events and operations. Event describes
high-level action about a thread such as doing compaciton or
doing flush, while an operation describes lower-level action
of a thread such as reading / writing a SST table, waiting for
mutex. Events and operations are designed to be independent.
One thread would typically involve in one event and one operation.
Code instrument will be in a separate diff.
Test Plan:
Add unit-tests in thread_list_test
make dbg -j32
./thread_list_test
export ROCKSDB_TESTS=ThreadList
./db_test
Reviewers: ljin, igor, sdong
Reviewed By: sdong
Subscribers: rven, jonahcohen, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29781
Dike out the body of VerifyCompactionResult. With assert() compiled out, the
loop index variable in the inner loop was unused, breaking the build when
-Werror is enabled.
Summary:
GetThreadList() feature depends on the thread creation and destruction, which is currently handled under Env.
This patch moves GetThreadList() feature under Env to better manage the dependency of GetThreadList() feature
on thread creation and destruction.
Renamed ThreadStatusImpl to ThreadStatusUpdater. Add ThreadStatusUtil, which is a static class contains
utility functions for ThreadStatusUpdater.
Test Plan: run db_test, thread_list_test and db_bench and verify the life cycle of Env and ThreadStatusUpdater is properly managed.
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: ljin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30057
Summary:
There are two versions of FindObsoleteFiles():
* full scan, which is executed every 6 hours (and it's terribly slow)
* no full scan, which is executed every time a background process finishes and iterator is deleted
This diff is optimizing the second case (no full scan). Here's what we do before the diff:
* Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live.
* Get the list of live files to avoid deleting files that are live.
* Delete files that are in obsolete_files and not in live_files.
After this diff:
* The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files.
* Get the list of obsolete files (which exclude moved files).
* No need to get the list of live files, since all files in obsolete_files need to be deleted.
I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123
This depends on D30123.
P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff.
Test Plan:
One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version.
make check
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30249
Summary:
Allow Level-style compaction to place files in different paths
This diff provides the code for task 4854591. We now support level-compaction
to place files in different paths by specifying them in db_paths along with
the minimum level for files to store in that path.
Test Plan: ManualLevelCompactionOutputPathId in db_test.cc
Reviewers: yhchiang, MarkCallaghan, dhruba, yoshinorim, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29799
Summary: Now DB::GetSnapshot() doesn't scale to more column families, as it needs to go through all the column families to find whether snapshot is supported. This patch optimizes it.
Test Plan:
Add unit tests to cover negative cases.
make all check
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30093
Summary:
Add a counter in SnapshotList to show number of snapshots. Also a unix timestamp in every snapshot.
Add two DB Properties to return number of snapshots and timestamp of the oldest one.
Test Plan: Add unit test checking
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D29919
Summary:
Introduces a new class for managing write buffer memory across column
families. We supplement ColumnFamilyOptions::write_buffer_size with
ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
instance that enforces memory limits before flushing out to disk.
Test Plan: Added SharedWriteBuffer unit test to db_test.cc
Reviewers: sdong, rven, ljin, igor
Reviewed By: igor
Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim
Differential Revision: https://reviews.facebook.net/D22581
Summary:
In some environment such as android, the c++ library does not have
std::to_string. This path adds rocksdb::ToString(), which wraps std::to_string
when std::to_string is not available, and implements std::to_string
in the other case.
Test Plan:
make dbg -j32
./db_test
make clean
make dbg OPT=-DOS_ANDROID -j32
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29181
Summary: Free checkpoint after its directory is removed.
Test Plan: Run valgrind with GetSnapshotLink.
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29493
Summary:
An entry of ConstantColumnFamilyInfo is created when:
1. DB::Open
2. CreateColumnFamily.
However, there are cases that DB::Open could also call CreateColumnFamily
when create_missing_column_families=true. As a result, it will create
duplicate ConstantColumnFamilyInfo and one of them would be leaked.
Test Plan: ./deletefile_test
Reviewers: igor, sdong, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29307
Summary:
Add enable_thread_tracking to DBOptions to allow
tracking thread status related to the DB. Default is off.
Test Plan:
export ROCKSDB_TESTS=ThreadList
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29289
Summary:
Add GetThreadList API, which allows developer to track the
status of each process. Currently, calling GetThreadList will
only get the list of background threads in RocksDB with their
thread-id and thread-type (priority) set. Will add more support
on this in the later diffs.
ThreadStatus currently has the following properties:
// An unique ID for the thread.
const uint64_t thread_id;
// The type of the thread, it could be ROCKSDB_HIGH_PRIORITY,
// ROCKSDB_LOW_PRIORITY, and USER_THREAD
const ThreadType thread_type;
// The name of the DB instance where the thread is currently
// involved with. It would be set to empty string if the thread
// does not involve in any DB operation.
const std::string db_name;
// The name of the column family where the thread is currently
// It would be set to empty string if the thread does not involve
// in any column family.
const std::string cf_name;
// The event that the current thread is involved.
// It would be set to empty string if the information about event
// is not currently available.
Test Plan:
./thread_list_test
export ROCKSDB_TESTS=GetThreadList
./db_test
Reviewers: rven, igor, sdong, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D25047
Summary: This way we can gurantee that old MemTables get destructed before DBImpl gets destructed, which might be useful if we want to make them depend on state from DBImpl.
Test Plan: make check with asserts in JobContext's destructor
Reviewers: ljin, sdong, yhchiang, rven, jonahcohen
Reviewed By: jonahcohen
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28959
Summary: Store links to live files in directory on same disk
Test Plan:
Take snapshot and open it. Added a test GetSnapshotLink in
db_test.
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28713
Summary: Add a unit test in db_test to verify the behavior when both of merge operator and compaction filter apply to a key when merging.
Test Plan: Run the new test
Reviewers: ljin, yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28455
Summary:
We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details.
This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables.
Test Plan: compiles
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: yhchiang
Subscribers: bobbaldwin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28689
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
Summary:
Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it.
Still have to write some comments, but should be good enough to start the discussion.
Test Plan: make check, will also run stress test
Reviewers: ljin, sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28353
Summary:
Make PartialCompactionFailure Test more robust again by
blocking background compaction until we simulate the
file creation error.
Test Plan:
export ROCKSDB_TESTS=PartialCompactionFailure
./db_test
Reviewers: sdong, igor, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28431
Summary:
TEST_WaitForFlush should wait until it sees error when parameter is set
to true so we don't need to loop and timeout
Test Plan: ROCKSDB_TESTS=DropWritesFlush ./db_test
Reviewers: sdong, igor
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28419
Summary: Make PartialCompactionFailure Test more robust.
Test Plan:
export ROCKSDB_TESTS=PartialCompactionFailure
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28425
Summary: It turns out that -Wshadow has different rules for gcc than clang. Previous commit fixed clang. This commits fixes the rest of the warnings for gcc.
Test Plan: compiles
Reviewers: ljin, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28131
Summary: In order to avoid random failure of DBTest.GroupCommitTest, artificially sleep 100 microseconds in each log writing.
Test Plan: Run the test in a machine where valgrind version of the test always fails multiple times and see it always succeed.
Reviewers: igor, yhchiang, rven, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D28401
Summary: This patch makes it a contract that if an empty write batch is passed to DB::Write() and WriteOptions.sync = true, fsync is called to WAL.
Test Plan: A new unit test
Reviewers: ljin, rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D28365
Summary: as title
Test Plan: ./db_test
Reviewers: sdong, yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28269
Summary: Replace some ASSERT_TRUE() to ASSERT_GT() and ASSERT_LT() so that in case the assert is triggered, the value is printed out.
Test Plan: Run the two tests
Reviewers: ljin, rven, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D28293
Summary:
Now DBTest.DynamicMemtableOptions sets background compaction to be 4, without actually increasing thread pool size (even before the feature of automatic increasing it). To make sure the behavior stays the same after the automatic thread pool increasing, set it back to 1.
Hopefully it can fix the occasional failure of the test.
Test Plan: Run the test
Reviewers: igor, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D28281