Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)
The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.
Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.
Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345
Differential Revision: D5210732
Pulled By: maysamyabandeh
fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
Summary:
CreateColumnFamily() releases DB mutex after adding column family to the set and install super version (to write option file), so if users call GetAggregatedIntProperty() in the middle, then super version will be null and the process will crash. Fix it by skipping those column families without super version installed.
Maybe we should also fix the problem of releasing the lock when reading option file, but it is more risky. so I'm doing a quick and safer fix and we can investigate it later.
Closes https://github.com/facebook/rocksdb/pull/2475
Differential Revision: D5298053
Pulled By: siying
fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
Summary:
stop calling Close() at the end of tests holding a compaction pressure token since it causes the write controller to be deleted while it's still needed. these calls were pointless anyways since Close() is already called in the test's destructor.
Closes https://github.com/facebook/rocksdb/pull/2367
Differential Revision: D5125906
Pulled By: ajkr
fbshipit-source-id: 6cad8673e5546a82ff602ac0ba59cc3f68dbde46
Summary:
- `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
- `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
- `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions.
- The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
Closes https://github.com/facebook/rocksdb/pull/2205
Differential Revision: D4937461
Pulled By: ajkr
fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
Summary:
TSAN shows warning of data race of EnvCounter::num_new_writable_file_. Make it atomic.
Closes https://github.com/facebook/rocksdb/pull/2331
Differential Revision: D5089215
Pulled By: siying
fbshipit-source-id: 15f6dcfb770a3310cbb6337c22482c8b330daffc
Summary:
Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time.
The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO.
Closes https://github.com/facebook/rocksdb/pull/2248
Differential Revision: D5001578
Pulled By: yiwu-arbug
fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
Level-based L0->L0 compaction operates on spans of files that aren't currently being compacted. It reduces the number of L0 files, thus making write stall conditions harder to reach.
- L0->L0 is triggered when base level is unavailable due to pending compactions
- L0->L0 always outputs one file of at most `max_level0_burst_file_size` bytes.
- Subcompactions are disabled for L0->L0 since we want to output one file.
- Input files are chosen as the longest span of available files that will fit within the size limit. This minimizes number of files in L0.
Closes https://github.com/facebook/rocksdb/pull/2027
Differential Revision: D4760318
Pulled By: ajkr
fbshipit-source-id: 9d07183
Summary:
Add two DB properties: rocksdb.actual_delayed_write_rate and rocksdb.is_write_stooped, for people to know whether current writes are being throttled.
Closes https://github.com/facebook/rocksdb/pull/2043
Differential Revision: D4782975
Pulled By: siying
fbshipit-source-id: 6b2f5cf
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859
Differential Revision: D4541292
Pulled By: sagar0
fbshipit-source-id: 5b3a6ca
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823
Differential Revision: D4492902
Pulled By: siying
fbshipit-source-id: c74cb11
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.
Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562
Differential Revision: D4218519
Pulled By: siying
fbshipit-source-id: 95e4088
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.
Test Plan: make all check -j64
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65121
Summary: Some tests in column_family_test depend on functions that are not available in LITE build, which sometimes cause flakiness. Disable them.
Test Plan: Run those tests in LITE build.
Reviewers: yiwu, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65271
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63945
Summary: In ColumnFamilyTest.FlushCloseWALFiles, there is a small window in which the flush has finished but the log writer is not yet closed, causing the assert failure. Fix it by explicitly waiting the flush job to finish.
Test Plan: Run the test many times in high parallelism.
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63423
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.
Test Plan: I am adding unit tests.
Reviewers: andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61503
Summary:
My understanding is that the purpose of write stall triggers are to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes.
Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions are recalculate only when write stall options are changed.
Test Plan: See the new test. Removed two tests that are no longer valid.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61437
Summary: MyRocks is adding support for the user of the SstFileWriter which needs a comparator. It would be more convenient to get the comparator from the column family (which already has to have it) than to have caller keep track of it.
Test Plan: Standard tests (adding one for the new method)
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61155
Summary:
Stale log files can be deleted out of order. This can happen for various reasons. One of the reason is that no data is ever inserted to a column family and we have an optimization to update its log number, but not all the old log files are cleaned up (the case shown in the unit tests added). It can also happen when we simply delete multiple log files out of order.
This causes data corruption because we simply increase seqID after processing the next row and we may end up with writing data with smaller seqID than what is already flushed to memtables.
In DB recovery, for the oldest files we are replaying, if there it contains no data for any column family, we ignore the sequence IDs in the file.
Test Plan: Add two unit tests that fail without the fix.
Reviewers: IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: hermanlee4, yoshinorim, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60891
Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost.
Test Plan: Add a new unit test which will fail without the diff.
Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: yiwu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60915
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60765
Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again.
Test Plan: Run the unit test and see the logging.
Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60081
Summary:
Add a test to detect that when WAL gets truncated,
seq no's are checked to be contiguous.
This test is put in ColumnFamilyTest as it has the necessary
infrastructure/functions for flushing column families, which
we use to ensure 2 active WAL files
Test Plan:
This is a test, no feature has been added.
This test fails today and hence disabled
Reviewers: sdong
Reviewed By: sdong
Subscribers: lgalanis, dhruba, andrewkr, pritamdamania
Differential Revision: https://reviews.facebook.net/D59253
Summary:
Changing several option defaults:
options.max_open_files changes from 5000 to -1
options.base_background_compactions changes from max_background_compactions to 1
options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords
options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize
Test Plan: Write unit tests to see OldDefaults() works as expected.
Reviewers: IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56427
Summary: Add a test case in ColumnFamilyTest.ReadDroppedColumnFamily to make sure existing iterator is not impacted by column family dropping.
Test Plan: N/A
Reviewers: igor, yhchiang, anthony, andrewkr, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54657
Summary: DBImpl::SyncWAL() releases db mutex before calling DBImpl::MarkLogsSynced(), while inside DBImpl::MarkLogsSynced() we assert there is none or one outstanding log file. However, a memtable switch can happen in between and causing two or outstanding logs there, failing the assert. The diff adds a unit test that repros the issue and fix the assert so that the unit test passes.
Test Plan: Run the new tests.
Reviewers: anthony, kolmike, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54621
Summary:
Fixed the asan error on column_family_test caused by not disabling
SyncPoint.
Test Plan: column_family_test
Reviewers: anthony, rven, kradhakrishnan, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53505
Summary:
If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions.
The watermarks are calculated based on slowdown thresholds.
Test Plan:
Add new test cases in column_family_test.
Adding more unit tests.
Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D53409
Summary:
This patch addes ColumnFamilyHandle::GetDescriptor(), which allows
developers to obtain the CF options and names of the associated column
family given its handle.
// Returns the up-to-date descriptor used by the current handle. Since it
// returns the up-to-date information, this call might internally locks
// and releases DB mutex to access the up-to-date CF options.
virtual ColumnFamilyDescriptor GetDescriptor() = 0;
Test Plan: augment column_family_test
Reviewers: sdong, yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51543
Summary:
myrocks seems to build rocksdb using
-Wmissing-field-initializers (and treats warnings as errors). This diff
adds that flag to the rocksdb build, and fixes the compilation failures
that result. I have not checked for any other differences in the build
flags for rocksdb build as part of myrocks.
Test Plan: make check
Reviewers: sdong, rven
Reviewed By: rven
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D52443
Summary: ColumnFamilyTest.WriteStallSingleColumnFamily and ColumnFamilyTest.WriteStallTwoColumnFamilies didn't clean up test state cleanly, causing memory leak. Fix it.
Test Plan: Run the two tests in valgrind and make sure they now pass.
Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52347
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and make sure without the commit, write stop will happen, but with the commit, it will not happen.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
Summary:
When there are waiting manual compactions, we need to signal
them after removing the current manual compaction from the deque.
Test Plan: ColumnFamilytTest.SameCFManualManualCommaction
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D52119
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
Summary:
This patch fix a race condition in persisting options which will cause a crash when:
* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.
Test Plan: Add a test in column_family_test that can reproduce the crash
Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51717
Summary:
This patch fix a race condition in persisting options which will cause a crash when:
* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.
Test Plan: Add a test in column_family_test that can reproduce the crash
Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51609
Summary: After the skip list optimization, ColumnFamilyTest.DifferentWriteBufferSizes can occasionally fail with flush triggering of column family 3. Insert more data to it to make sure flush will trigger.
Test Plan: Run it multiple times with both of jemaloc on and off and see it always passes. (Without thd commit the run with jemalloc fails with chance of about one in two)
Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51645