Summary:
This patch fixes https://github.com/facebook/mysql-5.6/issues/121
There is a recent change in rocksdb to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the Column family is stalled because L0 is full, and no writes can go through, it is possible the column family may never have a new compaction request get scheduled. EnableAutoCompaction() should probably schedule an new flush and compaction event when it resets disable_auto_compaction.
Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush,
SchedulePendingCompaction, as well as MaybeScheduleFlushOrcompaction on all the
column families to avoid the situation above.
This is still a first pass for feedback.
Could also just call SchedePendingFlush and SchedulePendingCompaction directly.
Test Plan:
Run on Asan build
cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000
Ensure that it no longer hangs during the test.
Reviewers: hermanlee4, yhchiang, anthony
Reviewed By: anthony
Subscribers: leveldb, yhchiang, dhruba
Differential Revision: https://reviews.facebook.net/D51747
Summary: Stopped using std::timed_mutex as it has known issues in older versiong of gcc. Ran into these problems when testing MongoRocks.
Test Plan: unit tests. Manual mongo testing on gcc 4.8.
Reviewers: igor, yhchiang, rven, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52197
Summary:
I added this line in my previous patch D48999 (which is incorrect)
We should not release the iterator since releasing it will evict the blocks from cache
Test Plan:
Run the test under valgrind
make check
Reviewers: rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D52161
Summary: Fix a bug that options.soft_pending_compaction_bytes_limit is not actually set with --soft_pending_compaction_bytes_limit
Test Plan: Run db_bench with this parameter and make sure the parameter is set correctly.
Reviewers: anthony, kradhakrishnan, yhchiang, IslamAbdelRahman, igor, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52125
Summary:
When there are waiting manual compactions, we need to signal
them after removing the current manual compaction from the deque.
Test Plan: ColumnFamilytTest.SameCFManualManualCommaction
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D52119
Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one.
Test Plan:
1. Add a new unit test.
2. Run db_bench with
./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics
based on hard drive and see stopping is avoided with the commit.
Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52047
Summary:
Documenting the unschedFunction parameter to Schedule as
requested by Michael Kolupaev.
Test Plan: build, unit test
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: kolmike, dhruba
Differential Revision: https://reviews.facebook.net/D52089
Summary: Now ZSTD hard code level 1. Change it to use the compression level setting.
Test Plan: Run it with hacked codes of sst_dump and show ZSTD compression sizes with different levels.
Reviewers: rven, anthony, yhchiang, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52041
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted
ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted
Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.
Benchmark results (using https://phabricator.fb.com/P20083553)
```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G /home/tec/local/normal.4K.Snappy/db10077
// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G /home/tec/local/zero.8K.LZ4/db10077
// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
// --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
// --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"
// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 1.73s 576.97m
// BM_StringPiece 103.74% 1.67s 598.55m
// ============================================================================
// Match rate : 1000000 / 1000000
// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 611.99ms 1.63
// BM_StringPiece 203.76% 300.35ms 3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```
Test Plan: Unit tests
Reviewers: sdong, igor, anthony, yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, lovro, adsharma
Differential Revision: https://reviews.facebook.net/D48999
Summary: For some reason `make librocksdb.a` is not valid anymore. Replace with `make static_lib`
Test Plan: cd examples/; make all;
Reviewers: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52017
Summary:
List of changes:
1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size.
2) Remove unnecessary checks before calling delete operator.
3) Increase code correctness by using size_t type when getting vector's size.
4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage.
5) Fix various lint errors pointed out by 'arc lint'.
Test Plan:
Code review and build:
git diff
make clean
make -j 32 commit-prereq
arc lint
Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51849
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
Summary: Add support to change write options after creating a transaction. This is needed for MongoRocks.
Test Plan: added test
Reviewers: sdong, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51867
Summary:
Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
Test Plan: unit tests, db bench
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb, yoshinorim
Differential Revision: https://reviews.facebook.net/D50475
Summary: Improving the parser string to make better error report. Currently the
error report fails to capture the assert details. This fix addresses the issue.
Test Plan: None
Reviewers:
CC: leveldb@
Task ID: #6968635
Blame Rev:
Summary:
By default, RocksDB initializes the singletons of ThreadLocalPtr first, then initializes PosixEnv
via static initializer. Destructor terminates objects in reverse order, so terminating PosixEnv
(calling pthread_mutex_lock), then ThreadLocal (calling pthread_mutex_destroy).
However, in certain case, application might initialize PosixEnv first, then ThreadLocalPtr.
This will cause core dump at the end of the program (eg. https://github.com/facebook/mysql-5.6/issues/122)
This patch fix this issue by ensuring the destruction order by moving the global static singletons
to function static singletons. Since function static singletons are initialized when the function is first
called, this property allows us invoke to enforce the construction of the static PosixEnv and the
singletons of ThreadLocalPtr by calling the function where the ThreadLocalPtr singletons belongs
right before we initialize the static PosixEnv.
Test Plan: Verified in the MyRocks.
Reviewers: yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony, sdong, MarkCallaghan
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51789
This is an Env implementation that mirrors all storage-related methods on
two different backend Env's and verifies that they return the same
results (return status and read results). This is useful for implementing
a new Env and verifying its correctness.
Signed-off-by: Sage Weil <sage@redhat.com>
Summary: Correct a comment in include/rocksdb/cache.h
Test Plan: No code change.
Reviewers: igor, sdong, IslamAbdelRahman, rven, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51831
Summary: For Transactions, we want to start using the SST files to do write conflict checking. To do this, we need to make sure that compaction never removes all writes if an earlier snapshot exists. So I had to change the way we process SingleDeletes to sometimes leave a SingleDelete behind when we encounter a Put followed by a SingleDelete. See the comments in this diff for a more detailed explanation.
Test Plan: added more unit tests
Reviewers: rven, igor, kradhakrishnan, IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D50295
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.
Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51117
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.
Test Plan: Add a unit test and run it in valgrind too.
Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51459
Summary: Now in benchmark "uncompress" in db_bench, we get size from compressed stream for all other compression types except Snappy, where we allocate memory based on parameter. Change it to match to behavior of other compression types.
Test Plan: Run ./db_bench --benchmarks=uncompress with snappy and other compression types.
Reviewers: yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51681