Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one.
Test Plan:
1. Add a new unit test.
2. Run db_bench with
./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics
based on hard drive and see stopping is avoided with the commit.
Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52047
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted
ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted
Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.
Benchmark results (using https://phabricator.fb.com/P20083553)
```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G /home/tec/local/normal.4K.Snappy/db10077
// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G /home/tec/local/zero.8K.LZ4/db10077
// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
// --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
// --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"
// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 1.73s 576.97m
// BM_StringPiece 103.74% 1.67s 598.55m
// ============================================================================
// Match rate : 1000000 / 1000000
// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 611.99ms 1.63
// BM_StringPiece 203.76% 300.35ms 3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```
Test Plan: Unit tests
Reviewers: sdong, igor, anthony, yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, lovro, adsharma
Differential Revision: https://reviews.facebook.net/D48999
Summary:
List of changes:
1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size.
2) Remove unnecessary checks before calling delete operator.
3) Increase code correctness by using size_t type when getting vector's size.
4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage.
5) Fix various lint errors pointed out by 'arc lint'.
Test Plan:
Code review and build:
git diff
make clean
make -j 32 commit-prereq
arc lint
Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51849
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.
Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51117
Summary: DBTest.DynamicCompactionOptions ocasionally fails during valgrind run. We sent a sleeping task to block compaction thread pool but we don't wait it to run.
Test Plan: Run the test multiple times in an environment which can cause failure.
Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51687
Summary:
Fixes T8781168.
Added a new function EnableAutoCompactions in db.h to be publicly
avialable. This allows compaction to be re-enabled after disabling it via
SetOptions
Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open
Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to
prevent race condition.
Test Plan:
Ran make all check
verified fix on myrocks side:
was able to reproduce the seg fault with
../tools/mysqltest.sh --mem --force rocksdb.drop_table
method was to manually sleep the thread after DB::Open but before TransactionDB ptr was
assigned in transaction_db_impl.cc:
DB::Open(db_options, dbname, column_families_copy, handles, &db);
clock_t goal = (60000 * 10) + clock();
while (goal > clock());
...dbptr(aka rdb) gets assigned below
verified my changes fixed the issue.
Also added unit test 'ToggleAutoCompaction' in transaction_test.cc
Reviewers: hermanlee4, anthony
Reviewed By: anthony
Subscribers: alex, dhruba
Differential Revision: https://reviews.facebook.net/D51147
Summary: Verifiction condition of DBTest.SuggestCompactRangeTest is too strict. Based on key distribution, we might have more small files in last level. Not check number of files in the last level.
Test Plan: Run DBTest.SuggestCompactRangeTest with both of jemalloc on and off.
Reviewers: rven, IslamAbdelRahman, yhchiang, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51501
Summary: DBTest.DynamicCompactionOptions sometimes fails the assert but I can't repro it locally. Make it more deterministic and readable and see whether the problem is still there.
Test Plan: Run tht test and make sure it passes
Reviewers: kradhakrishnan, yhchiang, igor, rven, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51309
Summary: DBTest.SuggestCompactRangeTest fails for the case when jemalloc is disabled, including ASAN and valgrind builds. It is caused by the improvement of skip list, which allocates different size of nodes for a new records. Fix it by using a special mem table that triggers a flush by number of entries. In that way the behavior will be consistent for all allocators.
Test Plan: Run the test with both of DISABLE_JEMALLOC=1 and 0
Reviewers: anthony, rven, yhchiang, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51423
Summary: When option.db_write_buffer_size is hit, we currently flush all column families. Move to flush the column family with the largest active memt table instead. In this way, we can avoid too many small files in some cases.
Test Plan: Modify test DBTest.SharedWriteBuffer to work with the updated behavior
Reviewers: kradhakrishnan, yhchiang, rven, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: march, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51291
Summary: Now DBIter::Next() always compares with current key with itself first, which is unnecessary if the last key is not a merge key. I made the change and didn't see db_iter_test fails. Want to hear whether people have any idea what I miss.
Test Plan: Run all unit tests
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48279
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node. This allows us to remove the pointer from the node
to the key. As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.
For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList. This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index). Taking advantage of inline allocation for
those cases is left to future work.
The perf win is biggest for small values. For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51123
* conversion from 'size_t' to 'type', by add static_cast
Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
Summary: DBTest.DynamicLevelCompressionPerLevel2 sometimes fails during valgrind runs. This causes our valgrind tests to fail. Not sure what the best fix is for this test, but hopefully this simple change is sufficient.
Test Plan: run test
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51111
Summary: DBTest.MergeTestTime is a test verifying timing counters. Depending on real time may cause non-determinstic results. Change to fake time to be determinsitic.
Test Plan: Run the test and make sure it passes
Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D50883
Summary: Timing counters' upper bounds depend on platform. It frequently fails in valgrind runs. Relax the upper bound.
Test Plan: Run the same valgrind test and make sure it passes.
Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D50829
Summary:
This patch adds GetAggregatedIntProperty() that returns the aggregated
value from all CFs
Test Plan: Added a test in db_test
Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: rven, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D49497
Summary:
Update DB::AddFile() restrictions to be
- Key range in loaded table file don't overlap with existing keys or tombstones in DB.
- No other writes happen during AddFile call.
The updated AddFile() will verify that the file key range don't overlap with any keys or tombstones in the DB, and then add the file to L0
Test Plan: unit tests
Reviewers: igor, rven, anthony, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: adsharma, ameyag, dhruba
Differential Revision: https://reviews.facebook.net/D49233
Summary: Run "make format" for some recent commits.
Test Plan: Build and run tests
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49707
Summary: As above.
Test Plan: USE_CLANG=1 make check -j
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48981
When we recycle log files, we need to mix the log number into the CRC
for each record. Note that for logs that don't get recycled (like the
manifest), we always pass a log_number of 0 and false.
Signed-off-by: Sage Weil <sage@redhat.com>
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.
Test Plan: ran the GetPropertiesOfTablesInRange
Reviewers: rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48651
Summary:
Add rocksdb.num-running-compactions and rocksdb.num-running-flushes
to GetIntProperty() that reports the number of currently running
compactions / flushes.
Test Plan: augmented existing tests in db_test
Reviewers: igor, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48693
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48915
Summary:
This diff exclude alot of tests in db_test that are not compiling / failing under ROCKSD_LITE
Test Plan:
OPT=-DROCKSDB_LITE make check -j64
make check -j64
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48771
Summary:
Two fixes:
1. Wait compaction after generating each L0 file so that we are sure there are one L0 file left.
2. https://reviews.facebook.net/D48423 increased from 500 keys to 700 keys but in verification phase we are still querying the first 500 keys. It is a bug to fix.
Test Plan: Run the test in the same environment that fails by chance of one in tens of times. It doesn't fail after 1000 times.
Reviewers: yhchiang, IslamAbdelRahman, igor, rven, kradhakrishnan
Reviewed By: rven, kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48759
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
As part of tech debt week, we are cleaning up dependencies.
This diff moves db_test_util.[h,cc] from util to db directory.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48543
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.
Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.
Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48411
Summary: Now based on environment, DBTest.AggregatedTableProperties has a possibility of issuing a L0->L1 compaction after reopening and the results are not what we expected. We tune the L0 compaction trigger to make it less likely to happen.
Test Plan: I can't repro the failure but I think the change is better. Just run the test and make sure it passes.
Reviewers: kradhakrishnan, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48423
Summary: With this commit, we add a new format in manifest when adding a new file. Now path ID and need-compaction hint are first two customized fields.
Test Plan: Add a test case in version_edit_test to verify the encoding and decoding logic. Add a unit test in db_test to verify need compaction is persistent after DB restarting.
Reviewers: kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, rven, igor
Reviewed By: igor
Subscribers: javigon, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48123
Summary:
hit and miss bloom filter stats for memtable and SST
stats added to perf_context struct
key matches and prefix matches combined into one stat
Test Plan: unit test veryfing the functionality added, see BloomStatsTest in db_test.cc for details
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47859
Summary:
Handle SST files with both ".sst" and ".ldb" suffix.
This enables user to migrate from leveldb to rocksdb.
Test Plan:
Added unit test with DB operating on SSTs with names schema.
See db/dc_test.cc:SSTsWithLdbSuffixHandling for details
Reviewers: yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48003