Test Plan: Compile. Run it.
Reviewers: yhchiang, dhruba, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D31479
Summary:
In this diff I add another parameter to BlockBasedTableOptions that will let users specify block based table's format. This will greatly simplify block based table's format changes in the future.
First format change that this will support is encoding decompressed size in Zlib and BZip2 blocks. This diff is blocking https://reviews.facebook.net/D31311.
Test Plan: Added a unit tests. More tests to come as part of https://reviews.facebook.net/D31311.
Reviewers: dhruba, MarkCallaghan, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31383
Summary:
Previous to this commit ColumnFamilyDescriptor took a String as name for the ColumnFamily name. String is however encoding dependent which is bad because listColumnFamilies returns byte arrays without any encoding information.
All public API call were deprecated and flagged to be removed in 3.10.0
Test Plan:
make rocksdbjava
make test
mvn -f rocksjni.pom package
Reviewers: yhchiang, adamretter, ankgup87
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D30525
Summary: We keep checksum functions in util/, there is no reason for compression to be in port/
Test Plan: compiles
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31281
* Use emplace when possible.
* Make FileInfo shared among all BackupMeta, instead of storing filenames.
* Make checksum_value in FileInfo constant.
* Reserve space beforehand if container size is known.
* Make FileInfo and BackupMeta non-copyable and non-assignable to prevent future logic errors.
It is very dangerous to copy BackupMeta without careful handling refcounts of FileInfo.
* Remove a copy of BackupMeta when detected corrupt backup.
Valgrind report prior to this fix:
==20829== Memcheck, a memory error detector
==20829== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==20829== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==20829== Command: ./c_simple_example
==20829==
==20829== Invalid read of size 1
==20829== at 0x4C2F1C8: strcmp (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20829== by 0x422522: main (in /home/user/rocksgit/transfer/rocksdb-git/examples/c_simple_example)
==20829== Address 0x5f60df5 is 0 bytes after a block of size 5 alloc'd
==20829== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20829== by 0x4226D5: CopyString (c.cc:498)
==20829== by 0x423032: rocksdb_get (c.cc:730)
==20829== by 0x4224EB: main (in /home/user/rocksgit/transfer/rocksdb-git/examples/c_simple_example)
==20829==
==20829==
==20829== HEAP SUMMARY:
==20829== in use at exit: 77 bytes in 5 blocks
==20829== total heap usage: 4,491 allocs, 4,486 frees, 839,216 bytes allocated
==20829==
==20829== LEAK SUMMARY:
==20829== definitely lost: 5 bytes in 1 blocks
==20829== indirectly lost: 0 bytes in 0 blocks
==20829== possibly lost: 0 bytes in 0 blocks
==20829== still reachable: 72 bytes in 4 blocks
==20829== suppressed: 0 bytes in 0 blocks
==20829== Rerun with --leak-check=full to see details of leaked memory
==20829==
==20829== For counts of detected and suppressed errors, rerun with: -v
==20829== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Summary: I'm tired of double-tab when opening build_tools/<something>. This change will make bu<tab> fully complete my path :)
Test Plan: `vi bu<tab>` gives me `vi build_tools/` yay!
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30639
Summary:
Create a benchmark for testing memtablereps. This diff is a bit rough, but it should do the trick until other bootcampers can clean it up.
Addressing comments
Removed the mutexes
Changed ReadWriteBenchmark to fix number of reads and count the number of writes we can perform in that time.
Test Plan:
Run it.
Below runs pass
./memtablerep_bench --benchmarks fillrandom,readrandom --memtablerep skiplist
./memtablerep_bench --benchmarks fillseq,readseq --memtablerep skiplist
./memtablerep_bench --benchmarks readwrite,seqreadwrite --memtablerep skiplist --num_operations 200 --num_threads 5
./memtablerep_bench --benchmarks fillrandom,readrandom --memtablerep hashskiplist
./memtablerep_bench --benchmarks fillseq,readseq --memtablerep hashskiplist
--num_scans 2
./memtablerep_bench --benchmarks fillseq,readseq --memtablerep vector
Reviewers: jpaton, ikabiljo, sdong
Reviewed By: sdong
Subscribers: dhruba, ameyag
Differential Revision: https://reviews.facebook.net/D22683
Summary: Add comments in db.h to help users discover their options.
Test Plan: Compile
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31077
Summary: Add an extra assert to make sure current version is included in VersionSet::AddLiveFiles().
Test Plan: make all check
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, hermanlee4, leveldb
Differential Revision: https://reviews.facebook.net/D30819
Summary: During recovery, VersionBuilder::Apply() was called multiple times. If the DB is open for long enough, most of files added earlier will be deleted by later deletes. In current solution, sorting added file happens first and then deletes are applied. In this patch, deletes are applied when possible inside Apply(), which can significantly reduce the sorting time in some cases.
Test Plan:
Add unit tests in version_builder
valgrind_check
Open a manifest of 50MB, with 9K live files. The manifest read time reduced from 1.6 seconds to 0.7 seconds.
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30765
Summary:
This patch changes concurrency guarantees around ColumnFamilySet::column_families_ and ColumnFamilySet::column_families_data_.
Before:
* When mutating: lock DB mutex and spin lock
* When reading: lock DB mutex OR spin lock
After:
* When mutating: lock DB mutex and be in write thread
* When reading: lock DB mutex or be in write thread
That way, we eliminate the spin lock that protects these hash maps and simplify concurrency. That means we don't need to lock the spin lock during writing, since writing is mutually exclusive with column family create/drop (the only operations that mutate those hash maps).
With these new restrictions, I also needed to move column family create to the write thread (column family drop was already in the write thread).
Even though we don't need to lock the spin lock during write, impact on performance should be minimal -- the spin lock is almost never busy, so locking it is almost free.
This addresses task t5116919.
Test Plan:
make check
Stress test with lots and lots of column family drop and create:
time ./db_stress --threads=30 --ops_per_thread=5000000 --max_key=5000 --column_families=200 --clear_column_family_one_in=100000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress/
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30651
Summary: When trivial move commit is done, we log the summary of the input version instead of current. This is inconsistent with other log messages and confusing.
Test Plan: compiles
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30939
Summary:
A command line like this to run all the tests:
source benchmark.config.sh && nohup ./benchmark.sh 'bulkload,fillseq,overwrite,filluniquerandom,readrandom,readwhilewriting'
where
benchmark.config.sh is:
export DB_DIR=/data/mysql/rocksdata
export WAL_DIR=/txlogs/rockswal
export OUTPUT_DIR=/root/rocks_benchmarking/output
Will fail for the tests that need a new DB .
Also 1) set disable_data_sync=0 and 2) add debug mode to run through all the tests more quickly
Test Plan: run ./benchmark.sh 'debug,bulkload,fillseq,overwrite,filluniquerandom,readrandom,readwhilewriting' and verify that there are no complaints about WAL dir not being empty.
Reviewers: sdong, yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D30909
Summary:
Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery.
The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not only just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I have digged up old task t2416297 which confirms my findings.
Test Plan: There was actually no tests that verified correct behavior of skip_log_error_on_recovery.
Reviewers: yhchiang, rven, dhruba, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30603
Summary:
This is a serious bug. If paranod_check == true and WAL is corrupted, we don't fail DB::Open(). I tried going into history and it seems we've been doing this for a long long time.
I found this when investigating t5852041.
Test Plan: Added unit test to verify correct behavior.
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30597
* Use strtoul() and strtoull() instead of sscanf().
glibc's sscanf() will do a implicit strlen().
* Move implicit construction of Slice("crc32 ") out of loop.
Summary: CLANG was broken for a recent change in db_ench. Fix it.
Test Plan: Build db_bench using CLANG.
Reviewers: rven, igor, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30801
Summary:
Add structures for exposing events and operations. Event describes
high-level action about a thread such as doing compaciton or
doing flush, while an operation describes lower-level action
of a thread such as reading / writing a SST table, waiting for
mutex. Events and operations are designed to be independent.
One thread would typically involve in one event and one operation.
Code instrument will be in a separate diff.
Test Plan:
Add unit-tests in thread_list_test
make dbg -j32
./thread_list_test
export ROCKSDB_TESTS=ThreadList
./db_test
Reviewers: ljin, igor, sdong
Reviewed By: sdong
Subscribers: rven, jonahcohen, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29781
Summary: Having --num_hot_column_families default on fails some existing regression tests. By default turn it off
Test Plan: Run db_bench to make sure it is default off.
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D30705
Summary:
Add option --num_hot_column_families in db_bench. If it is set, write options will first write to that number of column families, and then move on to next set of hot column families. The working set of column families can be smaller than total number of CFs.
It is to test how RocksDB can handle cold column families
Test Plan: Run db_bench with --num_hot_column_families set and not set.
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30663