rocksdb

Author	SHA1	Message	Date
Jay	49d88be021	c abi: allow compaction filter ignore snapshot (#1268) close #1262	2016-08-17 18:48:43 -07:00
Anirban Rahut	2fc2fd92a9	Single Delete Mismatch and Fallthrough statistics Summary: Added 2 statistics in compaction job statistics, to identify if single deletes are not meeting a matching key (fallthrough) or single deletes are meeting a merge, delete or another single delete (i.e. not the expected case of put). Test Plan: Tested the statistics using write_stress and compaction_job_stats_test Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61749	2016-08-16 08:21:43 -07:00
Andrew Kryczka	3771e37970	WriteBatch support for range deletion Summary: Add API to WriteBatch to store range deletions in its buffer which are later added to memtable. In the WriteBatch buffer, a range deletion is encoded as "<optype><CF ID (optional)><begin key><end key>". With this diff, the range tombstones are stored inline with the data in the memtable. It's useful for now because the test cases rely on the data being accessible via memtable. My next step is to store range tombstones in a separate area in the memtable. Test Plan: unit tests Reviewers: IslamAbdelRahman, sdong, wanning Reviewed By: wanning Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61401	2016-08-16 08:16:04 -07:00
Islam AbdelRahman	64a0082c69	Fix DBSSTest::AddExternalSstFileSkipSnapshot valgrind fail Summary: Fix the test by releasing the last snapshot Test Plan: run the test under valgrind Reviewers: andrewkr, yiwu, lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62091	2016-08-15 14:04:40 -07:00
sdong	6525ce4caf	Compaction stats printing: "batch" => "commit group" Summary: "Batch" is ambiguous in this context. It can mean "write batch" or commit group. Change it to commit group to be clear. Test Plan: Build Reviewers: MarkCallaghan, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62055	2016-08-15 10:47:29 -07:00
Islam AbdelRahman	a297643f2e	Fix valgrind memory leak	2016-08-11 23:34:19 -07:00
Islam AbdelRahman	d11c09d9e2	Eliminate memcpy from ForwardIterator Summary: This diff updates ForwardIterator to support pinning keys and values, which allows DBIter to take advantage of that and eliminate memcpy when executing merge operators. This diff is stacked on D61305. Test Plan: existing tests (updated to also test the tailing iterator), new test Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60009	2016-08-11 19:10:16 -07:00
Islam AbdelRahman	b693ba68b5	Minor PinnedIteratorsManager Refactoring Summary: This diff includes these simple changes: - Rename ReleasePinnedIterators to ReleasePinnedData - Rename PinIteratorIfNeeded to PinIterator - Use std::vector directly in PinnedIteratorsManager instead of std::unique_ptr<std::vector> - Generalize PinnedIteratorsManager by adding PinPtr, which can pin any pointer Test Plan: existing tests Reviewers: sdong, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61305	2016-08-11 11:54:17 -07:00
sdong	4beffe001d	Fix test data race in two FaultInjectionTest tests Summary: Background sleeping tasks may conflict with test cleaning up. Wait for the sleeping tasks to finish before ending the test. Test Plan: Run these tests. Reviewers: andrewkr, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61827	2016-08-10 13:56:50 -07:00
sdong	56dd034115	read_options.background_purge_on_iterator_cleanup to cover forward iterator and log file closing too. Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too. Test Plan: I am adding unit tests. Reviewers: andrewkr, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61503	2016-08-10 13:16:41 -07:00
Islam AbdelRahman	ccecf3f4fb	UniversalCompaction should ignore sorted runs being compacted (when compacting for file num) Summary: If the total number of sorted runs is greater than level0_file_num_compaction_trigger, Universal compaction will always issue a compaction even if the number of sorted runs that are not being compacted is less than level0_file_num_compaction_trigger. This diff changes this behaviour to rely on the `number of sorted runs not being compacted` instead of the `total number of sorted runs`. Test Plan: New unit test Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61533	2016-08-10 12:37:43 -07:00
Zongzhi Chen	98d0b78eac	Added check_snapshot option in the DB's AddFile function (#1261) * Added check_snapshot option in the DB's AddFile function * change check_snapshot to skip_snapshot_check * add unit test for skip_snapshot_check * Add skip_snapshot_check comment	2016-08-09 18:14:13 -07:00
Yi Wu	7882cb9773	Make DBOptionsTest::EnableAutoCompactionAndTriggerStall less flaky Summary: Explicitly flush two times to generate two sst files. Test Plan: run the test. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61671	2016-08-05 16:45:57 -07:00
omegaga	c3a4bea5dc	Fix flaky test `ObsoleteFiles` Summary: The test `ObsoleteFiles` failed occasionally on slow devices. This problem appeared on Travis CI several times. The reason is that we did not wait until compaction jobs finished in the test, while on slower devices the background jobs take longer to finish. Test Plan: Pass existing tests. Reviewers: yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61479	2016-08-03 15:19:35 -07:00
Yi Wu	ee027fc19f	Ignore write stall triggers when auto-compaction is disabled Summary: My understanding is that the purpose of write stall triggers is to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes. Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions were recalculated only when write stall options were changed. Test Plan: See the new test. Removed two tests that are no longer valid. Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61437	2016-08-02 21:55:26 -07:00
Aaron Gao	343304e1d3	Use StopWatch to do statistic job in db_impl_add_file.cc Summary: patch for diff https://reviews.facebook.net/D58587 Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it. Test Plan: make all check -j64 Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61239	2016-08-02 14:53:29 -07:00
Jay Edgar	cdc4eb6892	Add a GetComparator() function to the ColumnFamilyHandle base class so that the user's comparator can be retrieved. Summary: MyRocks is adding support for the use of the SstFileWriter, which needs a comparator. It would be more convenient to get the comparator from the column family (which already has to have it) than to have the caller keep track of it. Test Plan: Standard tests (adding one for the new method) Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61155	2016-08-02 14:34:57 -07:00
Islam AbdelRahman	5e2c796589	Make DBTest.CompressionStatsTest more deterministic Summary: On non_shm tests where the storage device is slow, DBTest.CompressionStatsTest assumes that a flush has happened before it checks the number of compressed blocks. This is not always true if the flush is slow. Make the test more deterministic by forcing a flush before doing the check. Test Plan: Run the test locally Reviewers: andrewkr, yiwu, lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61317	2016-07-29 11:42:28 -07:00
Aaron Gao	e72ea485ed	add InDomain regression test Summary: regression tests to make sure seek keys not in domain would not fail assertion Test Plan: ``` [gzh@dev6163.prn2 ~/local/rocksdb] ./prefix_test --gtest_filter=SamePrefixTest.* /tmp/rocksdbtest-112628/prefix_test Note: Google Test filter = SamePrefixTest.* [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from SamePrefixTest [ RUN ] SamePrefixTest.InDomainTest [ OK ] SamePrefixTest.InDomainTest (211 ms) [----------] 1 test from SamePrefixTest (211 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (211 ms total) ``` Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61161	2016-07-27 18:45:53 -07:00
sdong	e5b5f12b81	Change options memtable_prefix_bloom_huge_page_tlb_size => memtable_huge_page_size and cover huge page to memtable too Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too. Test Plan: Run all existing tests. Reviewers: IslamAbdelRahman, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60513	2016-07-26 18:15:11 -07:00
sdong	0ce258f9b3	Compaction picker to expand output level files for keys cross files' boundary too. Summary: We may wrongly drop a delete operation if we pick a file with the entry to be deleted, the put entry of the same user key is in the next file in the level, and the next file is not picked. We expand compaction inputs for the output level too. Test Plan: Add unit tests that reproduce the bug of dropping the delete entry. Change compaction_picker_test to assert the new behavior. Reviewers: IslamAbdelRahman, igor Reviewed By: igor Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61173	2016-07-26 17:56:36 -07:00
Wanning Jiang	e12270dfee	fix previous typo Summary: old typos with FILTER/INDEX_CACHE Test Plan: still pass this unit test Reviewers: andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61185	2016-07-26 11:15:14 -07:00
Islam AbdelRahman	16e225f70d	Fix MergeContext::copied_operands_ strings moving Summary: MergeContext::copied_operands_ contains strings that MergeContext::operand_list_ Slices point to. It's possible that when MergeContext::copied_operands_ grows, these strings are moved and their place in memory changes, which will cause MergeContext::operand_list_ to point to invalid memory. Fix this problem by using unique_ptr<string> instead of string. Test Plan: run tests under mac/clang Reviewers: sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61023	2016-07-25 15:31:41 -07:00
Yi Wu	ae0ad719de	Fix flaky DBSSTTEST::DeleteObsoleteFilesPendingOutputs Summary: The test is flaky on Travis in the OS X environment. The background flush that the test wants to block can run behind the L2 manual compaction, so the test actually blocks the L2 compaction and is unable to proceed. Test Plan: Test run on travis Reviewers: kradhakrishnan, sdong, andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61101	2016-07-25 15:09:34 -07:00
Yi Wu	c6654588bd	Disable two dynamic options tests under lite build Summary: RocksDB lite doesn't support dynamic options. Disable the two tests in the lite build, and assert that `SetOptions` returns `status::OK`. Test Plan: Run the db_options test under lite build and normal build. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61119	2016-07-25 11:48:17 -07:00
sdong	2a6d0cde72	Ignore stale logs while restarting DBs Summary: Stale log files can be deleted out of order. This can happen for various reasons. One of the reasons is that no data is ever inserted to a column family and we have an optimization to update its log number, but not all the old log files are cleaned up (the case shown in the unit tests added). It can also happen when we simply delete multiple log files out of order. This causes data corruption because we simply increase seqID after processing the next row and we may end up writing data with a smaller seqID than what is already flushed to memtables. In DB recovery, for the oldest files we are replaying, if a file contains no data for any column family, we ignore the sequence IDs in the file. Test Plan: Add two unit tests that fail without the fix. Reviewers: IslamAbdelRahman, igor, yiwu Reviewed By: yiwu Subscribers: hermanlee4, yoshinorim, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60891	2016-07-25 11:47:31 -07:00
sdong	d5a51d4de3	Need to make sure log file synced before flushing memtable of one column family Summary: Multiput atomicity is broken across multiple column families if we don't sync the WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key in the CF to be flushed and a key in another CF. If we don't sync the WAL before flushing and the machine crashes after flushing, the write batch will be only partially recovered. Data in the other CFs is lost. Test Plan: Add a new unit test which will fail without the diff. Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu Reviewed By: yiwu Subscribers: yiwu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60915	2016-07-21 16:29:06 -07:00
Yi Wu	89f319c2df	Fix unit test which breaks lite build Summary: Comment out assertion of number of table files from lite build. Test Plan: OPT=-DROCKSDB_LITE make check Reviewers: lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60999	2016-07-21 14:32:12 -07:00
Yi Wu	32604e6601	Fix flush not being committed while writing manifest Summary: Fix flush not being committed while writing manifest, which is a recent bug introduced by D60075. The issue: # Options.max_background_flushes > 1 # Background thread A picks up a flush job, flushes, then commits to manifest. (Note that the mutex is released before writing manifest.) # Background thread B picks up another flush job and flushes. When it gets to `MemTableList::InstallMemtableFlushResults`, it notices another thread is committing, so it quits. # After the first commit, thread A doesn't double check whether there are more flush results that need to be committed, leaving the second flush uncommitted. Test Plan: run the test. Also verify the new test hits deadlock without the fix. Reviewers: sdong, igor, lightmark Reviewed By: lightmark Subscribers: andrewkr, omegaga, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60969	2016-07-21 10:10:41 -07:00
John Alexander	9ab38c45ad	Remove %z Format Specifier and Fix Windows Build of sim_cache.cc (#1224) * Replace %zu format specifier with Windows-compatible macro 'ROCKSDB_PRIszt' * Added "port/port.h" include to sim_cache.cc for call to snprintf(). * Applied cleaner fix to windows build, reverting part of `7bedd94`	2016-07-20 15:28:04 -07:00
omegaga	e70020e4f6	Only cache level 0 indexes and filter when opening table reader Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1). Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?) Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59913	2016-07-20 11:23:31 -07:00
Islam AbdelRahman	68a8e6b8fa	Introduce FullMergeV2 (eliminate memcpy from merge operators) Summary: This diff updates the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost. To do that we need a new public API for FullMerge that replaces the std::deque<std::string> with std::vector<Slice>. This diff is stacked on top of D56493 and D56511. In this diff we - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future - Replace std::deque<std::string> with std::vector<Slice> to pass operands - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187) - Allow FullMergeV2 output to be an existing operand ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 1 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s [master] readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s ``` ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s [master] readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 1 operand per key] [FullMergeV2] $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s [master] readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions [FullMergeV2] readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s [master] readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: lovro, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57075	2016-07-20 09:49:03 -07:00
sdong	e70ba4e40e	MemTable::PostProcess() can skip updating num_deletes if the delta is 0 Summary: In many use cases there is no deletes. No need to pay the overhead of atomically updating num_deletes. Test Plan: Run existing test. Reviewers: ngbronson, yiwu, andrewkr, igor Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60555	2016-07-19 18:10:18 -07:00
sdong	2a282e5f54	DBTablePropertiesTest.GetPropertiesOfTablesInRange: Fix Flaky Summary: There is a possibility that there is no L0 file after writing the data. Generate an L0 file to make it work. Test Plan: Run the test many times. Reviewers: andrewkr, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60825	2016-07-19 15:46:20 -07:00
John Alexander	9430333f84	New Statistics to track Compression/Decompression (#1197) * Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer). * Prevent incrementing compression counter when compression is turned off in options. * Added two more supported compression types to test code in db_test.cc * Added new StatsLevel that excludes compression timing. * Fixed casting error in coding.h * Fixed CompressionStatsTest for new StatsLevel. * Removed unused variable that was breaking the Linux build	2016-07-19 09:44:03 -07:00
sdong	21c55bdb6e	DBTest.DynamicLevelCompressionPerLevel: Tune Threshold Summary: Each SST's file size increases after we add more table properties. Threshold in DBTest.DynamicLevelCompressionPerLevel need to adjust accordingly to avoid occasional failures. Test Plan: Run the test Reviewers: andrewkr, yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60819	2016-07-15 16:10:09 -07:00
sdong	6797e6ffac	Avoid updating memtable allocated bytes if write_buffer_size is not set Summary: If neither options.write_buffer_size nor options.write_buffer_manager is set, there is no need to update the bytes allocated counter in MemTableAllocator, which is expensive in the parallel memtable insert case. Removing it can improve parallel memtable insert throughput by 10% with write batch size 128. Test Plan: Run benchmarks TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=32 --batch_size=128 -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield The throughput grows 10% with the benchmark. Reviewers: andrewkr, yiwu, IslamAbdelRahman, igor, ngbronson Reviewed By: ngbronson Subscribers: ngbronson, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60465	2016-07-13 19:33:57 -07:00
Aaron Gao	dda6c72ac8	Add DestroyColumnFamilyHandle(ColumnFamilyHandle) to db.h Summary: add DestroyColumnFamilyHandle(ColumnFamilyHandle) to close column family instead of deleting cfh* User should call this to close a cf and then we can detect the deletion in this function. Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60765	2016-07-13 17:59:25 -07:00
Andrew Kryczka	56222f57df	Avoid FileMetaData copy Summary: as titled Test Plan: unit tests Reviewers: sdong, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60597	2016-07-13 15:36:22 -07:00
Yi Wu	6ea41f8527	Fix deadlock when trying update options when write stalls Summary: When writes stall because auto compaction is disabled, or the stop write trigger is reached, the user may change these two options to unblock writes. Unfortunately we had an issue where the write thread would block the attempt to persist the options, thus creating a deadlock. This diff fixes the issue and adds two test cases to detect such a deadlock. Test Plan: Run unit tests. Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60627	2016-07-12 15:30:38 -07:00
Jay Edgar	efd013d6d8	Miscellaneous performance improvements Summary: I was investigating performance issues in the SstFileWriter and found all of the following: - The SstFileWriter::Add() function created a local InternalKey every time it was called generating an allocation and free each time. Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function. - In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key. - The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed. I added the move constructor which also required a copy constructor and assignment operator. - The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added. I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value. - The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function. - BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key). It is more efficient just to copy the whole new key over. - Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on. - FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size or block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time. - Each PutVarint??() function has a buffer and calls std::string::append(). Two or three calls in a row could share a buffer and a single call to std::string::append(). Some of these will be helpful outside of the SstFileWriter. I'm not 100% sure the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility? I tried it on gcc 4.8 and 4.9. Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added. The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas. Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59607	2016-07-12 14:15:32 -07:00
Aaron Gao	816ae098ea	fix test failure Summary: fix Rocksdb Unit Test USER_FAILURE Test Plan: make all check -j64 Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60603	2016-07-11 13:33:52 -07:00
Aaron Gao	8e6b38d895	update DB::AddFile to ingest list of sst files Summary: The DB::AddFile(std::string file_path) API allows users to ingest an SST file created using SstFileWriter. We want to update this interface to accept a list of files to be ingested, DB::AddFile(std::vector<std::string> file_path_list). Test Plan: Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure: 1. files' key ranges do not overlap with each other 2. each file's key range doesn't overlap with the DB key range 3. no snapshots are held Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58587	2016-07-11 10:43:12 -07:00
Yi Wu	296545a2c7	Fix clang analyzer errors Summary: Fixing errors reported by clang static analyzer. * Removing some unused variables. * Adding assertions to fix false positives reported by clang analyzer. * Adding `__clang_analyzer__` macro to suppress false positive warnings. Test Plan: USE_CLANG=1 OPT=-g make analyze -j64 Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60549	2016-07-08 17:50:51 -07:00
sdong	907f24d0e1	Concurrent memtable inserter to update counters and flush state after all inserts Summary: In the concurrent memtable insert case, updating counters in MemTable::Add() can account for 5% CPU usage. By batching all the counters and updating them at the end of the write batch, the CPU overhead is reduced in use cases where more than one key is updated in one write batch. Test Plan: Write throughput increases 12% with this benchmark setting: TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=64 --batch_size=128 -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield Reviewers: andrewkr, IslamAbdelRahman, ngbronson, igor Reviewed By: ngbronson Subscribers: ngbronson, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60495	2016-07-08 10:19:55 -07:00
Andrew Kryczka	e1b3ee8a79	Cleanup auto-roll logger flush-while-rolling test Summary: Use @omegaga's awesome feature to avoid use of callbacks for ensuring SyncPoints happen in a particular thread. Depends on D60375. Test Plan: $ ./auto_roll_logger_test Reviewers: omegaga, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, omegaga, leveldb Differential Revision: https://reviews.facebook.net/D60471	2016-07-07 11:35:40 -07:00
omegaga	cd4178a015	Add a new feature to enforce a sync point only active on a thread Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point. Test Plan: Write a unit test to validate. Reviewers: sdong, IslamAbdelRahman, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60375	2016-07-07 11:29:14 -07:00
Gunnar Kudrjavets	b954847fca	Fix release build for MyRocks by using debug-only code only in debug builds Summary: MyRocks release integration build breaks because we treat warnings caused by unused variables as errors. Variable `edit` is only used in debug builds. Therefore we need to guard it using `#ifndef NDEBUG` check. Test Plan: - `[p]arc diff --preview` for the default validation. - Verify that release build fails before this fix and passes after applying it. Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60423	2016-07-06 16:07:53 -07:00
sdong	a00bf1b3cf	Add More Logging to track total_log_size Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again. Test Plan: Run the unit test and see the logging. Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60081	2016-07-06 14:29:18 -07:00
sdong	32df9733d1	Add options.write_buffer_manager: control total memtable size across DB instances Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances. Test Plan: Add a new unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59925	2016-07-05 18:11:25 -07:00
Aaron Gao	5aaef91d4a	group multiple batch of flush into one manifest file (one call to LogAndApply) Summary: Currently, if several flush outputs are committed together, we issue each manifest write per batch (1 batch = 1 flush = 1 sst file = 1+ continuous memtables). Each manifest write requires one fsync and one fsync to parent directory. In some cases, it becomes the bottleneck of write. We should batch them and write in one manifest write when possible. Test Plan: ` ./db_bench -benchmarks="fillseq" -max_write_buffer_number=16 -max_background_flushes=16 -disable_auto_compactions=true -min_write_buffer_number_to_merge=1 -write_buffer_size=65536 -level0_stop_writes_trigger=10000 -level0_slowdown_writes_trigger=10000` Before ``` Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 4.9 Date: Fri Jul 1 15:38:17 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 166.277 micros/op 6014 ops/sec; 0.7 MB/s ``` After ``` Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 4.9 Date: Fri Jul 1 15:35:05 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second 
Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 52.328 micros/op 19110 ops/sec; 2.1 MB/s ``` Reviewers: andrewkr, IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: igor, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60075	2016-07-05 18:09:59 -07:00
omegaga	a45ee83181	Fix a bug that accesses invalid address in iterator cleanup function Summary: Reported in T11889874. When registering the cleanup function we should copy the option so that we can still access it if ReadOptions is deleted. Test Plan: Add a unit test to reproduce this bug. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60087	2016-07-05 11:57:14 -07:00
Gunnar Kudrjavets	bdb1d19a69	Fix UBSan build break caused by variable not initialized Summary: UBSan is unhappy because `cfd` is not initialized. This breaks the UBSan build, which in turn breaks MyRocks continuous integration with RocksDB, which in turn makes me unhappy :-) Fix this. Test Plan: - `[p]arc diff --preview` + Sandcastle. - Verify that `COMPILE_WITH_UBSAN=1 OPT=-g make J=1 ubsan_check` gets past the break. Reviewers: andrewkr, hermanlee4, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60117	2016-06-29 10:49:25 -07:00
sdong	c4cef07f1b	Update DBTestUniversalCompaction.UniversalCompactionSingleSortedRun to use max_size_amplification_percent = 0 Summary: With max_size_amplification_percent = 0 to make sure that DBTestUniversalCompaction.UniversalCompactionSingleSortedRun tests the configuration to compact to one single sorted run. Test Plan: Run all existing tests Reviewers: yhchiang, andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60021	2016-06-27 15:19:27 -07:00
charsyam	4f2b0946d1	fix simple typos (#1183 )	2016-06-25 08:29:40 +01:00
Andrew Kryczka	3b7ed677de	ColumnFamilyOptions API [CF + RepairDB part 3/3] Summary: Overload RepairDB to take vector-of-ColumnFamilyDescriptor, which tells us CF name + options. Also takes a ColumnFamilyOptions for unspecified column families encountered during the repair. One potentially confusing thing is that we store options in the constructor and don't invoke AddColumnFamily() until discovering the CF in ScanTable. This is because we don't know the CF ID until we find a table belonging to that CF. Depends on D59781. Test Plan: $ ./repair_test Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D59853	2016-06-24 16:29:43 -07:00
Andrew Kryczka	56ac686292	Detect column family from properties [CF + RepairDB part 2/3] Summary: This diff uses the CF ID and CF name properties in the SST file to associate recovered data with the proper column family. Depends on D59775. - In ScanTable(), create column families in VersionSet each time a new one is discovered (via reading SST file properties) - In ConvertLogToTable(), dump an SST file for every column family with data in the WAL - In AddTables(), make a VersionEdit per-column family that adds all of that CF's tables Test Plan: $ ./repair_test Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D59781	2016-06-24 13:12:13 -07:00
Andrew Kryczka	343507afb1	Refactor to use VersionSet [CF + RepairDB part 1/3] Summary: To support column families, it is easiest to use VersionSet to manage our column families (if we don't have Versions then ColumnFamilyData always behaves as a dummy column family). This diff only refactors the existing repair logic to use VersionSet; the next two parts will add support for multiple column families. Test Plan: $ ./repair_test Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D59775	2016-06-24 11:19:40 -07:00
omegaga	c4e19b77e8	Add a read option to enable background purge when cleaning up iterators Summary: Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators. Instead, a job is scheduled in high priority queue and would be executed in a separate background thread. Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background. Reviewers: IslamAbdelRahman, sdong Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59499	2016-06-21 18:41:23 -07:00
Islam AbdelRahman	fa813f7478	Update DB::AddFile() to ingest the file to the lowest possible level Summary: DB::AddFile() currently always adds the ingested file to L0. Update the logic to add the file to the lowest possible level. Test Plan: unit tests Reviewers: jkedgar, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D59637	2016-06-21 17:57:59 -07:00
sdong	7b79238b65	Deprecate filter_deletes Summary: filter_deletes is not a frequently used feature. Remove it. Test Plan: Run all test suites. Reviewers: igor, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59427	2016-06-17 10:30:47 -07:00
Islam AbdelRahman	30a24f2d3d	Add InternalStats and logging for AddFile() Summary: We don't report the bytes ingested through AddFile(), which makes the write amplification numbers incorrect. Update InternalStats and add logging for AddFile(). Test Plan: Make sure the code compiles and existing tests pass Reviewers: lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59763	2016-06-16 16:21:41 -07:00
sdong	249e796dfc	Fix Flaky DBCompactionTest.SkipStatsUpdateTest Summary: DBCompactionTest.SkipStatsUpdateTest sometimes fails. I don't see any verification related to the deletes issued. Remove them to avoid the uncertainty. Test Plan: Run the test. Reviewers: IslamAbdelRahman, andrewkr, yhchiang Reviewed By: yhchiang Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59613	2016-06-15 12:00:51 -07:00
Islam AbdelRahman	f5177c761f	Remove wasteful instrumentation in FullMerge (stacked on D59577) Summary: [ This diff is stacked on top of D59577 ] We keep calling timer.ElapsedNanos() on every call to MergeOperator::FullMerge even when statistics are disabled, this is wasteful. I run the readseq benchmark on a DB containing 100K merge operands for 100K keys (1 operand per key) with 1GB block cache I see slight performance improvment Original results ``` $ ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=100000 --num=100000 --db="/dev/shm/100K_merge_compacted/" --cache_size=1073741824 --use_existing_db --disable_auto_compactions ------------------------------------------------ DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.498 micros/op 2006597 ops/sec; 222.0 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.295 micros/op 3393627 ops/sec; 375.4 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.285 micros/op 3511155 ops/sec; 388.4 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.286 micros/op 3500470 ops/sec; 387.2 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.283 micros/op 3530751 ops/sec; 390.6 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.289 micros/op 3464811 ops/sec; 383.3 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.277 micros/op 3612814 ops/sec; 399.7 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.283 micros/op 3539640 ops/sec; 391.6 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.285 micros/op 3503766 ops/sec; 387.6 MB/s ``` After patch ``` $ ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=100000 --num=100000 --db="/dev/shm/100K_merge_compacted/" --cache_size=1073741824 --use_existing_db --disable_auto_compactions ------------------------------------------------ DB path: 
[/dev/shm/100K_merge_compacted/] readseq : 0.476 micros/op 2100119 ops/sec; 232.3 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.278 micros/op 3600887 ops/sec; 398.4 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.275 micros/op 3636698 ops/sec; 402.3 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.271 micros/op 3691661 ops/sec; 408.4 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.273 micros/op 3661534 ops/sec; 405.1 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.276 micros/op 3627106 ops/sec; 401.3 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.272 micros/op 3682635 ops/sec; 407.4 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.266 micros/op 3758331 ops/sec; 415.8 MB/s DB path: [/dev/shm/100K_merge_compacted/] readseq : 0.266 micros/op 3761907 ops/sec; 416.2 MB/s ``` Test Plan: make check -j64 Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59583	2016-06-13 16:22:14 -07:00
Islam AbdelRahman	7c919deccc	Reuse TimedFullMerge instead of FullMerge + instrumentation Summary: We have a lot of code duplication: whenever we call FullMerge, we duplicate the instrumentation and statistics code. This is a simple diff that refactors the code to use TimedFullMerge instead of FullMerge. Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59577	2016-06-13 16:17:26 -07:00
Yi Wu	bc8af90e8c	add option to not flush memtable on open() Summary: Add option to not flush memtable on open() In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST. Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable. If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery. Test Plan: See unit test. Reviewers: dhruba, horuff, sdong Reviewed By: sdong Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57813	2016-06-13 11:34:16 -07:00
sdong	6faddd7c55	Merge db/slice.cc into util/slice.cc Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them. Test Plan: Run existing tests Reviewers: andrewkr, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59409	2016-06-10 16:37:36 -07:00
sdong	5009b5326b	BlockBasedTable::FullFilterKeyMayMatch() Should skip prefix bloom if full key bloom exists Summary: Currently, if users define both of full key bloom and prefix bloom in SST files. During Get(), if full key bloom shows the key may exist, we still go ahead and check prefix bloom. This is wasteful. If bloom filter for full keys exists, we should always ignore prefix bloom in Get(). Test Plan: Run existing tests Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57825	2016-06-10 16:27:56 -07:00
sdong	20699df843	memtable_prefix_bloom_bits -> memtable_prefix_bloom_bits_ratio and deprecate memtable_prefix_bloom_probes Summary: memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options. It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio Test Plan: Run all existing tests Reviewers: yhchiang, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59199	2016-06-10 12:12:10 -07:00
Wanning Jiang	56887f6cb8	Backup Options Summary: Back up the options file to the private directory. Test Plan: backupable_db_test.cc, BackupOptions Modify DB options by calling OpenDB 3 times. Check that the latest options file is in the right place. Also check that no redundant files are backed up. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D59373	2016-06-09 19:03:10 -07:00
Anirban Rahut	a73b26f601	Adding test for contiguous WAL detection Summary: Add a test to detect that when WAL gets truncated, seq no's are checked to be contiguous. This test is put in ColumnFamilyTest as it has the necessary infrastructure/functions for flushing column families, which we use to ensure 2 active WAL files Test Plan: This is a test, no feature has been added. This test fails today and hence disabled Reviewers: sdong Reviewed By: sdong Subscribers: lgalanis, dhruba, andrewkr, pritamdamania Differential Revision: https://reviews.facebook.net/D59253	2016-06-07 18:04:15 -07:00
Aaron Gao	e532877940	Add statistics field to show total size of index and filter blocks in block cache Summary: With `table_options.cache_index_and_filter_blocks = true`, index and filter blocks are stored in block cache. Then people are curious how much of the block cache total size is used by indexes and bloom filters. It will be nice we have a way to report that. It can help people tune performance and plan for optimized hardware setting. We add several enum values for db Statistics. BLOCK_CACHE_INDEX/FILTER_BYTES_INSERT - BLOCK_CACHE_INDEX/FILTER_BYTES_ERASE = current INDEX/FILTER total block size in bytes. Test Plan: write a test case called `DBBlockCacheTest.IndexAndFilterBlocksStats`. The result is: ``` [gzh@dev9927.prn1 ~/local/rocksdb] make db_block_cache_test -j64 && ./db_block_cache_test --gtest_filter=DBBlockCacheTest.IndexAndFilterBlocksStats Makefile:101: Warning: Compiling in debug mode. Don't use the resulting binary in production GEN util/build_version.cc make: `db_block_cache_test' is up to date. Note: Google Test filter = DBBlockCacheTest.IndexAndFilterBlocksStats [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBBlockCacheTest [ RUN ] DBBlockCacheTest.IndexAndFilterBlocksStats [ OK ] DBBlockCacheTest.IndexAndFilterBlocksStats (689 ms) [----------] 1 test from DBBlockCacheTest (689 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (689 ms total) [ PASSED ] 1 test. ``` Reviewers: IslamAbdelRahman, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58677	2016-06-03 10:47:47 -07:00
Jan Doms	02ec8154e5	allow updating block cache capacity from C (#1149 )	2016-06-03 14:04:51 +01:00
Andrew Kryczka	842958651f	Fix race condition in SwitchMemtable Summary: MemTableList::current_ could be written by background flush thread and simultaneously read in the user thread (NumNotFlushed() is used in SwitchMemtable()). Use the lock to prevent this case. Found the error from tsan. Related: D58833 Test Plan: $ OPT=-g COMPILE_WITH_TSAN=1 make -j64 db_test $ TEST_TMPDIR=/dev/shm/rocksdb ./db_test --gtest_filter=DBTest.RepeatedWritesToSameKey Reviewers: lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D59139	2016-06-02 17:11:45 -07:00
PraveenSinghRao	3a276b0cbe	Add a callback for when memtable is moved to immutable (#1137 ) * Create a callback for memtable becoming immutable Create a callback for memtable becoming immutable Create a callback for memtable becoming immutable moved notification outside the lock Move sealed notification to unlocked portion of SwitchMemtable * fix lite build	2016-06-02 11:57:31 -07:00
Mike Kolupaev	936973d145	Small tweaks to logging to track the number of immutable memtables Summary: We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds: - logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables, - logging of number of remaining immutable memtables after a flush. Test Plan: ran tests Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58833	2016-06-01 11:11:33 -07:00
siddontang	21c047ab49	add readahead size option (#1146 )	2016-06-01 10:48:50 -07:00
Reid Horuff	5d85fdb2c5	add missing lock	2016-05-31 12:26:48 -07:00
sdong	345fd73faf	Fix flaky DBTestDynamicLevel.DynamicLevelMaxBytesBase2 Summary: We added more table properties for each SST file, so when using a 2KB SST file size, the estimated size of SST files is off by almost half, and the LSM tree structure is not as expected. Fix it by making the file size 4x the previous value, as well as the LSM base size. Also avoid sleep-based synchronization and use sync points instead. Test Plan: Run parallel unit tests multiple times and make sure they always pass. Reviewers: IslamAbdelRahman, kradhakrishnan Reviewed By: kradhakrishnan Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58749	2016-05-26 10:13:24 -07:00
krad	8fc75de327	Minor fix to disable DynamicLevelMaxBytesBase2	2016-05-24 17:45:50 -07:00
Ashish Shenoy	99765ed855	Clean up the ComputeCompactionScore() API Summary: Make CompactionOptionsFIFO a part of mutable_cf_options Test Plan: UT Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, lgalanis, dhruba Differential Revision: https://reviews.facebook.net/D58653	2016-05-23 15:55:29 -07:00
Shen Li	def2f7bd0e	Expose report_bg_io_stats option in the C API. (#1131 )	2016-05-23 13:13:47 -07:00
siddontang	8f1214531e	C API: Expose DeleteFileInRange (#1132 )	2016-05-23 04:19:47 -07:00
Sage Weil	11f329bd40	db/db_impl: restrict WALRecoveryMode when using recycled log files kPointInTimeRecovery is indistinguishable from kTolerateCorruptedTailRecords in recycle mode since we define the "end" of the log as the first corrupt record we encounter. kAbsoluteConsistency doesn't make sense because even a clean shutdown leaves old junk at the end of the log file. Signed-off-by: Sage Weil <sage@redhat.com>	2016-05-22 22:00:15 -07:00
Sage Weil	2b2a898e0b	db/log_reader: combine kBadRecord{Len,Checksum} for readability These vary only by the corruption string reported. Signed-off-by: Sage Weil <sage@redhat.com>	2016-05-22 22:00:15 -07:00
Sage Weil	34df1c94d5	db/log_reader: treat bad record length or checksum as EOF If we are in kTolerateCorruptedTailRecords, treat these errors as the end of the log. This is particularly important for recycled logs, where we will regularly see corrupted headers (bad length or checksum) when replaying a log. If we are aligned with a block boundary or get lucky, we will land on an old header and see the log number mismatch, but more commonly we will land midway through some previous block and record and effectively see noise. These must be treated as the end of the log in order for recycling to work. This makes the LogTest.Recycle/1 test pass. We also modify a number of existing tests because the recycled log files behave fundamentally differently in that they always stop when they reach the first bad record. Signed-off-by: Sage Weil <sage@redhat.com>	2016-05-22 22:00:15 -07:00
Sage Weil	7947aba68c	db/log_reader: move kBadRecord{Len,Checksum} handling into ReadRecord The behavior here needs to depend on the WAL recovery mode. No functional change in this patch. Signed-off-by: Sage Weil <sage@redhat.com>	2016-05-22 22:00:15 -07:00
Sage Weil	847e471db6	db/log_test: add recycle log test This currently fails because we do not properly map a corrupt header to the logical end of the log. Signed-off-by: Sage Weil <sage@redhat.com>	2016-05-22 22:00:15 -07:00
Aaron Orenstein	2073cf3775	Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions. Summary: Reduce use of argument-dependent name lookup in RocksDB. Test Plan: 'make check' passed. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58203	2016-05-20 07:42:18 -07:00
Richard Cairns Jr	f6e404c20a	Added "number of merge operands" to statistics in ssts. Summary: A couple of notes from the diff: - The namespace block I added at the top of table_properties_collector.cc was in reaction to an issue i was having with PutVarint64 and reusing the "val" string. I'm not sure this is the cleanest way of doing this, but abstracting this out at least results in the correct behavior. - I chose "rocksdb.merge.operands" as the property name. I am open to suggestions for better names. - The change to sst_dump_tool.cc seems a bit inelegant to me. Is there a better way to do the if-else block? Test Plan: I added a test case in table_properties_collector_test.cc. It adds two merge operands and checks to make sure that both of them are reflected by GetMergeOperands. It also checks to make sure the wasPropertyPresent bool is properly set in the method. Running both of these tests should pass: ./table_properties_collector_test ./sst_dump_test Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58119	2016-05-19 14:24:48 -07:00
omegaga	3c69f77c67	Move IO failure test to separate file Summary: This is a part of effort to reduce the size of db_test.cc. We move the following tests to a separate file `db_io_failure_test.cc`: * DropWrites * DropWritesFlush * NoSpaceCompactRange * NonWritableFileSystem * ManifestWriteError * PutFailsParanoid Test Plan: Run `make check` to see if the tests are working properly. Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58341	2016-05-18 17:09:20 -07:00
Islam AbdelRahman	c70a9335de	Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles() Summary: NotifyOnCompactionCompleted can unlock the mutex. That means we can schedule a background compaction that will start before we call ReleaseCompactionFiles(). Test Plan: added unit test, existing unit tests Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58065	2016-05-18 14:56:30 -07:00
Reid Horuff	a6254f2bd4	Long outstanding prepare test Summary: This tests that a prepared transaction is not lost after several crashes, restarts, and memtable flushes. Test Plan: TwoPhaseLongPrepareTest Reviewers: sdong Subscribers: hermanlee4, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58185	2016-05-17 18:57:06 -07:00
Aaron Gao	43afd72bee	[rocksdb] make more options dynamic Summary: make more ColumnFamilyOptions dynamic: - compression - soft_pending_compaction_bytes_limit - hard_pending_compaction_bytes_limit - min_partial_merge_operands - report_bg_io_stats - paranoid_file_checks Test Plan: Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit. All passed. Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57519	2016-05-17 13:11:56 -07:00
Islam AbdelRahman	f6aedb62c0	Fix Transaction memory leak Summary: - Make sure we clean up recovered_transactions_ on DBImpl destructor - delete leaked txns and env in TransactionTest Test Plan: Run transaction_test under valgrind Reviewers: sdong, andrewkr, yhchiang, horuff Reviewed By: horuff Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58263	2016-05-16 16:32:55 -07:00
krad	a08c8c851a	Added PersistentCache abstraction Summary: Added a new abstraction to cache page to RocksDB designed for the read cache use. RocksDB current block cache is more of an object cache. For the persistent read cache project, what we need is a page cache equivalent. This changes adds a cache abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache uncompressed pages or raw pages (content as in filesystem). The user can choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode. Blame Rev: Test Plan: Run unit tests Reviewers: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55707	2016-05-15 22:17:18 -07:00
Reid Horuff	a400336398	TransactionLogIterator sequence gap fix Summary: DBTestXactLogIterator.TransactionLogIterator was failing due to sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs. Test Plan: db_log_iter_test Reviewers: andrewkr Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman Differential Revision: https://reviews.facebook.net/D58053	2016-05-12 13:54:08 -07:00
Islam AbdelRahman	560358dc93	Fix data race in GetObsoleteFiles() Summary: GetObsoleteFiles() and LogAndApply() both modify the obsolete_manifests_ vector, so we need to make sure that the mutex is held when we modify obsolete_manifests_. Test Plan: run the test under TSAN Reviewers: andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58011	2016-05-10 19:30:09 -07:00
Reid Horuff	c27061dae7	[rocksdb] 2PC double recovery bug fix Summary: 1. prepare() 2. crash 3. recover 4. commit() 5. crash 6. data is lost This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0. The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0. Test Plan: Provided test. Reviewers: sdong Subscribers: andrewkr, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57729	2016-05-10 14:06:07 -07:00
Reid Horuff	a657ee9a9c	[rocksdb] Recovery path sequence miscount fix Summary: Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert. [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)] [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)] [4: COMMIT(a)] [7: COMMIT(b)] The first two batches do not consume any sequence numbers so are both prefixed with seq=1. For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL. We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries. This is a valid WAL. Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains. We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b)) So, now upon recovery we must track the actual consumption of sequence numbers. In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct? Test Plan: provided test. Reviewers: sdong Subscribers: andrewkr, leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D57645	2016-05-10 14:06:07 -07:00

2456 Commits