rocksdb

Author	SHA1	Message	Date
Victor Tyutyunov	e5c614e1df	Fixing snapshot 0 assertion Summary: Solution is not to change db sequence number to start from 1 because 0 value is used in multiple other places. Fix covers only compact_iterator::findEarliestVisibleSnapshot with updated logic to support snapshot's numbering starting from 0. Test Plan: run: make all check it should pass all tests Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: lgalanis, mgalushka, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56601	2016-04-16 01:47:15 -07:00
sdong	6d436a3f85	DBTest.HardLimit made more deterministic Summary: In DBTest.HardLimit, multiple flushes may merge into one, based on thread scheduling. Avoid it by waiting each flush to finish before generating the next one. Test Plan: Run test in parallel several times and see it doesn't fail any more. Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yiwu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56853	2016-04-15 17:36:57 -07:00
sdong	9d35ae649e	Make DBTestUniversalCompaction.IncreaseUniversalCompactionNumLevels more deterministic Summary: DBTestUniversalCompaction, IncreaseUniversalCompactionNumLevels fails one in about 30 runs when running in parallel. We wait for compaction after each flush to make the compaction behavior deterministic. Test Plan: Run the test 1000 times in parallel and it still passes. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: kradhakrishnan, yiwu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56841	2016-04-15 16:16:53 -07:00
sdong	4b6833aec1	Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too. Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it. Test Plan: Try db_bench and see the stats are printed out. Reviewers: yhchiang Reviewed By: yhchiang Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56769	2016-04-15 10:22:18 -07:00
Dmitri Smirnov	c9d668c584	Fix unit tests issues on Windows (#1078 )	2016-04-14 17:33:53 -07:00
sdong	535af525d6	BlockBasedTable::PrefixMayMatch() to skip index checking if we can't find a filter block. Summary: In the case where we can't find a filter block, there is not much benefit of doing the binary search and see whether the index key has the prefix. With the change, we blindly return true if we can't get the filter. It also fixes missing row cases for reverse comparator with full bloom. Test Plan: Add a test case that used to fail. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: kradhakrishnan, yiwu, hermanlee4, yoshinorim, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56697	2016-04-13 19:06:48 -07:00
Islam AbdelRahman	19ef3de57e	Fix ManualCompactionPartial test flakiness Summary: The reason for this test flakiness is that we try to verify that number of files in L0 is 3 after flushing the 3rd file although we may have a compaction running in the background that may finish before we do the check and the 3 L0 files are converted to 1 L1 file Test Plan: Run a modified version of the test that sleep before doing the check Reviewers: sdong, andrewkr, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56643	2016-04-13 10:38:45 -07:00
sdong	dff4c48ede	BlockBasedTable::PrefixMayMatch: no need to find data block after full bloom checking Summary: Full block checking should be a good enough indication of prefix existance. No need to further check data block. This also fixes wrong results when using prefix bloom and reverse bitwise comparator. Test Plan: Will add a unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: hermanlee4, yoshinorim, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56625	2016-04-12 16:25:54 -07:00
Yueh-Hsuan Chiang	13e6c8e97a	Relax an assertion in Compaction::ShouldStopBefore Summary: In some case, it is possible to have two concesutive SST files might sharing same boundary keys. However, in the assertion in Compaction::ShouldStopBefore, it exclude such possibility. This patch fix this issue by relaxing the assertion to allow the equal case. Test Plan: rocksdb tests Reviewers: IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55875	2016-04-11 20:15:52 -07:00
Yueh-Hsuan Chiang	ae21d71e94	Fixed a bug in RocksDB Statistics where flush is considered as compaction Summary: Fixed a bug in RocksDB Statistics where flush is considered as compaction Test Plan: unit test Reviewers: sdong, IslamAbdelRahman, rven, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56271	2016-04-11 19:59:25 -07:00
Jay Edgar	2448f80375	Make sure that if use_mmap_reads is on use_os_buffer is also on Summary: The code assumes that if use_mmap_reads is on then use_os_buffer is also on. This make sense as by using memory mapped files for reading you are expecting the OS to cache what it needs. Add code to make sure the user does not turn off use_os_buffer when they turn on use_mmap_reads Test Plan: New test: DBTest.MMapAndBufferOptions Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56397	2016-04-08 14:30:15 -07:00
Andrew Kryczka	2391ef7214	Embed column family name in SST file Summary: Added the column family name to the properties block. This property is omitted only if the property is unavailable, such as when RepairDB() writes SST files. In a next diff, I will change RepairDB to use this new property for deciding to which column family an existing SST file belongs. If this property is missing, it will add it to the "unknown" column family (same as its existing behavior). Test Plan: New unit test: $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55605	2016-04-06 23:10:32 -07:00
Igor Canadi	ab4c62332e	Don't use version in the error message Summary: We use object `v` in the error message, which is not initialized if the edit is column family manipulation. This doesn't provide much useful info, so this diff is removing it. Instead, it dumps actual VersionEdit contents. Test Plan: compiles. would be great to get tests in version_set_test.cc that cover cases where a file write fails Reviewers: sdong, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56349	2016-04-06 15:00:15 -07:00
Aaron Gao	cc87075d63	No need to limit to 20 files in UpdateAccumulatedStats() if options.max_open_files=-1 Summary: There is a hardcoded constraint in our statistics collection that prevents reading properties from more than 20 SST files. This means our statistics will be very inaccurate for databases with > 20 files since additional files are just ignored. The purpose of constraining the number of files used is to bound the I/O performed during statistics collection, since these statistics need to be recomputed every time the database reopened. However, this constraint doesn't take into account the case where option "max_open_files" is -1. In that case, all the file metadata has already been read, so MaybeInitializeFileMetaData() won't incur any I/O cost. so this diff gets rid of the 20-file constraint in case max_open_files == -1. Test Plan: write into unit test db/db_properties_test.cc - "ValidateSampleNumber". We generate 20 files with 2 rows and 10 files with 1 row. If max_open_files !=-1, the `rocksdb.estimate-num-keys` should be (101 + 102)/20 * 30 = 45. Otherwise, it should be the ground truth, 50. {F1089153} Reviewers: andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56253	2016-04-01 16:19:12 -07:00
Islam AbdelRahman	8a1a603fdb	Eliminate std::deque initialization while iterating over merge operands Summary: This patch is similar to D52563, When we iterate over a DB with merge operands we keep creating std::queue to store the operands, optimize this by reusing merge_operands_ data member Before the patch ``` ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq" --db="/dev/shm/bench_merge_memcpy_on_the_fly/" --merge_operator="put" --merge_keys=10000 --num=10000 DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] mergerandom : 3.757 micros/op 266141 ops/sec; 29.4 MB/s ( updates:10000) DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.413 micros/op 2423538 ops/sec; 268.1 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.451 micros/op 2219071 ops/sec; 245.5 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.420 micros/op 2382039 ops/sec; 263.5 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.408 micros/op 2452017 ops/sec; 271.3 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] mergerandom : 3.947 micros/op 253376 ops/sec; 28.0 MB/s ( updates:10000) DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.441 micros/op 2266473 ops/sec; 250.7 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.471 micros/op 2122033 ops/sec; 234.8 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.440 micros/op 2271407 ops/sec; 251.3 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.429 micros/op 2331471 ops/sec; 257.9 MB/s ``` with the patch ``` ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq" --db="/dev/shm/bench_merge_memcpy_on_the_fly/" --merge_operator="put" --merge_keys=10000 --num=10000 DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] mergerandom : 4.080 micros/op 245092 ops/sec; 27.1 MB/s ( updates:10000) DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.308 micros/op 3241843 ops/sec; 358.6 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.312 micros/op 3200408 ops/sec; 354.0 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.332 micros/op 3013962 ops/sec; 333.4 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.300 micros/op 3328017 ops/sec; 368.2 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] mergerandom : 3.973 micros/op 251705 ops/sec; 27.8 MB/s ( updates:10000) DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.320 micros/op 3123752 ops/sec; 345.6 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.335 micros/op 2986641 ops/sec; 330.4 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.339 micros/op 2950047 ops/sec; 326.4 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.319 micros/op 3131565 ops/sec; 346.4 MB/s ``` Test Plan: make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56031	2016-04-01 15:48:55 -07:00
Islam AbdelRahman	f38540b12a	WriteBatchWithIndex micro optimization Summary: - Put key offset and key size in WriteBatchIndexEntry - Use vector for comparators in WriteBatchEntryComparator I use a slightly modified version of @yoshinorim code to benchmark https://gist.github.com/IslamAbdelRahman/b120f4fba8d6ff7d58d2 For Put I create a transaction that put a 1000000 keys and measure the time spent without commit. For GetForUpdate I read the keys that I added in the Put transaction. Original time: ``` rm -rf /dev/shm/rocksdb-example/ ./txn_bench put 1000000 1000000 OK Ops \| took 3.679 seconds ./txn_bench get_for_update 1000000 1000000 OK Ops \| took 3.940 seconds ``` New Time ``` rm -rf /dev/shm/rocksdb-example/ ./txn_bench put 1000000 1000000 OK Ops \| took 2.727 seconds ./txn_bench get_for_update 1000000 1000000 OK Ops \| took 3.880 seconds ``` It looks like there is no significant improvement in GetForUpdate() but we can see ~30% improvement in Put() Test Plan: unittests Reviewers: yhchiang, anthony, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D55539	2016-04-01 15:23:46 -07:00
Marton Trencseni	9b51987521	Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. Test Plan: 'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK. I didn't run the Java tests, I don't have Java set up on my devserver. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56133	2016-04-01 10:42:39 -07:00
sdong	2feafa3db9	Change some RocksDB default options Summary: Change some RocksDB default options to make it more friendly to server workloads. Test Plan: Run all existing tests Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: sumeet, muthu, benj, MarkCallaghan, igor, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55941	2016-03-31 17:12:18 -07:00
Sandeep Joshi	63e8f1b55b	Formatted lines to adhere to 80 char limit	2016-03-31 08:26:55 +05:30
Sandeep Joshi	24420947d9	Replace kHeader by WriteBatchInternal::kHeader in few more places kHeader was moved from write_batch.cc to header file because it is being used wherever the number "12" was being used to check for record size	2016-03-30 23:13:00 +05:30
Sandeep Joshi	3bdbe89614	Merge branch 'magic12' of https://github.com/dcengines/rocksdb into magic12	2016-03-30 23:09:39 +05:30
zensan	78711524b7	In all the places where log records are read, there was a check that record.size() should not be less than 12. This "magic number" seems to be the WriteBatch header (8 byte sequence and 4 byte count). Replaced all the places where "12" was used by WriteBatchInternal::kHeader.	2016-03-30 23:05:22 +05:30
Islam AbdelRahman	99ffb3d533	Fix perf_context::merge_operator_time_nanos calculation Summary: We were not measuring the time spent in merge_operator when called from Version::Get() Test Plan: added a unittest Reviewers: sdong, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55905	2016-03-25 18:29:43 -07:00
Yueh-Hsuan Chiang	ad2fdaa823	Correct a typo in a comment Summary: Correct a typo in a comment Test Plan: No code change. Reviewers: sdong, kradhakrishnan, IslamAbdelRahman Reviewed By: kradhakrishnan, IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55803	2016-03-24 19:39:13 -07:00
Yueh-Hsuan Chiang	be9816b3d9	Fix data race issue when sub-compaction is used in CompactionJob Summary: When subcompaction is used, all subcompactions share the same Compaction pointer in CompactionJob while each subcompaction all keeps their mutable stats in SubcompactionState. However, there're still some mutable part that is currently store in the shared Compaction pointer. This patch makes two changes: 1. Make the shared Compaction pointer const so that it can never be modified during the compaction. 2. Move necessary states from Compaction to SubcompactionState. 3. Make functions of Compaction const if the function does not modify its internal state. Test Plan: rocksdb and MyRocks test Reviewers: sdong, kradhakrishnan, andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, yoshinorim, gunnarku, leveldb Differential Revision: https://reviews.facebook.net/D55923	2016-03-24 19:36:39 -07:00
Praveen Rao	583157f710	Avoid overloaded virtual function	2016-03-22 17:10:31 -07:00
Praveen Rao	136b8e0cad	Merge from master	2016-03-22 12:38:44 -07:00
Praveen Rao	2dcbb3b4f3	Addressed review comments	2016-03-22 12:07:15 -07:00
sdong	b1fafcaca6	Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature." This reverts commit `522de4f59e`. It has bug of index block cleaning up.	2016-03-21 11:50:42 -07:00
agiardullo	fbbb8a6144	Add test for Snapshot 0 Summary: I ran into this assert when stress testing transactions. It's pretty easy to repro. Changing VersionSet::last_sequence_ to start at 1 seems pretty straightforward. We would just need to change the 4 callers of SetLastSequence(), including recovery code. I'd make this change myself, but I do not have enough time to test changes to recovery code-paths this week. But checking in this test case (disabled) for future fixing. Test Plan: n/a Reviewers: yhchiang, kradhakrishnan, andrewkr, anthony, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55311	2016-03-18 16:16:20 -07:00
Andrew Kryczka	e182f03c1e	Add unit tests for RepairDB Summary: Basic test cases: - Manifest is lost or corrupt - Manifest refers to too many or too few SST files - SST file is corrupt - Unflushed data is present when RepairDB is called Depends on D55065 for its CreateFile() function in file_utils Test Plan: Ran the tests. Reviewers: IslamAbdelRahman, yhchiang, yoshinorim, sdong Reviewed By: sdong Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55485	2016-03-18 15:18:42 -07:00
Praveen Rao	7d371863e5	travis build fixes	2016-03-18 14:43:22 -07:00
Praveen Rao	4f1c74a46e	merge from master	2016-03-18 12:48:01 -07:00
Praveen Rao	f8c2189307	Publish log numbers for column family to wal_filter, and provide log number in the record callback	2016-03-18 12:32:15 -07:00
Marton Trencseni	44756260ae	Reset block cache in failing unit test. Test Plan: make -j40 check OPT=-g, on both /tmp and /dev/shm Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55701	2016-03-18 06:13:54 +00:00
Marton Trencseni	522de4f59e	Adding pin_l0_filter_and_index_blocks_in_cache feature. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction. Test Plan: Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false). DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32 Mac: OK. Linux: with D55287 patched in it's OK. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54801	2016-03-17 22:40:01 +00:00
dhruba borthakur	33d568611d	Merge pull request #1040 from bureau14/master Fixes warnings and ensure correct int behavior on 32-bit platforms.	2016-03-17 03:23:30 -07:00
Edouard A	02e62ebbc8	Fixes warnings and ensure correct int behavior on 32-bit platforms.	2016-03-16 22:57:57 +01:00
Islam AbdelRahman	3ff98bd209	Fix no compression test Summary: DBBlockCacheTest.TestWithCompressedBlockCache is depending on compression using snappy, so this test fail when snappy is not available block this test when we don't have snappy https://ci-builds.fb.com/view/rocksdb/job/rocksdb_no_compression/833/console Test Plan: run the test when compression libraries are not avaliable Reviewers: sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55413	2016-03-15 12:17:40 -07:00
Dhruba Borthakur	1a2cc27e01	ColumnFamilyOptions SanitizeOptions is buggy on 32-bit platforms. Summary: The pre-existing code is trying to clamp between 65,536 and 0, resulting in clamping to 65,536, resulting in very small buffers, resulting in ShouldFlushNow() being true quite easily, resulting in assertion failing and database performance being "not what it should be". https://github.com/facebook/rocksdb/issues/1018 Test Plan: make check Reviewers: sdong, andrewkr, IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55455	2016-03-14 16:21:54 -07:00
sdong	b2ae5950ba	Index Reader should not be reused after DB restart Summary: In block based table reader, wow we put index reader to block cache, which can be retrieved after DB restart. However, index reader may reference internal comparator, which can be destroyed after DB restarts, causing problems. Fix it by making cache key identical per table reader. Test Plan: Add a new test which failed with out the commit but now pass. Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: maro, yhchiang, kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55287	2016-03-14 10:04:09 -07:00
Islam AbdelRahman	580fede347	Aggregate hot Iterator counters in LocalStatistics (DBIter::Next perf regression) Summary: This patch bump the counters in the frequent code path DBIter::Next() / DBIter::Prev() in a local data members and send them to Statistics when the iterator is destroyed A better solution will be to have thread_local implementation for Statistics New performance ``` readseq : 0.035 micros/op 28597881 ops/sec; 3163.7 MB/s 1,851,568,819 stalled-cycles-frontend # 31.29% frontend cycles idle [49.86%] 884,929,823 stalled-cycles-backend # 14.95% backend cycles idle [50.21%] readreverse : 0.071 micros/op 14077393 ops/sec; 1557.3 MB/s 3,239,575,993 stalled-cycles-frontend # 27.36% frontend cycles idle [49.96%] 1,558,253,983 stalled-cycles-backend # 13.16% backend cycles idle [50.14%] ``` Existing performance ``` readreverse : 0.174 micros/op 5732342 ops/sec; 634.1 MB/s 20,570,209,389 stalled-cycles-frontend # 70.71% frontend cycles idle [50.01%] 18,422,816,837 stalled-cycles-backend # 63.33% backend cycles idle [50.04%] readseq : 0.119 micros/op 8400537 ops/sec; 929.3 MB/s 15,634,225,844 stalled-cycles-frontend # 79.07% frontend cycles idle [49.96%] 14,227,427,453 stalled-cycles-backend # 71.95% backend cycles idle [50.09%] ``` Test Plan: unit tests Reviewers: yhchiang, sdong, igor Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55107	2016-03-11 19:01:12 -08:00
Baris Yazici	e8e6cf0173	fix: handle_fatal_signal (sig=6) in std::vector<std::string, std::allocator<std::string> >::_M_range_check \| c++/4.8.2/bits/stl_vector.h:794 #174 Summary: Fix for https://github.com/facebook/mysql-5.6/issues/174 When there is no old files to purge, vector.at(i) function was crashing if (old_info_log_file_count != 0 && old_info_log_file_count >= db_options_.keep_log_file_num) { std::sort(old_info_log_files.begin(), old_info_log_files.end()); size_t end = old_info_log_file_count - db_options_.keep_log_file_num; for (unsigned int i = 0; i <= end; i++) { std::string& to_delete = old_info_log_files.at(i); Added check to old_info_log_file_count be non zero. Test Plan: run existing tests Reviewers: gunnarku, vasilep, sdong, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, webscalesql-eng, dhruba Differential Revision: https://reviews.facebook.net/D55245	2016-03-11 11:11:45 -08:00
Andrew Kryczka	d9620239d2	Cleanup stale manifests outside of full purge Summary: - Keep track of obsolete manifests in VersionSet - Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles() - Added test case that verifies a stale manifest is deleted by a non-full purge Test Plan: $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation Reviewers: IslamAbdelRahman, yoshinorim, sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55269	2016-03-10 18:16:21 -08:00
Yi Wu	f71fc77b7c	Cache to have an option to fail Cache::Insert() when full Summary: Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error. I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed. Test Plan: make check -j32, see tests pass. Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54705	2016-03-10 17:35:19 -08:00
Yueh-Hsuan Chiang	765597fa78	Update compaction score right after CompactFiles forms a compaction Summary: This is a follow-up patch of https://reviews.facebook.net/D54891. As the information about files being compacted will also be used when making compaction decision, it is necessary to update the compaction score when a compaction plan has been made but not yet execute. This patch adds a missing call to update the compaction score in CompactFiles(). Test Plan: compact_files_test Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55227	2016-03-10 14:34:28 -08:00
Yueh-Hsuan Chiang	aa3f02d50c	Improve comment in compaction.h and compaction_picker.h Summary: ReleaseCompactionFiles must be called when DB mutex is held, but the documentation is mission. Test Plan: no code change Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54987	2016-03-08 16:46:41 -08:00
sdong	294bdf9ee2	Change Property name from "rocksdb.current_version_number" to "rocksdb.current-super-version-number" Summary: I realized I again is wrong about the naming convention. Let me change it to the correct one. Test Plan: Run unit tests. Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55041	2016-03-04 18:15:29 -08:00
Yueh-Hsuan Chiang	a7d4eb2f34	Fix a bug where flush does not happen when a manual compaction is running Summary: Currently, when rocksdb tries to run manual compaction to refit data into a level, there's a ReFitLevel() process that requires no bg work is currently running. When RocksDB plans to ReFitLevel(), it will do the following: 1. pause scheduling new bg work. 2. wait until all bg work finished 3. do the ReFitLevel() 4. unpause scheduling new bg work. However, as it pause scheduling new bg work at step one and waiting for all bg work finished in step 2, RocksDB will stop flushing until all bg work is done (which could take a long time.) This patch fix this issue by changing the way ReFitLevel() pause the background work: 1. pause scheduling compaction. 2. wait until all bg work finished. 3. pause scheduling flush 4. do ReFitLevel() 5. unpause both flush and compaction. The major difference is that. We only pause scheduling compaction in step 1 and wait for all bg work finished in step 2. This prevent flush being blocked for a long time. Although there's a very rare case that ReFitLevel() might be in starvation in step 2, but it's less likely the case as flush typically finish very fast. Test Plan: existing test. Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55029	2016-03-04 14:24:52 -08:00
Islam AbdelRahman	dfe96c72c3	Fix WriteLevel0TableForRecovery file delete protection Summary: The call to ``` CaptureCurrentFileNumberInPendingOutputs() ``` should be before ``` versions_->NewFileNumber() ``` Right now we are not actually protecting the file from being deleted Test Plan: make check Reviewers: sdong, anthony, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D54645	2016-03-03 18:25:07 -08:00
sdong	ef204df7ef	Compaction always needs to be removed from level0_compactions_in_progress_ for universal compaction Summary: We always put compaction to level0_compactions_in_progress_ for universal compaction, so we should also remove it. The bug causes assert failure when running manual compaction. Test Plan: TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom,compact --subcompactions=16 --compaction_style=1 always fails on my host. After the fix, it doesn't fail any more. Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55017	2016-03-02 21:23:28 -08:00
sdong	e79ad9e184	Add Iterator Property rocksdb.iterator.version_number Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number. Test Plan: Add two unit tests for it. Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54921	2016-03-02 16:23:59 -08:00
sdong	19ea40f8b6	Subcompaction boundary keys should not terminate after an empty level Summary: Now we skip to add boundary keys to subcompaction candidates since we see an empty level. This makes subcompaction almost disabled for universal compaction. We should consider all files instead. Test Plan: Run existing tests. Reviewers: IslamAbdelRahman, andrewkr, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55005	2016-03-02 15:45:07 -08:00
sdong	74b660702e	Rename iterator property "rocksdb.iterator.is.key.pinned" => "rocksdb.iterator.is-key-pinned" Summary: Rename iterator property to folow property naming convention. Test Plan: Run all existing tests. Reviewers: andrewkr, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54957	2016-03-01 13:47:12 -08:00
Islam AbdelRahman	6743135ea1	Fix DB::AddFile() issue when PurgeObsoleteFiles() is called Summary: In some situations the DB will scan all existing files in the DB path and delete the ones that are Obsolete. If this happen during adding an external sst file. this could cause the file to be deleted while we are adding it. This diff fix this issue Test Plan: unit test to reproduce the bug existing unit tests Reviewers: sdong, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D54627	2016-03-01 12:05:29 -08:00
sdong	432f3adf2c	Add DB Property "rocksdb.current_version_number" Summary: Add a DB Property "rocksdb.current_version_number" for users to monitor version changes and stale iterators. Test Plan: Add a unit test. Reviewers: andrewkr, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54927	2016-03-01 10:55:40 -08:00
sdong	b5b1db167a	Recompute compaction score after scheduling manual compaction Summary: After we made manual compaction runnable concurrently with automaticallly compaction, we need to run ComputeCompactionScore() to prepare a coming compaction picking call before the compaction finishes. Test Plan: Run existing tests. Reviewers: yhchiang, IslamAbdelRahman, andrewkr, kradhakrishnan, anthony, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54891	2016-02-29 17:17:51 -08:00
sdong	1f5954147b	Introduce Iterator::GetProperty() and replace Iterator::IsKeyPinned() Summary: Add Iterator::GetProperty(), a way for users to communicate with iterator, and turn Iterator::IsKeyPinned() with it. As a follow-up, I'll ask a property as the version number attached to the iterator Test Plan: Rerun existing tests and add a negative test case. Reviewers: yhchiang, andrewkr, kradhakrishnan, anthony, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54783	2016-02-29 14:01:31 -08:00
Andrew Kryczka	69c471bd9b	Handle concurrent manifest update and backup creation Summary: Fixed two related race conditions in backup creation. (1) CreateNewBackup() uses DB::DisableFileDeletions() to prevent table files from being deleted while it is copying; however, the MANIFEST file could still rotate during this time. The fix is to stop deleting the old manifest in the rotation logic. It will be deleted safely later when PurgeObsoleteFiles() runs (can only happen when file deletions are enabled). (2) CreateNewBackup() did not account for the CURRENT file being mutable. This is significant because the files returned by GetLiveFiles() contain a particular manifest filename, but the manifest to which CURRENT refers can change at any time. This causes problems when CURRENT changes between the call to GetLiveFiles() and when it's copied to the backup directory. To workaround this, I manually forge a CURRENT file referring to the manifest filename returned in GetLiveFiles(). (2) also applies to the checkpointing code, so let me know if this approach is good and I'll make the same change there. Test Plan: new test for roll manifest during backup creation. running the test before this change: $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation ... IO error: /tmp/rocksdbtest-9383/backupable_db/MANIFEST-000001: No such file or directory running the test after this change: $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation ... [ RUN ] BackupableDBTest.ChangeManifestDuringBackupCreation [ OK ] BackupableDBTest.ChangeManifestDuringBackupCreation (2836 ms) Reviewers: IslamAbdelRahman, anthony, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54711	2016-02-29 12:56:55 -08:00
sdong	8800975fb0	Make DBTestUniversalCompaction.IncreaseUniversalCompactionNumLevels more robust Summary: Based on thread scheduling, DBTestUniversalCompaction.IncreaseUniversalCompactionNumLevels can fail to flush enough files to trigger expected compactions. Fix it by waiting for flush after inserting each key. There are failrue reported: db/db_universal_compaction_test.cc:1134: Failure Expected: (NumTableFilesAtLevel(options.num_levels - 1, 1)) > (0), actual: 0 vs 0 but I can't repro it. Try to fix the bug and see whether it goes away. Test Plan: Run the test multiple time. Reviewers: IslamAbdelRahman, anthony, andrewkr, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54747	2016-02-26 11:59:31 -08:00
sdong	82f15fb15d	Add test to make sure DropColumnFamily doesn't impact existing iterators Summary: Add a test case in ColumnFamilyTest.ReadDroppedColumnFamily to make sure existing iterator is not impacted by column family dropping. Test Plan: N/A Reviewers: igor, yhchiang, anthony, andrewkr, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54657	2016-02-24 10:25:38 -08:00
sdong	38201b3599	Fix assert failure when DBImpl::SyncWAL() conflicts with log rolling Summary: DBImpl::SyncWAL() releases db mutex before calling DBImpl::MarkLogsSynced(), while inside DBImpl::MarkLogsSynced() we assert there is none or one outstanding log file. However, a memtable switch can happen in between and causing two or outstanding logs there, failing the assert. The diff adds a unit test that repros the issue and fix the assert so that the unit test passes. Test Plan: Run the new tests. Reviewers: anthony, kolmike, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54621	2016-02-23 11:42:15 -08:00
Andrew Kryczka	b046916656	Redo SyncPoints for flush while rolling test Summary: There was a race condition in the test where the rolling thread acquired the mutex before the flush thread pinned the logger. Rather than add more complicated synchronization to fix it, I followed Siying's suggestion to use SyncPoint in the test code. Comments in the LoadDependency() invocation explain the reason for each of the sync points. Test Plan: Ran test 1000 times for tsan/asan. Will wait for all sandcastle tests to finish before committing since this is a tricky test. Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54615	2016-02-22 21:32:19 -08:00
Mike Kolupaev	eef63ef807	Fixed CompactFiles() spuriously failing or corrupting DB Summary: We started getting two kinds of crashes since we started using `DB::CompactFiles()`: (1) `CompactFiles()` fails saying something like "/data/logdevice/4440/shard12/012302.sst: No such file or directory", and presumably makes DB read-only, (2) DB fails to open saying "Corruption: Can't access /267000.sst: IO error: /data/logdevice/4440/shard1/267000.sst: No such file or directory". AFAICT, both can be explained by background thread deleting compaction output as "obsolete" while it's being written, before it's committed to manifest. If it ends up committed to the manifest, we get (2); if compaction notices the disappearance and fails, we get (1). The internal tasks t10068021 and t10134177 have some details about the investigation that led to this. Test Plan: `make -j check`; the new test fails to reopen the DB without the fix Reviewers: yhchiang Reviewed By: yhchiang Subscribers: dhruba, sdong Differential Revision: https://reviews.facebook.net/D54561	2016-02-22 13:54:58 -08:00
Dmitri Smirnov	d37d348da8	This addresses build issues on Windows https://github.com/facebook/rocksdb/issues/1002	2016-02-19 12:29:54 -08:00
Andrew Kryczka	d825fc70d4	Use condition variable in log roller test Summary: Previously I just slept until the flush_thread was "probably" ready since proper synchronization in test cases seemed like overkill. But then tsan complained about it, so I did the synchronization (mostly) properly now. Test Plan: $ COMPILE_WITH_TSAN=1 make -j32 auto_roll_logger_test $ ./auto_roll_logger_test Reviewers: anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54399	2016-02-18 18:03:53 -08:00
Islam AbdelRahman	df9ba6df62	Introduce SstFileManager::SetMaxAllowedSpaceUsage() to cap disk space usage Summary: Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB. When this limit is exceeded WriteImpl() will fail and return Status::Aborted() Test Plan: unit testing Reviewers: yhchiang, anthony, andrewkr, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53763	2016-02-17 15:20:23 -08:00
Andrew Kryczka	3943d16780	Fix race conditions in auto-rolling logger Summary: For GetLogFileSize() and Flush(), they previously did not follow the synchronization pattern for accessing logger_. This meant ResetLogger() could cause logger_ destruction while the unsynchronized functions were accessing it, causing a segfault. Also made the mutex instance variable mutable so we can preserve GetLogFileSize()'s const-ness. Test Plan: new test case, it's quite ugly because both threads need to access one of the functions with SyncPoints (PosixLogger::Flush()), and also special handling is needed to prevent the mutex and sync points from conflicting. Reviewers: kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54237	2016-02-17 12:06:45 -08:00
reid horuff	a7b6f0748a	Improve write_with_callback_test to sync WAL Summary: Currently write_with_callback_test does not test with WAL syncing enabled. This addresses that. Test Plan: write_with_callback_test Reviewers: anthony Reviewed By: anthony Subscribers: leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D54255	2016-02-16 14:04:14 -08:00
reid horuff	5bcf952a87	Fix WriteImpl empty batch hanging issue Summary: There is an issue in DBImpl::WriteImpl where if an empty writebatch comes in and sync=true then the logs will be marked as being synced yet the sync never actually happens because there is no data in the writebatch. This causes the next incoming batch to hang while waiting for the logs to complete syncing. This fix syncs logs even if the writebatch is empty. Test Plan: DoubleEmptyBatch unit test in transaction_test. Reviewers: yoshinorim, hermanlee4, sdong, ngbronson, anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54057	2016-02-16 12:21:33 -08:00
Mike Kolupaev	44371501f0	Fixed a segfault when compaction fails Summary: We've hit it today. Test Plan: `make -j check`; didn't reproduce the issue Reviewers: yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D54219	2016-02-16 11:11:16 -08:00
Jonathan Wiepert	7bd284c374	Separeate main from bench functionality to allow cusomizations Summary: Isolate db_bench functionality from main so custom benchmark code can be written and managed Test Plan: Tested commands ./build_tools/regression_build_test.sh ./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 ./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --reads=500 --writes=500 ./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --merge_keys=100 --numdistinct=100 --num_column_families=3 --num_hot_column_families=1 ./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5 ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5 ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5 ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5 Test Results - https://phabricator.fb.com/P56130387 Additional tests for: ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5 --key_size=8 --merge_operator=put ./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5 --merge_operator=uint64add Results: https://phabricator.fb.com/P56130607 Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53991	2016-02-16 06:17:31 -08:00
sdong	92a9ccf1a6	Add a new compaction priority that picks file whose overlapping ratio is smallest Summary: Add a new compaction priority as following: For every file, we calculate total size of files overalapping with the file in the next level, over the file's size itself. The file with smallest ratio will be picked first. My "db_bench --fillrandom" shows about 5% less compaction than kOldestSmallestSeqFirst if --hard_pending_compaction_bytes_limit value to keep LSM tree in shape. If not limiting hard_pending_compaction_bytes_limit, improvement is only 1% or 2%. Test Plan: Add a unit test Reviewers: andrewkr, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54075	2016-02-11 15:59:19 -08:00
Peter Mattis	239aaf2fc0	Use user_comparator when comparing against iterate_upper_bound. Fixes #983.	2016-02-11 08:47:16 -05:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang	4a8cbf4e31	Allows Get and MultiGet to read directly from SST files. Summary: Add kSstFileTier to ReadTier, which allows Get and MultiGet to read only directly from SST files and skip mem-tables. kSstFileTier = 0x2 // data in SST files. // Note that this ReadTier currently only supports // Get and MultiGet and does not support iterators. Test Plan: add new test in db_test. Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: igor, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53511	2016-02-09 11:20:22 -08:00
sdong	a76e9093f0	Fix LITE db_test build broken by previous commit Summary: Previous commit introduces a test that is not supported in LITE. Fix it. Test Plan: Build the test with ROCKSDB_LITE. Reviewers: kradhakrishnan, IslamAbdelRahman, anthony, yhchiang, andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53901	2016-02-05 14:29:09 -08:00
sdong	b1887c5dd9	Explictly fail when memtable doesn't support concurrent insert Summary: If users turn on concurrent insert but the memtable doesn't support it, they might see unexcepted crash. Fix it by explicitly fail. Test Plan: Run different setting of stress_test and make sure it fails correctly. Will add a unit test too. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, andrewkr, ngbronson Reviewed By: ngbronson Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53895	2016-02-05 14:15:50 -08:00
reid horuff	6f71d3b68b	Improve perf of Pessimistic Transaction expirations (and optimistic transactions) Summary: copy from task 8196669: 1) Optimistic transactions do not support batching writes from different threads. 2) Pessimistic transactions do not support batching writes if an expiration time is set. In these 2 cases, we currently do not do any write batching in DBImpl::WriteImpl() because there is a WriteCallback that could decide at the last minute to abort the write. But we could support batching write operations with callbacks if we make sure to process the callbacks correctly. To do this, we would first need to modify write_thread.cc to stop preventing writes with callbacks from being batched together. Then we would need to change DBImpl::WriteImpl() to call all WriteCallback's in a batch, only write the batches that succeed, and correctly set the state of each batch's WriteThread::Writer. Test Plan: Added test WriteWithCallbackTest to write_callback_test.cc which creates multiple client threads and verifies that writes are batched and executed properly. Reviewers: hermanlee4, anthony, ngbronson Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52863	2016-02-05 10:44:13 -08:00
Islam AbdelRahman	8e6172bc57	Add BlockBasedTableOptions::index_block_restart_interval Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block Test Plan: unit tests Reviewers: yhchiang, anthony, andrewkr, sdong Reviewed By: sdong Subscribers: march, dhruba Differential Revision: https://reviews.facebook.net/D53721	2016-02-05 10:22:37 -08:00
Nathan Bronson	2c1db5ea51	always invalidate sequential-insertion cache for concurrent skiplist adds Summary: InlineSkipList::InsertConcurrently should invalidate the sequential-insertion cache prev_[] for all inserts of multi-level nodes, not just those that increase the height of the skip list. The invariant for prev_ is that prev_[i] (i > 0) is supposed to be the predecessor of prev_[0] at level i. Before this diff InsertConcurrently could violate this constraint when inserting a multi-level node after prev_[i] but before prev_[0]. This diff also reenables kConcurrentSkipList as db_test's MultiThreaded/MultiThreadedDBTest.MultiThreaded/29. Test Plan: 1. unit tests 2. temporarily hack kConcurrentSkipList timing so that it is fast but has a 1.5% failure rate on my dev box (1ms stagger on thread launch, 1s test duration, failure rate baseline over 1000 runs) 3. observe 1000 passes post-fix Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D53751	2016-02-03 11:08:16 -08:00
Andrew Kryczka	284aa613a7	Eliminate duplicated property constants Summary: Before this diff, there were duplicated constants to refer to properties (user- facing API had strings and InternalStats had an enum). I noticed these were inconsistent in terms of which constants are provided, names of constants, and documentation of constants. Overall it seemed annoying/error-prone to maintain these duplicated constants. So, this diff gets rid of InternalStats's constants and replaces them with a map keyed on the user-facing constant. The value in that map contains a function pointer to get the property value, so we don't need to do string matching while holding db->mutex_. This approach has a side benefit of making many small handler functions rather than a giant switch-statement. Test Plan: db_properties_test passes, running "make commit-prereq -j32" Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53253	2016-02-02 19:14:56 -08:00
Nathan Bronson	5fcd1ba30a	disable kConcurrentSkipList multithreaded test Summary: Disable test that is intermittently failing Test Plan: unit tests Reviewers: igor, andrewkr, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53715	2016-02-02 18:24:47 -08:00
Tomas Kolda	a62c519bb6	RollLogFile tries to find non conflicting file until there is no conflict.	2016-02-02 10:33:49 +01:00
Tomas Kolda	57a95a7001	Making use of GetSystemTimePreciseAsFileTime dynamic - code review fixes	2016-02-02 10:23:56 +01:00
Tomas Kolda	502d41f150	Making use of GetSystemTimePreciseAsFileTime dynamic to not break compatibility with Windows 7. The issue with rotated logs was fixed other way.	2016-02-02 10:23:56 +01:00
Nathan Bronson	9c2cf9479b	Fix for --allow_concurrent_memtable_write with batching Summary: Concurrent memtable adds were incorrectly computing the last sequence number for a write batch group when the write batches were not solitary. This is the cause of https://github.com/facebook/mysql-5.6/issues/155 Test Plan: 1. unit tests 2. new unit test 3. parallel db_bench stress tests with batch size of 10 and asserts enabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: IslamAbdelRahman, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D53595	2016-02-01 20:41:57 -08:00
Siying Dong	7b943da1b2	Merge pull request #967 from SherlockNoMad/ValueSize Add histogram for value size per operation	2016-02-01 17:59:08 -08:00
Islam AbdelRahman	1ad8182950	Fix WriteBatchTest.ManyUpdates, WriteBatchTest.LargeKeyValue under clang Summary: Fix current clang failure https://ci-builds.fb.com/view/rocksdb/job/rocksdb_clang_build/1398/console Test Plan: make sure that both clang and g++ compilation succeed USE_CLANG=1 make check -j64 make check -j64 Reviewers: anthony, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53667	2016-02-01 16:07:53 -08:00
sdong	ad7ecca72d	Add unit tests to verify large key/value Summary: Add unit tests: (1) insert entries of 8MB key and 3GB value to DB (2) insert entry of 3GB key and 3GB value into write batch and make sure we can read it. (3) insert 3 billions of key-value pairs into write batch and make sure we can read it. Disable them because not all platform can run it. Test Plan: Run the tests Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, andrewkr, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53619	2016-02-01 15:09:50 -08:00
Andrew Kryczka	fdd70d1495	Skip filters for last L0 file if hit-optimized Summary: Following up on D53493, we can still enable the filter-skipping optimization for last file in L0. It's correct to assume the key will be present in the last L0 file when we're hit-optimized and L0 is deepest. The FilePicker encapsulates the state for traversing each level's files, so I needed to make it expose whether the returned file is last in its level. Test Plan: verified below test fails before this patch and passes afterwards. The change to how the test memtable is populated is needed so file 1 has keys (0, 30, 60), file 2 has keys (10, 40, 70), etc. $ ./db_universal_compaction_test --gtest_filter=UniversalCompactionNumLevels/DBTestUniversalCompaction.OptimizeFiltersForHits/* Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53583	2016-02-01 14:58:46 -08:00
Dmytro Ivchenko	aa5e3b7c04	PerfContext::ToString() add option to exclude zero counters Test Plan: Added unit test to check w/ w/o zeros scenarios Reviewers: yhchiang Reviewed By: yhchiang Subscribers: sdong, dhruba Differential Revision: https://reviews.facebook.net/D52809	2016-02-01 13:41:13 -08:00
Yueh-Hsuan Chiang	1d854fa3d4	Fixed the asan error on column_family_test Summary: Fixed the asan error on column_family_test caused by not disabling SyncPoint. Test Plan: column_family_test Reviewers: anthony, rven, kradhakrishnan, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53505	2016-02-01 12:45:45 -08:00
SherlockNoMad	37159a6448	Add histogram for value size per operation	2016-01-31 18:09:24 -08:00
Venkatesh Radhakrishnan	3b2a1ddd2e	Add options.base_background_compactions as a number of compaction threads for low compaction debt Summary: If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions. The watermarks are calculated based on slowdown thresholds. Test Plan: Add new test cases in column_family_test. Adding more unit tests. Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony Reviewed By: anthony Subscribers: leveldb, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D53409	2016-01-29 16:15:53 -08:00
sdong	6ee38bb15c	Slowdown of writing to the last memtable should not override stopping Summary: Now slowing down for the last mem table takes priority against some stopping conditions. This is logically confusing. Fix it. Test Plan: Run all existing tests. Reviewers: yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53529	2016-01-28 21:52:29 -08:00
Islam AbdelRahman	d6c838f1e1	Add SstFileManager (component tracking all SST file in DBs and control the deletion rate) Summary: Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler SstFileTracker can be used later to abort writes when we exceed a specific size Test Plan: unit tests Reviewers: rven, anthony, yhchiang, sdong Reviewed By: sdong Subscribers: igor, lovro, march, dhruba Differential Revision: https://reviews.facebook.net/D50469	2016-01-28 18:35:01 -08:00
sdong	4b50f13540	Should not skip bloom filter for L0 during the query. Summary: It's a regression bug caused by `e089db40f9`. With the change, if options.optimize_filters_for_hits=true and there are only L0 files (like single level universal compaction), we skip all the files in L0, which is more than necessary. Fix it by always trying to query bloom filter for files in level 0. Test Plan: Add a unit test for it. Reviewers: anthony, rven, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53493	2016-01-27 16:16:39 -08:00
sdong	d20915d52a	Disable stats about mutex duration by default Summary: Measuring mutex duration will measure time inside DB mutex, which breaks our best practice. Add a stat level in Statistics class. By default, disable to measure the mutex operations. Test Plan: Add a unit test to make sure it is off by default. Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, andrewkr, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53367	2016-01-26 14:56:55 -08:00
Islam AbdelRahman	0c433cd1eb	Fix issue in Iterator::Seek when using Block based filter block with prefix_extractor Summary: Similar to D53385 we need to check InDomain before checking the filter block. Test Plan: unit tests Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53421	2016-01-26 14:47:42 -08:00

1 2 3 4 5 ...

2310 Commits