rocksdb

Author	SHA1	Message	Date
sdong	f1b9f804e9	Add a mode to always pick the oldest file to compact for each level Summary: Add options.compaction_pri, which specifies the policy about which file to compact first. kCompactionPriByLargestSeq will compact oldest files first. Verified the behavior in db_bench but did not write unit tests yet. Also need to make it settable through option string and dynamically changeable. Test Plan: Will write unit tests Reviewers: igor, rven, anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, MarkCallaghan Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D45951	2015-09-21 17:21:59 -07:00
Igor Canadi	a7e80379b0	LogAndApply() should fail if the column family has been dropped Summary: This patch finally fixes the ColumnFamilyTest.ReadDroppedColumnFamily test. The test has been failing very sporadically and it was hard to repro. However, I managed to write a new tests that reproes the failure deterministically. Here's what happens: 1. We start the flush for the column family 2. We check if the column family was dropped here: `a3fc49bfdd/db/flush_job.cc (L149)` 3. This check goes through, ends up in InstallMemtableFlushResults() and it goes into LogAndApply() 4. At about this time, we start dropping the column family. Dropping the column family process gets to LogAndApply() at about the same time as LogAndApply() from flush process 5. Drop column family goes through LogAndApply() first, marking the column family as dropped. 6. Flush process gets woken up and gets a chance to write to the MANIFEST. However, this is where it gets stuck: `a3fc49bfdd/db/version_set.cc (L1975)` 7. We see that the column family was dropped, so there is no need to write to the MANIFEST. We return OK. 8. Flush gets OK back from LogAndApply() and it deletes the memtable, thinking that the data is now safely persisted to sst file. The fix is pretty simple. Instead of OK, we return ShutdownInProgress. This is not really true, but we have been using this status code to also mean "this operation was canceled because the column family has been dropped". The fix is only one LOC. All other code is related to tests. I added a new test that reproes the failure. I also moved SleepingBackgroundTask to util/testutil.h (because I needed it in column_family_test for my new test). There's plenty of other places where we reimplement SleepingBackgroundTask, but I'll address that in a separate commit. Test Plan: 1. new test 2. make check 3. Make sure the ColumnFamilyTest.ReadDroppedColumnFamily doesn't fail on Travis: https://travis-ci.org/facebook/rocksdb/jobs/79952386 Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46773	2015-09-15 11:28:44 -07:00
Ari Ekmekji	3c37b3cccd	Determine boundaries of subcompactions Summary: Up to this point, the subcompactions that make up a compaction job have been divided based on the key range of the L1 files, and each subcompaction has handled the key range of only one file. However DBOption.max_subcompactions allows the user to designate how many subcompactions at most to perform. This patch updates the CompactionJob::GetSubcompactionBoundaries() to determine these divisions accordingly based on that option and other input/system factors. The current approach orders the starting and/or ending keys of certain compaction input files and then generates a histogram to approximate the size covered by the key range between each consecutive pair of keys. Then it groups these ranges into groups so that the sizes are approximately equal to one another. The approach has also been adapted to work for universal compaction as well instead of just for level-based compaction as it was before. These subcompactions are then executed in parallel by locally spawning threads, one for each. The results are then aggregated and the compaction completed. Test Plan: make all && make check Reviewers: yhchiang, anthony, igor, noetzli, sdong Reviewed By: sdong Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43269	2015-09-10 13:50:00 -07:00
Andres Noetzli	6bdc484fd8	Added Equal method to Comparator interface Summary: In some cases, equality comparisons can be done more efficiently than three-way comparisons. There are quite a few places in the code where we only care about equality. This patch adds an Equal() method that defaults to using the Compare() method. Test Plan: make clean all check Reviewers: rven, anthony, yhchiang, igor, sdong Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46233	2015-09-08 15:30:49 -07:00
sdong	3e0a672c50	Bug fix: table readers created by TableCache::Get() doesn't have latency histogram reported Summary: TableCache::Get() puts parameters in the wrong places so that table readers created by Get() will not have the histogram updated. Test Plan: Will write a unit test for that. Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D46035	2015-09-02 12:57:07 -07:00
Yueh-Hsuan Chiang	6996de87af	Expose per-level aggregated table properties via GetProperty() Summary: This patch adds "rocksdb.aggregated-table-properties" and "rocksdb.aggregated-table-properties-at-levelN", the former returns the aggreated table properties of a column family, while the later returns the aggregated table properties of the specified level N. Test Plan: Added tests in db_test Reviewers: igor, sdong, IslamAbdelRahman, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45087	2015-08-25 12:03:54 -07:00
sdong	07d2d34160	Add a counter about estimated pending compaction bytes Summary: Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property. In the future, we can use threshold of this counter to replace soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes will be easier for users to reason about when should slow down and stopping than more abstract soft and hard rate limits. Test Plan: Add unit tests Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44205	2015-08-20 22:17:10 -07:00
Islam AbdelRahman	027ca5b2cd	Total SST files size DB Property Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions Test Plan: Unittests for the new property Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D44799	2015-08-20 11:47:19 -07:00
sdong	72613657f0	Measure file read latency histogram per level Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled. Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44193	2015-08-14 17:32:42 -07:00
Islam AbdelRahman	cee1e8a080	Parallelize LoadTableHandlers Summary: Add a new option that all LoadTableHandlers to use multiple threads to load files on DB Open and Recover Test Plan: make check -j64 COMPILE_WITH_TSAN=1 make check -j64 DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running) Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43755	2015-08-11 12:19:56 -07:00
Andres Notzli	68f934355a	Better CompactionJob testing Summary: Changed compaction_job_test to support better/more thorough tests and added two tests. Also changed MockFileContents to order using InternalKeyComparator. Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check Reviewers: sdong, rven, igor, yhchiang, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42837	2015-08-07 21:59:51 -07:00
Yueh-Hsuan Chiang	14d0bfa429	Add DBOptions::skip_sats_update_on_db_open Summary: UpdateAccumulatedStats() is used to optimize compaction decision esp. when the number of deletion entries are high, but this function can slowdown DBOpen esp. in disk environment. This patch adds DBOptions::skip_sats_update_on_db_open, which skips UpdateAccumulatedStats() in DB::Open() time when it's set to true. Test Plan: Add DBCompactionTest.SkipStatsUpdateTest Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: tnovak, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42843	2015-08-04 13:48:16 -07:00
Ari Ekmekji	40c64434d4	Parallelize L0-L1 Compaction: Restructure Compaction Job Summary: As of now compactions involving files from Level 0 and Level 1 are single threaded because the files in L0, although sorted, are not range partitioned like the other levels. This means that during L0-L1 compaction each file from L1 needs to be merged with potentially all the files from L0. This attempt to parallelize the L0-L1 compaction assigns a thread and a corresponding iterator to each L1 file that then considers only the key range found in that L1 file and only the L0 files that have those keys (and only the specific portion of those L0 files in which those keys are found). In this way the overlap is minimized and potentially eliminated between different iterators focusing on the same files. The first step is to restructure the compaction logic to break L0-L1 compactions into multiple, smaller, sequential compactions. Eventually each of these smaller jobs will be run simultaneously. Areas to pay extra attention to are # Correct aggregation of compaction job statistics across multiple threads # Proper opening/closing of output files (make sure each thread's is unique) # Keys that span multiple L1 files # Skewed distributions of keys within L0 files Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test Reviewers: igor, noetzli, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42699	2015-08-03 11:32:14 -07:00
Andres Notzli	06aebca592	Report live data size estimate Summary: Fixes T6548822. Added a new function for estimating the size of the live data as proposed in the task. The value can be accessed through the property rocksdb.estimate-live-data-size. Test Plan: There are two unit tests in version_set_test and a simple test in db_test. make version_set_test && ./version_set_test; make db_test && ./db_test gtest_filter=GetProperty Reviewers: rven, igor, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41493	2015-07-21 21:33:20 -07:00
sdong	6e9fbeb27c	Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future. Test Plan: Run all existing unit tests. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42321	2015-07-17 16:58:18 -07:00
Igor Canadi	35ca59364c	Don't let flushes preempt compactions Summary: When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler. We have a special case that when you set max_background_flushes to 0, we schedule the flush to execute on the compaction thread. Test Plan: make check (there might be some unit tests that depend on this behavior) Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41931	2015-07-17 12:02:52 -07:00
Ari Ekmekji	74c755c552	Added JSON manifest dump option to ldb command Summary: Added a new flag --json to the ldb manifest_dump command that prints out the version edits as JSON objects for easier reading and parsing of information. Test Plan: Sample usage: ``` ./ldb manifest_dump --json --path=path/to/manifest/file ``` Sample output: ``` {"EditNumber": 0, "Comparator": "leveldb.BytewiseComparator", "ColumnFamily": 0} {"EditNumber": 1, "LogNumber": 0, "ColumnFamily": 0} {"EditNumber": 2, "LogNumber": 4, "PrevLogNumber": 0, "NextFileNumber": 7, "LastSeq": 35356, "AddedFiles": [{"Level": 0, "FileNumber": 5, "FileSize": 1949284, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0} ... {"EditNumber": 13, "PrevLogNumber": 0, "NextFileNumber": 36, "LastSeq": 290994, "DeletedFiles": [{"Level": 0, "FileNumber": 17}, {"Level": 0, "FileNumber": 20}, {"Level": 0, "FileNumber": 22}, {"Level": 0, "FileNumber": 24}, {"Level": 1, "FileNumber": 13}, {"Level": 1, "FileNumber": 14}, {"Level": 1, "FileNumber": 15}, {"Level": 1, "FileNumber": 18}], "AddedFiles": [{"Level": 1, "FileNumber": 25, "FileSize": 2114340, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 26, "FileSize": 2115213, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 27, "FileSize": 2114807, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 30, "FileSize": 2115271, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 31, "FileSize": 2115165, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 32, "FileSize": 2114683, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 35, "FileSize": 1757512, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0} ... ``` Reviewers: sdong, anthony, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D41727	2015-07-17 10:07:40 -07:00
Dmitri Smirnov	d2f0912bd3	Merge the latest changes from github/master	2015-07-02 17:23:41 -07:00
sdong	05e2831966	Allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena Summary: Try to allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena, instead of calling malloc and free. Test Plan: valgrind check Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40929	2015-06-30 17:30:38 -07:00
sdong	7842920be5	Slow down writes by bytes written Summary: We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch. The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work hard_rate_limit is deprecated. options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up. Test Plan: Add new unit tests in db_test Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor Reviewed By: igor Subscribers: ikabiljo, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36351	2015-06-11 20:42:18 -07:00
sdong	75d7075a8a	Print info message about files need compaction for debuging purpose Summary: When there are files marked for compaction after compactions, print extra messages to help debugging. Example: 2015/06/08-23:12:55.212855 7ff5013ff700 [default] [JOB 121] Generated table #75: 54 keys, 4807 bytes (need compaction) 2015/06/08-23:12:55.556194 7ff5013ff700 (Original Log Time 2015/06/08-23:12:55.556160) [default] compacted to: base level 1 max bytes base 10240 files[0 1 9 32 12 0 0 0] max score 0.96 (2 files need compaction), MB/sec: 0.0 rd, 0.1 wr, level 2, files in(1, 3) out(5) MB in(0.0, 0.0) out(0.0), read-write-amplify(11.3) write-amplify(5.7) OK, records in: 40, records dropped: 0 Test Plan: Run test and see LOG files. valgrind test DBTest.TablePropertiesNeedCompactTest Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, igor Reviewed By: igor Subscribers: yoshinorim, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39771	2015-06-09 11:23:29 -07:00
sdong	6df589b446	Add TablePropertiesCollector::NeedCompact() to suggest DB to further compact output files Summary: It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file. It can be used to allow users to suggest DB to clear up delete tombstones faster. Test Plan: Add a unit test. Reviewers: igor, yhchiang, kradhakrishnan, rven Reviewed By: rven Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39585	2015-06-05 20:18:21 -07:00
Igor Canadi	b2785472c8	Fix compile Summary: This commit broke the compile: `3ce3bb3da2` As evidenced here: https://evergreen.mongodb.com/task/mongodb_mongo_master_ubuntu1404_rocksdb_compile_ce2b1d11d42de93f7b375f7e6c41fb709f66e969_15_06_04_23_09_36 This should fix it Test Plan: make check Reviewers: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39627	2015-06-05 09:41:45 -04:00
Islam AbdelRahman	3ce3bb3da2	Allowing L0 -> L1 trivial move on sorted data Summary: This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping The conditions for trivial move have been updated Introduced conditions: - Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual) - Input level files cannot be overlapping Removed conditions: - Trivial move only run when the compaction is not manual - Input level should can contain only 1 file More context on what tests failed because of Trivial move ``` DBTest.CompactionsGenerateMultipleFiles This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1 ``` ``` DBTest.NoSpaceCompactRange This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation because trivial move does not need any extra space, and did not check for that ``` ``` DBTest.DropWrites Similar to DBTest.NoSpaceCompactRange ``` ``` DBTest.DeleteObsoleteFilesPendingOutputs This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3 ``` ``` CuckooTableDBTest.CompactionIntoMultipleFiles Same as DBTest.CompactionsGenerateMultipleFiles ``` This diff is based on a work by @sdong https://reviews.facebook.net/D34149 Test Plan: make -j64 check Reviewers: rven, sdong, igor Reviewed By: igor Subscribers: yhchiang, ott, march, dhruba, sdong Differential Revision: https://reviews.facebook.net/D34797	2015-06-04 16:51:25 -07:00
sdong	4266d4fd90	Allow users to migrate to options.level_compaction_dynamic_level_bytes=true using CompactRange() Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true. Test Plan: Add a unit test for it. Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39099	2015-06-01 18:21:14 -07:00
agiardullo	dc9d70de65	Optimistic Transactions Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported. Test Plan: Added a new test, but still need to look into stress testing. Reviewers: yhchiang, igor, rven, sdong Reviewed By: sdong Subscribers: adamretter, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D33435	2015-05-29 14:36:35 -07:00
Yueh-Hsuan Chiang	3ab8ffd4dd	Compaction now conditionally boosts the size of deletion entries. Summary: Compaction now boosts the size of deletion entries of a file only when the number of deletion entries is greater than the number of non-deletion entries in the file. The motivation here is that in a stable workload, the number of deletion entries should be roughly equal to the number of non-deletion entries. If we compensate the size of deletion entries in a stable workload, the deletion compensation logic might introduce unwanted effet which changes the shape of LSM tree. Test Plan: db_test --gtest_filter="Deletion" Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38703	2015-05-26 14:05:38 -07:00
Igor Canadi	7a3577519f	Don't artificially inflate L0 score Summary: This turns out to be pretty bad because if we prioritize L0->L1 then L1 can grow artificially large, which makes L0->L1 more and more expensive. For example: 256MB @ L0 + 256MB @ L1 --> 512MB @ L1 256MB @ L0 + 512MB @ L1 --> 768MB @ L1 256MB @ L0 + 768MB @ L1 --> 1GB @ L1 .... 256MB @ L0 + 10GB @ L1 --> 10.2GB @ L1 At some point we need to start compacting L1->L2 to speed up L0->L1. Test Plan: The performance improvement is massive for heavy write workload. This is the benchmark I ran: https://phabricator.fb.com/P19842671. Before this change, the benchmark took 47 minutes to complete. After, the benchmark finished in 2minutes. You can see full results here: https://phabricator.fb.com/P19842674 Also, we ran this diff on MongoDB on RocksDB on one replicaset. Before the change, our initial sync was so slow that it couldn't keep up with primary writes. After the change, the import finished without any issues Reviewers: dynamike, MarkCallaghan, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38637	2015-05-21 11:40:48 -07:00
Igor Canadi	dddceefe5e	Fix clang build Summary: fix build Test Plan: works Reviewers: kradhakrishnan Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37911	2015-04-30 11:11:35 -07:00
krad	d4540654e9	Optimize GetApproximateSizes() to use lesser CPU cycles. Summary: CPU profiling reveals GetApproximateSizes as a bottleneck for performance. The current implementation is sub-optimal, it scans every file in every level to compute the result. We can take advantage of the fact that all levels above 0 are sorted in the increasing order of key ranges and use binary search to locate the starting index. This can reduce the number of comparisons required to compute the result. Test Plan: We have good test coverage. Run the tests. Reviewers: sdong, igor, rven, dynamike Subscribers: dynamike, maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37755	2015-04-30 10:55:03 -07:00
Igor Canadi	7f47ba0e26	Fix possible SIGSEGV in CompactRange (github issue #596 ) Summary: For very detailed explanation of what's happening read this: https://github.com/facebook/rocksdb/issues/596 Test Plan: make check + new unit test Reviewers: yhchiang, anthony, rven Reviewed By: rven Subscribers: adamretter, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37779	2015-04-29 10:52:31 -07:00
clark.kang	6ede020dc4	fix typos	2015-04-25 18:14:27 +09:00
Igor Canadi	e003d3864c	Abstract out SetMaxPossibleForUserKey() and SetMinPossibleForUserKey Summary: Based on feedback from D37083. Are all of these correct? In some spaces it seems like we're doing SetMaxPossibleForUserKey() although we want the smallest possible internal key for user key. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37341	2015-04-23 18:08:37 -07:00
sdong	9bf40b64d0	Print max score in level summary Summary: Add more logging to help debugging issues. Test Plan: Run test suites Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37401	2015-04-23 11:34:36 -07:00
Igor Canadi	6059bdf86a	Add experimental API MarkForCompaction() Summary: Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys). All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart. MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction. Test Plan: added a new unit test Reviewers: yhchiang, rven, MarkCallaghan, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37083	2015-04-17 16:44:45 -07:00
Jim Meyering	d2a92c13bc	avoid returning a number-of-active-keys estimate of nearly 2^64 Summary: If accumulated_num_non_deletions_ were ever smaller than accumulated_num_deletions_, the computation of "accumulated_num_non_deletions_ - accumulated_num_deletions_" would result in a logically "negative" value, but since the two operands are unsigned (uint64_t), the result corresponding to e.g., -1 would 2^64-1. Instead, return 0 in that case. Test Plan: - ensure "make check" still passes - temporarily add an "abort();" call in the new "if"-block, and observe that it fails in some test cases. However, note that this case is triggered only when the two numbers are equal. Thus, no test case triggers the erroneous behavior this change is designed to avoid. If anyone can construct a scenario in which that bug would be triggered, I'll be happy to add a test case. Reviewers: ljin, igor, rven, igor.sugak, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D36489	2015-04-03 14:46:35 -07:00
sdong	a7ac6cef1f	Fix level size overflow for options_.level_compaction_dynamic_level_bytes=true Summary: Int is used for level size targets when options_.level_compaction_dynamic_level_bytes=true, which will cause overflow when database grows big. Fix it. Test Plan: Add a new unit test which fails without the fix. Reviewers: rven, yhchiang, MarkCallaghan, igor Reviewed By: igor Subscribers: leveldb, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D36453	2015-04-03 09:04:35 -07:00
sdong	b23bbaa82a	Universal Compactions with Small Files Summary: With this change, we use L1 and up to store compaction outputs in universal compaction. The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible. If options.num_levels=1, it behaves all the same as now. Test Plan: 1) convert most of existing unit tests for universal comapaction to include the option of one level and multiple levels. 2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels 3) add unit test to migrate from multiple level setting back to one level setting 4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results. Reviewers: rven, kradhakrishnan, yhchiang, igor Reviewed By: igor Subscribers: meyering, leveldb, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D34539	2015-03-30 15:12:02 -07:00
Anurag Indu	3d1a924ff3	Adding stats for the merge and filter operation Summary: We have addded new stats and perf_context for measuring the merge and filter operation time consumption. We have bounded all the merge operations within the GUARD statment and collected the total time for these operations in the DB. Test Plan: WIP Reviewers: rven, yhchiang, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D34377	2015-03-24 14:42:04 -07:00
Igor Canadi	b088c83e6e	Don't delete files when column family is dropped Summary: To understand the bug read t5943287 and check out the new test in column_family_test (ReadDroppedColumnFamily), iter 0. RocksDB contract allowes you to read a drop column family as long as there is a live reference. However, since our iteration ignores dropped column families, AddLiveFiles() didn't mark files of a dropped column families as live. So we deleted them. In this patch I no longer ignore dropped column families in the iteration. I think this behavior was confusing and it also led to this bug. Now if an iterator client wants to ignore dropped column families, he needs to do it explicitly. Test Plan: Added a new unit test that is failing on master. Unit test succeeds now. Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32535	2015-03-19 17:04:29 -07:00
Igor Canadi	c88ff4ca76	Deprecate removeScanCountLimit in NewLRUCache Summary: It is no longer used by the implementation, so we should also remove it from the public API. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34971	2015-03-17 15:04:37 -07:00
Igor Canadi	db03739340	options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically. Summary: When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases. In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier. Test Plan: New unit tests and pass tests suites including valgrind. Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo Reviewed By: ikabiljo Subscribers: yoshinorim, ikabiljo, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31437	2015-03-02 22:40:41 -08:00
Igor Sugak	62247ffa3b	rocksdb: Add missing override Summary: When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes. Prerequisites: bear and clang 3.5 build with extra tools ```lang=bash % USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html % clang-modernize -p . -include . -add-override % make format ``` Test Plan: Make sure all tests are passing. ```lang=bash % #Use default fb code clang. % make check ``` Verify less error and no missing override errors. ```lang=bash % # Have trunk clang present in path. % ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make ``` Reviewers: igor, kradhakrishnan, rven, meyering, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34077	2015-02-26 11:28:41 -08:00
Jim Meyering	9283c7afd2	build: remove always-true assertions Summary: Remove some always-true assertions. They provoke these compilation failures: table/plain_table_key_coding.cc:279:20: error: comparison of unsigned expression >= 0 is always true [-Werror=type-limits] db/version_set.cc:336:15: error: comparison of unsigned expression >= 0 is always true [-Werror=type-limits] * table/plain_table_key_coding.cc (rocksdb): Remove assertion that unsigned type variable is >= 0. * db/version_set.cc (DoGenerateLevelFilesBrief): Likewise. Test Plan: Run "make EXTRA_CXXFLAGS='-W -Wextra'" and see fewer errors. Reviewers: ljin, sdong, igor.sugak, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D33747	2015-02-20 11:07:03 -08:00
sdong	d45a6a4002	Add rocksdb.num-live-versions: number of live versions Summary: Add a DB property about live versions. It can be helpful to figure out whether there are files not live but not yet deleted, in some use cases. Test Plan: make all check Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33327	2015-02-19 13:10:37 -08:00
Igor Canadi	863009b5a5	Fix deleting obsolete files #2 Summary: For description of the bug, see comment in db_test. The fix is pretty straight forward. Test Plan: added unit test. eventually we need better testing of FOF/POF process. Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33081	2015-02-09 17:38:32 -08:00
Grace Law	1851f977c2	Added RocksDB stats GET_HIT_L0 and GET_HIT_L1 Summary: - In statistics.h , added tickers. - In version_set.cc, -- Added a getter method for hit_file_level_ in the class FilePicker -- Added a line in the Get() method in case of a found, increment the corresponding counters based on the level of the file respectively. Corresponding task: https://our.intern.facebook.com/intern/tasks/?s=506100481&t=5952818 Personal fork: `0c3f2e3600` Test Plan: In terminal, ``` make -j32 db_test ROCKSDB_TESTS=L0L1L2AndUpHitCounter ./db_test ``` Or to use debugger, ``` make -j32 db_test export ROCKSDB_TESTS=L0L1L2AndUpHitCounter gdb db_test ``` Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D32205	2015-02-09 14:53:58 -08:00
Igor Canadi	2a979822b6	Fix deleting obsolete files Summary: This diff basically reverts D30249 and also adds a unit test that was failing before this patch. I have no idea how I didn't catch this terrible bug when writing a diff, sorry about that :( I think we should redesign our system of keeping track of and deleting files. This is already a second bug in this critical piece of code. I'll think of few ideas. BTW this diff is also a regression when running lots of column families. I plan to revisit this separately. Test Plan: added a unit test Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33045	2015-02-06 08:44:30 -08:00
Yueh-Hsuan Chiang	181191a1e4	Add a counter for collecting the wait time on db mutex. Summary: Add a counter for collecting the wait time on db mutex. Also add MutexWrapper and CondVarWrapper for measuring wait time. Test Plan: ./db_test export ROCKSDB_TESTS=MutexWaitStats ./db_test verify stats output using db_bench make clean make release ./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10 Sample output: rocksdb.db.mutex.wait.micros COUNT : 7546866 Reviewers: MarkCallaghan, rven, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32787	2015-02-04 21:39:45 -08:00
Igor Canadi	e39f4f6cf9	Fix data race #3 Summary: Added requirement that ComputeCompactionScore() be executed in mutex, since it's accessing being_compacted bool, which can be mutated by other threads. Also added more comments about thread safety of FileMetaData, since it was a bit confusing. However, it seems that FileMetaData doesn't have data races (except being_compacted) Test Plan: Ran 100 ConvertCompactionStyle tests with thread sanitizer. On master -- some failures. With this patch -- none. Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32283	2015-02-04 16:04:51 -08:00
sdong	4e48753b73	Sync manifest file when initializing it Summary: Now we don't sync manifest file when initializing it, so DB cannot be safely reopened before the first mem table flush. Fix it by syncing it. This fixes fault_injection_test. Test Plan: make all check Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32001	2015-01-22 14:32:03 -08:00
sdong	bb128bfec3	More accurate message for compaction applied to a different version Test Plan: Compile. Run it. Reviewers: yhchiang, dhruba, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D31479	2015-01-13 16:48:18 -08:00
sdong	9ef59a09a5	VersionSet::AddLiveFiles() to assert current version is included. Summary: Add an extra assert to make sure current version is included in VersionSet::AddLiveFiles(). Test Plan: make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, hermanlee4, leveldb Differential Revision: https://reviews.facebook.net/D30819	2015-01-07 12:03:40 -08:00
Igor Canadi	0acc738810	Speed up FindObsoleteFiles() Summary: There are two versions of FindObsoleteFiles(): * full scan, which is executed every 6 hours (and it's terribly slow) * no full scan, which is executed every time a background process finishes and iterator is deleted This diff is optimizing the second case (no full scan). Here's what we do before the diff: * Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live. * Get the list of live files to avoid deleting files that are live. * Delete files that are in obsolete_files and not in live_files. After this diff: * The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files. * Get the list of obsolete files (which exclude moved files). * No need to get the list of live files, since all files in obsolete_files need to be deleted. I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123 This depends on D30123. P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff. Test Plan: One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version. make check Big number of compactions and flushes: ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30249	2014-12-22 12:04:45 +01:00
Jonah Cohen	a14b7873ee	Enforce write buffer memory limit across column families Summary: Introduces a new class for managing write buffer memory across column families. We supplement ColumnFamilyOptions::write_buffer_size with ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer instance that enforces memory limits before flushing out to disk. Test Plan: Added SharedWriteBuffer unit test to db_test.cc Reviewers: sdong, rven, ljin, igor Reviewed By: igor Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim Differential Revision: https://reviews.facebook.net/D22581	2014-12-02 12:09:20 -08:00
Lei Jin	9c7ca65d21	free builders in VersionSet::DumpManifest Summary: Reported by bootcamper This causes ldb tool to fail the assertion in ~ColumnFamilyData() Test Plan: ./ldb --db=/tmp/test_db1 --create_if_missing put a1 b1 ./ldb manifest_dump --path=/tmp/test_db1/MANIFEST-000001 Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29517	2014-11-24 15:03:08 -08:00
Lei Jin	8d3f8f9696	remove all remaining references to cfd->options() Summary: The very last reference happens in DBImpl::GetOptions() I built with both DBImpl::GetOptions() and ColumnFamilyData::options() commented out Test Plan: make all check Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29073	2014-11-18 10:20:10 -08:00
Yueh-Hsuan Chiang	1d1a64f58a	Move NeedsCompaction() from VersionStorageInfo to CompactionPicker Summary: Move NeedsCompaction() from VersionStorageInfo to CompactionPicker to allow different compaction strategy to have their own way to determine whether doing compaction is necessary. When compaction style is set to kCompactionStyleNone, then NeedsCompaction() will always return false. Test Plan: export ROCKSDB_TESTS=Compact ./db_test Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28719	2014-11-13 13:41:43 -08:00
Igor Canadi	25f273027b	Fix iOS compile with -Wshorten-64-to-32 Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :( Test Plan: TARGET_OS=IOS make static_lib Reviewers: dhruba, ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28743	2014-11-13 14:39:30 -05:00
sdong	fa50abb726	Fix bug of reading from empty DB. Summary: I found that db_stress sometimes segfault on my machine. Fix the bug. Test Plan: make all check. Run db_stress Reviewers: ljin, yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D28803	2014-11-13 10:42:18 -08:00
Yueh-Hsuan Chiang	5811419357	Fixed GetEstimatedActiveKeys Summary: Fixed a bug in GetEstimatedActiveKeys which does not normalized the sampled information correctly. Add a test in version_builder_test. Test Plan: version_builder_test Reviewers: ljin, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28707	2014-11-11 14:28:18 -08:00
Igor Canadi	767777c2bd	Turn on -Wshorten-64-to-32 and fix all the errors Summary: We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details. This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables. Test Plan: compiles Reviewers: ljin, rven, yhchiang, sdong Reviewed By: yhchiang Subscribers: bobbaldwin, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28689	2014-11-11 16:47:22 -05:00
Igor Canadi	e3d3567b5b	Get rid of mutex in CompactionJob's state Summary: Based on @sdong's feedback in the diff, we shouldn't keep db_mutex in CompactionJob's state. This diff removes db_mutex from CompactionJob state, by making next_file_number_ atomic. That way we only need to pass the lock to InstallCompactionResults() because of LogAndApply() Test Plan: make check Reviewers: ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28491	2014-11-07 15:44:12 -08:00
Yueh-Hsuan Chiang	28c82ff1b3	CompactFiles, EventListener and GetDatabaseMetaData Summary: This diff adds three sets of APIs to RocksDB. = GetColumnFamilyMetaData = * This APIs allow users to obtain the current state of a RocksDB instance on one column family. * See GetColumnFamilyMetaData in include/rocksdb/db.h = EventListener = * A virtual class that allows users to implement a set of call-back functions which will be called when specific events of a RocksDB instance happens. * To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners = CompactFiles = * CompactFiles API inputs a set of file numbers and an output level, and RocksDB will try to compact those files into the specified level. = Example = * Example code can be found in example/compact_files_example.cc, which implements a simple external compactor using EventListener, GetColumnFamilyMetaData, and CompactFiles API. Test Plan: listener_test compactor_test example/compact_files_example export ROCKSDB_TESTS=CompactFiles db_test export ROCKSDB_TESTS=MetaData db_test Reviewers: ljin, igor, rven, sdong Reviewed By: sdong Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D24705	2014-11-07 14:45:18 -08:00
Igor Canadi	9f20395cd6	Turn -Wshadow back on Summary: It turns out that -Wshadow has different rules for gcc than clang. Previous commit fixed clang. This commits fixes the rest of the warnings for gcc. Test Plan: compiles Reviewers: ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28131	2014-11-06 11:14:28 -08:00
Yueh-Hsuan Chiang	d8e1196635	Apply InfoLogLevel to the logs in db/version_set.cc Summary: Apply InfoLogLevel to the logs in db/version_set.cc Test Plan: make Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27879	2014-11-04 10:34:33 -08:00
sdong	ac6afaf9ef	Enforce naming convention of getters in version_set.h Summary: Enforce the accessier naming convention in functions in version_set.h Test Plan: make all check Reviewers: ljin, yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D28143	2014-11-04 09:59:05 -08:00
sdong	86905e3cbb	Move VersionBuilder logic to a separate .cc file Summary: Move all the logic of VersionBuilder to a separate .cc file Test Plan: make all check Reviewers: ljin, yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D28083	2014-10-31 16:34:38 -07:00
sdong	f7e6c856ab	Fix BaseReferencedVersionBuilder's destructor order Summary: BaseReferencedVersionBuilder now unreference version before destructing VersionBuilder, which is wrong. Fix it. Test Plan: make all check valgrind test to tests that used to fail Reviewers: igor, yhchiang, rven, ljin Reviewed By: ljin Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D28101	2014-10-31 13:42:18 -07:00
Igor Canadi	9f7fc3ac45	Turn on -Wshadow Summary: ...and fix all the errors :) Jim suggested turning on -Wshadow because it helped him fix number of critical bugs in fbcode. I think it's a good idea to be -Wshadow clean. Test Plan: compiles Reviewers: yhchiang, rven, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27711	2014-10-31 11:59:54 -07:00
sdong	4d2ba38b65	Make VersionBuilder unit testable Summary: Rename Version::Builder to VersionBuilder and expose its definition to a header. Make VerisonBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo. Add version_builder_test which has a simple test. Test Plan: make all check Reviewers: rven, yhchiang, igor, ljin Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D27969	2014-10-31 10:44:06 -07:00
sdong	76d1c28e82	Make CompactionPicker more easily tested Summary: Make compaction picker easier to test. The basic idea is to separate a minimum subcomponent of Version to VersionStorageInfo, which just responsible to LSM tree. A stub VersionStorageInfo can then be easily created and passed into compaction picker so that we can check the outputs. It now passes most tests. Still two things need to be done: (1) deal with the FIFO compaction's file size. (2) write an example test to make sure the interface can do the job. Add a compaction_picker_test to make sure compaction picker codes can be easily unit tested. Test Plan: Pass all unit tests and compaction_picker_test Reviewers: yhchiang, rven, igor, ljin Reviewed By: ljin Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D27639	2014-10-29 15:16:53 -07:00
Yueh-Hsuan Chiang	3772a3d09d	Fix the bug where compaction does not fail when RocksDB can't create a new file. Summary: This diff has two fixes. 1. Fix the bug where compaction does not fail when RocksDB can't create a new file. 2. When NewWritableFiles() fails in OpenCompactionOutputFiles(), previously such fail-to-created file will be still be included as a compaction output. This patch also fixes this bug. 3. Allow VersionEdit::EncodeTo() to return Status and add basic check. Test Plan: ./version_edit_test export ROCKSDB_TESTS=FileCreationRandomFailure ./db_test Reviewers: ljin, sdong, nkg-, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D25581	2014-10-28 14:27:26 -07:00
Lei Jin	efa2fb33b0	make LevelFileNumIterator and LevelFileIteratorState anonymous Summary: No need to expose them in .h Test Plan: make release Reviewers: igor, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27645	2014-10-28 11:42:22 -07:00
Lei Jin	f981e08139	unfriend ColumnFamilyData from VersionSet Summary: as title Test Plan: make release will run full test on all stacked diffs before committing Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27591	2014-10-28 10:04:38 -07:00
Lei Jin	834c67d77f	rename FileLevel to LevelFilesBrief / unfriend CompactedDBImpl Summary: We have several different types of data structures for file information. FileLevel is kinda of confusing since it only contains file range and fd. Rename it to LevelFilesBrief to make it clear. Unfriend CompactedDBImpl as a by product Test Plan: make release / make all will run full test with all stacked diffs Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27585	2014-10-28 10:03:13 -07:00
Lei Jin	a28b3c4388	unfriend UniversalCompactionPicker,LevelCompactionPicker and FIFOCompactionPicker from VersionSet Summary: as title Test Plan: make release I will run make all check for all stacked diffs before commit Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27573	2014-10-28 10:00:51 -07:00
Lei Jin	5187d896b9	unfriend Compaction and CompactionPicker from VersionSet Summary: as title Test Plan: running make all check Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27549	2014-10-28 09:59:56 -07:00
Lei Jin	f1841985e4	dynamic inplace_update options Summary: Make inplace_update_support and inplace_update_num_locks dynamic. inplace_callback becomes immutable We are almost free of references to cfd->options() in db_impl Test Plan: unit test Reviewers: igor, yhchiang, rven, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D25293	2014-10-27 12:10:13 -07:00
Lei Jin	122f98e0b9	dynamic max_mem_compact_level Summary: as title Test Plan: unit test Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D25347	2014-10-23 15:37:14 -07:00
Yueh-Hsuan Chiang	6c66918645	Speed up DB::Open() and Version creation by limiting the number of FileMetaData initialization. Summary: This diff speeds up DB::Open() and Version creation by limiting the number of FileMetaData initialization. The behavior of Version::UpdateAccumulatedStats() is changed as follows: * It only initializes the first 20 uninitialized FileMetaData from file. This guarantees the size of the latest 20 files will always be compensated when they have any deletion entries. Previously it may initialize all FileMetaData by loading all files at DB::Open(). * In case none the first 20 files has any data entry, UpdateAccumulatedStats() will initialize the FileMetaData of the oldest file. Test Plan: db_test Reviewers: igor, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D24255	2014-10-17 14:58:30 -07:00
Igor Canadi	d6987216c9	Merge pull request #327 from dalgaaf/wip-da-SCA-20141001 Fix some issues from SCA	2014-10-02 10:59:52 -07:00
Lei Jin	5ec53f3edf	make compaction related options changeable Summary: make compaction related options changeable. Most of changes are tedious, following the same convention: grabs MutableCFOptions at the beginning of compaction under mutex, then pass it throughout the job and register it in SuperVersion at the end. Test Plan: make all check Reviewers: igor, yhchiang, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23349	2014-10-01 16:19:16 -07:00
Danny Al-Gaaf	4a171882d6	db/version_set.cc: remove unnecessary checks Fix for: [db/version_set.cc:1219]: (style) Unsigned variable 'last_file' can't be negative so it is unnecessary to test it. [db/version_set.cc:1234]: (style) Unsigned variable 'first_file' can't be negative so it is unnecessary to test it. Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>	2014-10-01 11:09:22 +02:00
Danny Al-Gaaf	b8b7117e97	db/version_set.cc: use !empty() instead of 'size() > 0' Use empty() since it should be prefered as it has, following the standard, a constant time complexity regardless of the containter type. The same is not guaranteed for size(). Fix for: [db/version_set.cc:2250]: (performance) Possible inefficient checking for 'column_families_not_found' emptiness. Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>	2014-09-30 23:30:31 +02:00
Lei Jin	2faf49d5f1	use GetContext to replace callback function pointer Summary: Intead of passing callback function pointer and its arg on Table::Get() interface, passing GetContext. This makes the interface cleaner and possible better perf. Also adding a fast pass for SaveValue() Test Plan: make all check Reviewers: igor, yhchiang, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D24057	2014-09-29 11:09:09 -07:00
Lei Jin	3c68006109	CompactedDBImpl Summary: Add a CompactedDBImpl that will enabled when calling OpenForReadOnly() and the DB only has one level (>0) of files. As a performan comparison, CuckooTable performs 2.1M/s with CompactedDBImpl vs. 1.78M/s with ReadOnlyDBImpl. Test Plan: db_bench Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23553	2014-09-25 11:14:01 -07:00
Lei Jin	a062e1f2c4	SetOptions() for memtable related options Summary: as title Test Plan: make all check I will think a way to set up stress test for this Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23055	2014-09-17 12:49:13 -07:00
Igor Canadi	4a27a2f193	Don't sync manifest when disableDataSync = true Summary: As we discussed offline Test Plan: compiles Reviewers: yhchiang, sdong, ljin, dhruba Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22989	2014-09-15 11:32:01 -07:00
Lei Jin	9b0f7ffa1c	rename version_set options_ to db_options_ to avoid confusion Summary: as title Test Plan: make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23007	2014-09-08 15:25:01 -07:00
Lei Jin	048560a642	reduce references to cfd->options() in DBImpl Summary: I found it is almost impossible to get rid of this function in a single batch. I will take a step by step approach Test Plan: make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22995	2014-09-08 15:04:34 -07:00
Igor Canadi	a2bb7c3c33	Push- instead of pull-model for managing Write stalls Summary: Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either: * proceed with all writes without delay * delay all writes by fixed time * stop all writes The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case). When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal. This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write. Test Plan: make check for now. I'll add some unit tests later. Also, perf test. Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22791	2014-09-08 11:20:25 -07:00
liuhuahang	bb6ae0f80c	fix more compile warnings N/A Change-Id: I5b6f9c70aea7d3f3489328834fed323d41106d9f Signed-off-by: liuhuahang <liuhuahang@zerus.co>	2014-09-05 14:14:37 +08:00
Stanislau Hlebik	45a5e3ede0	Remove path with arena==nullptr from NewInternalIterator Summary: Simply code by removing code path which does not use Arena from NewInternalIterator Test Plan: make all check make valgrind_check Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22395	2014-09-04 17:40:41 -07:00
Yueh-Hsuan Chiang	570ba5aca8	Avoid retrying to read property block from a table when it does not exist. Summary: Avoid retrying to read property block from a table when it does not exist in updating stats for compensating deletion entries. In addition, ReadTableProperties() now returns Status::NotFound instead of Status::Corruption when table properties does not exist in the file. Test Plan: make db_test -j32 export ROCKSDB_TESTS=CompactionDeleteionTrigger ./db_test Reviewers: ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21867	2014-08-15 12:17:44 -07:00
Stanislau Hlebik	0c9dc9f8e0	Remove malloc from FormatFileNumber Summary: Replace unnecessary malloc with stack allocation Test Plan: make all check Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21771	2014-08-13 11:57:40 -07:00
miguelportilla	93e6b5e9d9	Changes to support unity build: * Script for building the unity.cc file via Makefile * Unity executable Makefile target for testing builds * Source code changes to fix compilation of unity build	2014-08-11 13:22:47 -04:00
sdong	1242bfcad7	Add DB property "rocksdb.estimate-table-readers-mem" Summary: Add a DB Property "rocksdb.estimate-table-readers-mem" to return estimated memory usage by all loaded table readers, other than allocated from block cache. Refactor the property codes to allow getting property from a version, with DB mutex not acquired. Test Plan: Add several checks of this new property in existing codes for various cases. Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: xjin, igor, leveldb Differential Revision: https://reviews.facebook.net/D20733	2014-08-06 11:39:46 -07:00
Feng Zhu	1129921e9b	logging_when_create_and_delete_manifest Summary: 1. logging when create and delete manifest file 2. fix formating in table/format.cc Test Plan: make all check run db_bench, track the LOG file. Reviewers: ljin, yhchiang, igor, yufei.zhu, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21009	2014-08-04 11:25:42 -07:00
Yueh-Hsuan Chiang	49ee5a4ac4	Fixed the crash when merge_operator is not properly set after reopen. Summary: Fixed the crash when merge_operator is not properly set after reopen and added two test cases for this. Test Plan: make merge_test ./merge_test Reviewers: igor, ljin, sdong Reviewed By: sdong Subscribers: benj, mvikjord, leveldb Differential Revision: https://reviews.facebook.net/D20793	2014-07-30 17:24:36 -07:00
sdong	f6784766db	Add DB property estimated number of keys Summary: Add a DB property of estimated number of live keys, by adding number of entries of all mem tables and all files, subtracted by all deletions in all files. Test Plan: Add the case in unit tests Reviewers: hobbymanyp, ljin Reviewed By: ljin Subscribers: MarkCallaghan, yoshinorim, leveldb, igor, dhruba Differential Revision: https://reviews.facebook.net/D20631	2014-07-28 15:27:58 -07:00
Lei Jin	40fa8a4cd5	make statistics forward-able Summary: Make StatisticsImpl being able to forward stats to provided statistics implementation. The main purpose is to allow us to collect internal stats in the future even when user supplies custom statistics implementation. It avoids intrumenting 2 sets of stats collection code. One immediate use case is tuning advisor, which needs to collect some internal stats, users may not be interested. Test Plan: ran db_bench and see stats show up at the end of run Will run make all check since some tests rely on statistics Reviewers: yhchiang, sdong, igor Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D20145	2014-07-28 12:05:36 -07:00
Yueh-Hsuan Chiang	6480717a26	Fixed compaction-related errors where number of input levels are hard-coded. Summary: Fixed compaction-related errors where number of input levels are hard-coded. It's a bug found in compaction branch. This diff will be pushed into master. Test Plan: export ROCKSDB_TESTS=Compact make db_test -j32 ./db_test also passed the tests in compaction branch Reviewers: igor, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20577	2014-07-24 17:06:00 -07:00
sdong	f6b7e1ed1a	Allow user to specify DB path of output file of manual compaction Summary: Add a parameter path_id to DB::CompactRange(), to indicate where the output file should be placed to. Test Plan: add a unit test Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D20085	2014-07-21 19:06:00 -07:00
Yueh-Hsuan Chiang	052ddbe0e2	Add MaxInputLevel() to CompactionPicker Summary: Having if-then branch for different compaction strategies is considered hacky and make CompactionPicker less pluggable. This diff removes two of such if-then branches in version_set.cc by adding MaxInputLevel() to CompactionPicker. // Given the current number of levels, returns the lowest allowed level // for compaction input. virtual int MaxInputLevel(int current_num_levels) const; Test Plan: make db_test export ROCKSDB_TESTS=Compaction ./db_test Reviewers: igor, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19971	2014-07-17 18:01:04 -07:00
Radheshyam Balasundaram	0d57e3ad7d	Guarding files_ attribute with #ifndef NDEBUG guard in FilePicker class. Summary: Adding guards to files_ attribute of FilePicker class. This attribute is used only in DEBUG mode. This fixes build of static_lib in mac. Test Plan: make static_lib in mac make check all in devserver Reviewers: ljin, igor, sdong Reviewed By: sdong Differential Revision: https://reviews.facebook.net/D20163	2014-07-17 15:07:05 -07:00
Yueh-Hsuan Chiang	296e340753	Add struct CompactionInputFiles to manage compaction input files. Summary: Add struct CompactionInputFiles to manage compaction input files. Test Plan: export ROCKSDB_TESTS=Compact make db_test ./db_test Reviewers: ljin, igor, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20061	2014-07-16 18:12:17 -07:00
Radheshyam Balasundaram	0418e66e2a	Refactoring Version::Get() Summary: Refactoring Version::Get() method to move file picker logic to a separate class. Test Plan: make check all Reviewers: igor, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19713	2014-07-16 13:33:02 -07:00
Feng Zhu	c11d604ab3	store file_indexer info in sequential memory Summary: use arena to allocate space for next_level_index_ and level_rb_ Thus increasing data locality and make Version::Get faster. Benchmark detail Base version: commit `d2a727c182` command used: ./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=2097152 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=fillseq, readrandom,readrandom,readrandom --use_existing_db=0 --num=52428800 --threads=1 Result: cpu running percentage: Version::Get, improved from 7.98% to 7.42% FileIndexer::GetNextLevelIndex, improved from 1.18% to 0.68%. Test Plan: make all check Reviewers: ljin, haobo, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, igor Differential Revision: https://reviews.facebook.net/D19845	2014-07-16 11:21:30 -07:00
sdong	0abaed2e08	Support multiple DB directories in universal compaction style Summary: This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file. Level-style stays the same. Test Plan: Add new unit tests Reviewers: ljin, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, dhruba, igor, leveldb Differential Revision: https://reviews.facebook.net/D19869	2014-07-15 12:06:28 -07:00
Feng Zhu	178fd6f9db	use FileLevel in LevelFileNumIterator Summary: Use FileLevel in LevelFileNumIterator, thus use new version of findFile. Old version of findFile function is deleted. Write a function in version_set.cc to generate FileLevel from files_. Add GenerateFileLevelTest in version_set_test.cc Test Plan: make all check Reviewers: ljin, haobo, yhchiang, sdong Reviewed By: sdong Subscribers: igor, dhruba Differential Revision: https://reviews.facebook.net/D19659	2014-07-11 12:52:41 -07:00
Feng Zhu	f697cad15c	create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster Summary: Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena Thus increase the file meta data locality, speed up "Get" and "FindFile" benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing" benchmark command: ./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920 Read Random: From 1.8363 ms/op, improve to 1.7587 ms/op. Read while writing: From 2.985 ms/op, improve to 2.924 ms/op. Test Plan: make all check Reviewers: ljin, haobo, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, igor Differential Revision: https://reviews.facebook.net/D19419	2014-07-09 22:14:39 -07:00
Yueh-Hsuan Chiang	70828557ef	Some fixes on size compensation logic for deletion entry in compaction Summary: This patch include two fixes: 1. newly created Version will now takes the aggregated stats for average-value-size from the latest Version. 2. compensated size of a file is now computed only for newly created / loaded file, this addresses the issue where files are already sorted by their compensated file size but might sometimes observe some out-of-order due to later update on compensated file size. Test Plan: export ROCKSDB_TESTS=CompactionDele ./db_test Reviewers: ljin, igor, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19557	2014-07-09 12:46:08 -07:00
Igor Canadi	4203431e71	Fix mac os compile error	2014-07-03 23:03:24 +02:00
sdong	2459f7ec4e	Support Multiple DB paths (without having an interface to expose to users) Summary: In this patch, we allow RocksDB to support multiple DB paths internally. No user interface is supported yet so this patch is silent to users. Test Plan: make all check Reviewers: igor, haobo, ljin, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18921	2014-07-02 21:14:44 -07:00
Igor Canadi	a2e0d890ed	No need for files_by_size_ in universal compaction Summary: files_by_size_ is sorted by time in case of universal compaction. However, Version::files_ is also sorted by time. So no need for files_by_size_ Test Plan: 1) make check with the change 2) make check with `assert(last_index == c->input_version_->files_[level].size() - 1);` in compaction picker Reviewers: dhruba, haobo, yhchiang, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19125	2014-07-01 08:55:04 +02:00
Yueh-Hsuan Chiang	e813f5b6d9	Allow compaction to reclaim storage more effectively. Summary: This diff allows compaction to reclaim storage more effectively. In the current design, compactions are mainly triggered based on the file sizes. However, since deletion entries does not have value, files which have many deletion entries are less likely to be compacted. As a result, it may took a while to make deletion entries to be compacted. This diff address issue by compensating the size of deletion entries during compaction process: the size of each deletion entry in the compaction process is augmented by 2x average value size. The diff applies to both leveled and universal compacitons. Test Plan: develop CompactionDeletionTrigger make db_test ./db_test Reviewers: haobo, igor, ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19029	2014-06-24 16:37:06 -06:00
Igor Canadi	d4a8423334	Remove seek compaction Summary: As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases. This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction. There is one test case that relied on didIO information. I'll try to find another way to implement it. Test Plan: make check Reviewers: sdong, haobo, yhchiang, ljin, dhruba Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19161	2014-06-20 10:23:02 +02:00
Igor Canadi	107e08baa7	Use same sorting for all level 0 files Summary: We decided that one of the long term goals is to unify level and universal compaction. As a small first step, I'm unifying level 0 sorting methods. Previously, we used to sort level 0 files in level compaction by file number and in universal compaction by sequence number. But it turns out that in level compaction, sorting by file number is exactly the same as sorting by sequence number. Test Plan: Ran make check with bunch of asserts to verify the sorting order is exactly the same. Also, make check with this patch Reviewers: haobo, yhchiang, ljin, dhruba, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19131	2014-06-20 09:12:14 +02:00
sdong	cadc1adffa	Refactor: group metadata needed to open an SST file to a separate copyable struct Summary: We added multiple fields to FileMetaData recently and are planning to add more. This refactoring separate the minimum information for accessing the file. This object is copyable (FileMetaData is not copyable since the ref counter). I hope this refactoring can enable further improvements: (1) use it to design a more efficient data structure to speed up read queries. (2) in the future, when we add information of storage level, we can easily do the encoding, instead of enlarge this structure, which might expand memory work set for file meta data. The definition is same as current EncodedFileMetaData used in two level iterator, so now the logic in two level iterator is easier to understand. Test Plan: make all check Reviewers: haobo, igor, ljin Reviewed By: ljin Subscribers: leveldb, dhruba, yhchiang Differential Revision: https://reviews.facebook.net/D18933	2014-06-16 16:10:52 -07:00
sdong	983c93d731	VersionSet::Get(): Bring back the logic of skipping key range check when there are <=3 level 0 files Summary: https://reviews.facebook.net/D17205 removed the logic of skipping file key range check when there are less than 3 level 0 files. This patch brings it back. Other than that, add another small optimization to avoid to check all the levels if most higher levels don't have any file. Test Plan: make all check Reviewers: ljin Reviewed By: ljin Subscribers: yhchiang, igor, haobo, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D19035	2014-06-13 15:51:44 -07:00
Lei Jin	c83b085770	prefetch bloom filter data block for L0 files Summary: as title Test Plan: db_bench the initial result is very promising. I will post results of complete runs Reviewers: dhruba, haobo, sdong, igor Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D18867	2014-06-12 10:06:18 -07:00
sdong	df9069d23f	In DB::NewIterator(), try to allocate the whole iterator tree in an arena Summary: In this patch, try to allocate the whole iterator tree starting from DBIter from an arena 1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it. 2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator. 3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it. Limitations: (1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc (2) Two level iterator itself is allocated in arena, but not iterators inside it. Test Plan: make all check Reviewers: ljin, haobo Reviewed By: haobo Subscribers: leveldb, dhruba, yhchiang, igor Differential Revision: https://reviews.facebook.net/D18513	2014-06-02 17:44:57 -07:00
Igor Canadi	6de6a06631	FIFO compaction style Summary: Introducing new compaction style -- FIFO. FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values. FIFO compaction style is suited for storing high-frequency event logs. Test Plan: Added a unit test Reviewers: dhruba, haobo, sdong Reviewed By: dhruba Subscribers: alberts, leveldb Differential Revision: https://reviews.facebook.net/D18765	2014-05-21 11:43:35 -07:00
Igor Canadi	f4574449e9	Clean up compaction logging Summary: Cleaned up compaction logging a little bit. Now file sizes are easier to read. Also, removed the trailing space. Test Plan: verified that i'm happy with logging output: files_size[#33(seq=101,sz=98KB,0) #31(seq=81,sz=159KB,0) #26(seq=0,sz=637KB,0)] Reviewers: sdong, haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D18549	2014-05-14 12:13:50 -07:00
sdong	9efbd85ac9	fsync directory after creating current file in NewDB() Summary: One of our users reported current file corruption. The machine was rebooted during the time. This is the only think I can think of which could cause current file corruption. Just add this paranoid check. Test Plan: make all check Reviewers: haobo, igor Reviewed By: haobo CC: yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18495	2014-05-06 17:51:33 -07:00
Igor Canadi	096f5be0ed	Put column family information in LiveFileMetaData Summary: As summary Test Plan: compiles :) Reviewers: dhruba, haobo, sdong, yhchiang Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D18405	2014-04-30 16:24:52 -04:00
Lei Jin	ccaca59bee	avoid calling FindFile twice in TwoLevelIterator for PlainTable Summary: this is to reclaim the regression introduced in https://reviews.facebook.net/D17853 Test Plan: make all check Reviewers: igor, haobo, sdong, dhruba, yhchiang Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17985	2014-04-25 12:23:07 -07:00
Lei Jin	d642c60bdc	Check PrefixMayMatch on Seek() Summary: As a follow-up diff for https://reviews.facebook.net/D17805, add optimization to check PrefixMayMatch on Seek() Test Plan: make all check Reviewers: igor, haobo, sdong, yhchiang, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17853	2014-04-25 12:22:23 -07:00
Lei Jin	3995e801ab	kill ReadOptions.prefix and .prefix_seek Summary: also add an override option total_order_iteration if you want to use full iterator with prefix_extractor Test Plan: make all check Reviewers: igor, haobo, sdong, yhchiang Reviewed By: haobo CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D17805	2014-04-25 12:21:34 -07:00
Igor Canadi	ad3cd39ccd	Column family logging Summary: Now that we have column families involved, we need to add extra context to every log message. They now start with "[column family name] log message" Also added some logging that I think would be useful, like level summary after every flush (I often needed that when going through the logs). Test Plan: make check + ran db_bench to confirm I'm happy with log output Reviewers: dhruba, haobo, ljin, yhchiang, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D18303	2014-04-25 09:51:16 -04:00
Lei Jin	0f2d768191	hints for narrowing down FindFile range and avoiding checking unrelevant L0 files Summary: The file tree structure in Version is prebuilt and the range of each file is known. On the Get() code path, we do binary search in FindFile() by comparing target key with each file's largest key and also check the range for each L0 file. With some pre-calculated knowledge, each key comparision that has been done can serve as a hint to narrow down further searches: (1) If a key falls within a L0 file's range, we can safely skip the next file if its range does not overlap with the current one. (2) If a key falls within a file's range in level L0 - Ln-1, we should only need to binary search in the next level for files that overlap with the current one. (1) will be able to skip some files depending one the key distribution. (2) can greatly reduce the range of binary search, especially for bottom levels, given that one file most likely only overlaps with N files from the level below (where N is max_bytes_for_level_multiplier). So on level L, we will only look at ~N files instead of N^L files. Some inital results: measured with 500M key DB, when write is light (10k/s = 1.2M/s), this improves QPS ~7% on top of blocked bloom. When write is heavier (80k/s = 9.6M/s), it gives us ~13% improvement. Test Plan: make all check Reviewers: haobo, igor, dhruba, sdong, yhchiang Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17205	2014-04-21 09:10:12 -07:00
sdong	651792251a	Fix bugs introduced by D17961 Summary: D17961 has two bugs: (1) two level iterator fails to populate FileMetaData.table_reader, causing performance regression. (2) table cache handle the !status.ok() case in the wrong place, causing seg fault which shouldn't happen. Test Plan: make all check Reviewers: ljin, igor, haobo Reviewed By: ljin CC: yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D17991	2014-04-17 17:25:28 -07:00
sdong	fa430bfd04	Minimize accessing multiple objects in Version::Get() Summary: One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch: (1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects (2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it (3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses Test Plan: make all check Reviewers: haobo, ljin Reviewed By: haobo CC: igor, yhchiang, nkg-, leveldb Differential Revision: https://reviews.facebook.net/D17739	2014-04-17 14:14:00 -07:00
Igor Canadi	588bca2020	RocksDBLite Summary: Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile. Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping) Test Plan: compiles :) Reviewers: dhruba, haobo, ljin, sdong, yhchiang Reviewed By: yhchiang CC: leveldb Differential Revision: https://reviews.facebook.net/D17835	2014-04-15 13:39:26 -07:00
Igor Canadi	e6acb874cd	Don't roll empty logs Summary: With multiple column families, especially when manual Flush is executed, we might roll the log file, although the current log file is empty (no data has been written to the log). After the diff, we won't create new log file if current is empty. Next, I will write an algorithm that will flush column families that reference old log files (i.e., that weren't flushed in a while) Test Plan: Added an unit test. Confirmed that unit test failes in master Reviewers: dhruba, haobo, ljin, sdong Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D17631	2014-04-15 09:57:25 -07:00
Igor Canadi	2014915d32	Fix ASAN issue	2014-04-09 10:38:05 -07:00
Igor Canadi	b947fdc89d	Column family support for DB::OpenForReadOnly() Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though) Test Plan: added a unit test in column_family_test Reviewers: haobo, sdong, ljin, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17565	2014-04-09 09:56:17 -07:00
Igor Canadi	5b345b76cb	Remove env_ from MergingIterator Summary: env_ is not used. Compiling for iOS complains. Test Plan: compiles now Reviewers: ljin, haobo, sdong, dhruba Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D17589	2014-04-08 13:40:42 -07:00
Igor Canadi	3d2fe844ab	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/memtable_list.cc db/version_set.cc	2014-04-07 11:31:11 -07:00
sdong	284c365b77	Fix valgrind error caused by FileMetaData as two level iterator's index block handle Summary: It is a regression valgrind bug caused by using FileMetaData as index block handle. One of the fields of FileMetaData is not initialized after being contructed and copied, but I'm not able to find which one. Also, I realized that it's not a good idea to use FileMetaData as in TwoLevelIterator::InitDataBlock(), a copied FileMetaData can be compared with the one in version set byte by byte, but the refs can be changed. Also comparing such a large structure is slightly more expensive. Use a simpler structure instead Test Plan: Run the failing valgrind test (Harness.RandomizedLongDB) make all check Reviewers: igor, haobo, ljin Reviewed By: igor CC: yhchiang, leveldb Differential Revision: https://reviews.facebook.net/D17409	2014-04-02 14:38:28 -07:00
Igor Canadi	ddbd1ece88	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_test.cc db/internal_stats.cc db/internal_stats.h db/version_edit.cc db/version_edit.h db/version_set.cc include/rocksdb/options.h util/options.cc	2014-03-31 13:39:24 -07:00
Igor Canadi	577556d5f9	Don't store version number in MANIFEST Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this. Test Plan: make check Reviewers: haobo, ljin Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D17343	2014-03-31 11:33:09 -07:00
Igor Canadi	6a08bc042a	Fix no return warning in FileComparator	2014-03-26 14:46:07 -07:00
Igor Canadi	1e9621d4e5	Sort files correctly in Builder::SaveTo Summary: Previously, we used to sort all files by BySmallestFirst comparator and then re-sort level0 files in the Finalize() (recently moved to end of SaveTo). In this diff, I chose the correct comparator at the beginning and sort the files correctly in Builder::SaveTo. I also added a verification that all files are sorted correctly in CheckConsistency() NOTE: This diff depends on D17037 Test Plan: make check. Will also run db_stress Reviewers: dhruba, haobo, sdong, ljin Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D17049	2014-03-26 13:30:14 -07:00
Igor Canadi	ad9a39c9b4	[RocksDB] Preallocate new MANIFEST files Summary: We don't preallocate MANIFEST file, even though we have an option for that. This diff preallocates manifest file every time we create it Test Plan: make check Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17163	2014-03-26 09:37:53 -07:00
sdong	6b2e7a2a01	When Options.max_num_files=-1, non level0 files also by pass table cache Summary: This is the part that was not finished when doing the Options.max_num_files=-1 feature. For iterating non level0 SST files (which was done using two level iterator), table cache is not bypassed. With this patch, the leftover feature is done. Test Plan: make all check; change Options.max_num_files=-1 in one of the tests to cover the codes. Reviewers: haobo, igor, dhruba, ljin, yhchiang Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17001	2014-03-25 18:40:52 -07:00
Igor Canadi	e8168382c4	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc include/rocksdb/options.h util/options.cc	2014-03-25 11:09:40 -07:00
Yueh-Hsuan Chiang	cda4006e87	Enhance partial merge to support multiple arguments Summary: * PartialMerge api now takes a list of operands instead of two operands. * Add min_pertial_merge_operands to Options, indicating the minimum number of operands to trigger partial merge. * This diff is based on Schalk's previous diff (D14601), but it also includes necessary changes such as updating the pure C api for partial merge. Test Plan: * make check all * develop tests for cases where partial merge takes more than two operands. TODOs (from Schalk): * Add test with min_partial_merge_operands > 2. * Perform benchmarks to measure the performance improvements (can probably use results of task #2837810.) * Add description of problem to doc/index.html. * Change wiki pages to reflect the interface changes. Reviewers: haobo, igor, vamsi Reviewed By: haobo CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D16815	2014-03-24 17:57:13 -07:00
Igor Canadi	e20fa3f8a4	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/internal_stats.cc db/internal_stats.h db/version_set.cc	2014-03-19 17:22:20 -07:00
Igor Canadi	fcd5c5e828	ComputeCompactionScore in CompactionPicker Summary: As it turns out, we need the call to ComputeCompactionScore (previously: Finalize) in CompactionPicker. The issue caused a deadlock in db_stress: http://ci-builds.fb.com/job/rocksdb_crashtest/290/console The last two lines before a deadlock were: 2014/03/18-22:43:41.481029 7facafbee700 (Original Log Time 2014/03/18-22:43:41.480989) Compaction nothing to do 2014/03/18-22:43:41.481041 7faccf7fc700 wait for fewer level0 files... "Compaction nothing to do" and other thread waiting for fewer level0 files. Hm hm. I moved the pre-sorting to SaveTo, which should fix both the original and the new issue. Test Plan: make check for now, will run db_stress in jenkins Reviewers: dhruba, haobo, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17037	2014-03-19 16:52:26 -07:00
Lei Jin	6dc940d4c9	avoid shared_ptr assignment in Version::Get() Summary: This is a 500ns operation while the whole Get() call takes only a few micro! Test Plan: ran db_bench, for a DB with 50M keys, QPS jumps from 5.2M/s to 7.2M/s Reviewers: haobo, igor, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17007	2014-03-19 10:54:32 -07:00
Igor Canadi	69aa6ecb26	Finalize fist version in column family	2014-03-18 14:23:47 -07:00
Igor Canadi	e25819a185	Merge branch 'master' into columnfamilies Conflicts: db/version_set.cc	2014-03-18 14:00:20 -07:00
Igor Canadi	758fa8c359	Don't Finalize in CompactionPicker Summary: Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path. Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those. Short-term, I removed Finalize() from CompactionPicker. Note: I believe this is an issue in current 2.7 version running in production. Test Plan: make check Will also run db_stress to see if issue is gone Reviewers: sdong, ljin, dhruba, haobo Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16983	2014-03-18 13:59:59 -07:00
Igor Canadi	3055a15b29	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/version_edit.cc db/version_edit.h db/version_set.cc	2014-03-18 13:24:27 -07:00
Lei Jin	63cef90078	disable the log_number check in Recover() Summary: There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed. This check would fail them. Change it to log instead of returning a Corruption status. Test Plan: make Reviewers: haobo, igor Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D16923	2014-03-18 12:46:29 -07:00
Igor Canadi	bcea9c1296	Finalize version in dumpmanifest	2014-03-18 09:45:52 -07:00
Igor Canadi	f26cb0f093	Optimize fallocation Summary: Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads. This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x. At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported. Test Plan: make check Reviewers: dhruba, sdong, haobo, ljin Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D16761	2014-03-17 21:52:14 -07:00
Igor Canadi	ae25742af9	Fix race condition in manifest roll Summary: When the manifest is getting rolled the following happens: 1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current) 2) mutex is unlocked 3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp 4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT 5) mutex is locked If FindObsoleteFiles happens between (3) and (4) it will: 1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_) 2) Delete old manifest (because the manifest_file_number_ already points to a new one) I introduce the concept of prev_manifest_file_number_ that will avoid the race condition. However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it? Test Plan: make check Reviewers: ljin, dhruba, haobo, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16929	2014-03-17 21:50:15 -07:00
Lei Jin	0cf6c8f7ce	fix: use the correct edit when comparing log_number Summary: In the last fix, I forgot to point to the writer when comparing edit, which is apparently not correct. Test Plan: still running make whitebox_crash_test Reviewers: igor, haobo, igor2 Reviewed By: igor2 CC: leveldb Differential Revision: https://reviews.facebook.net/D16911	2014-03-15 23:30:43 -07:00
Lei Jin	453ec52ca1	journal log_number correctly in MANIFEST Summary: Here is what it can cause probelm: There is one memtable flush and one compaction. Both call LogAndApply(). If both edits are applied in the same batch with flush edit first and the compaction edit followed. LogAndApplyHelper() will assign compaction edit current VersionSet's log number(which should be smaller than the log number from flush edit). It cause log_numbers in MANIFEST to be not monotonic increasing, which violates the assume Recover() makes. What is more is after comitting to MANIFEST file, log_number_ in VersionSet is updated to the log_number from the last edit, which is the compaction one. It ends up not updating the log_number. Test Plan: make whitebox_crash_test got another assertion about iter->valid(), not sure if that is related to this. Reviewers: igor, haobo Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D16875	2014-03-14 18:36:47 -07:00
Igor Canadi	a782bb989e	Fix log_number in LogAndApply	2014-03-14 13:45:30 -07:00
Igor Canadi	f0e1e3ebf1	CF cleanup part 2	2014-03-13 14:34:01 -07:00
Igor Canadi	b5d6ad69fc	Bug fixes introduced by code cleanup	2014-03-12 11:10:26 -07:00
Igor Canadi	fb2346fc1f	[CF] Code cleanup part 1 Summary: I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup. Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch. This diff also fixes race condition when dropping column family. Test Plan: Ran db_stress with variety of parameters Reviewers: dhruba, haobo Differential Revision: https://reviews.facebook.net/D16833	2014-03-12 09:56:53 -07:00
Igor Canadi	dad8603fc4	[CF] Fix column family dropping Summary: Column family should be dropped after the change has been commited Test Plan: db stress Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16779	2014-03-11 09:47:29 -07:00
Igor Canadi	9634ba42ac	Merge branch 'master' into columnfamilies Conflicts: db/compaction_picker.cc db/db_impl.cc db/db_impl.h db/tailing_iter.cc db/version_set.h include/rocksdb/options.h util/options.cc	2014-03-10 17:26:09 -07:00
Igor Canadi	04d2c26e17	Add option verify_checksums_in_compaction Summary: If verify_checksums_in_compaction is true, compaction will verify checksums. This is default. If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads. Test Plan: corruption_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D16695	2014-03-10 10:06:34 -07:00
sdong	ecb1ffa2a8	Buffer info logs when picking compactions and write them out after releasing the mutex Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex. Test Plan: make all check check the log lines while running some tests that trigger compactions. Reviewers: haobo, igor, dhruba Reviewed By: dhruba CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg- Differential Revision: https://reviews.facebook.net/D16515	2014-03-05 15:36:32 -08:00
Igor Canadi	9625acbf70	[CF] Dont reuse dropped column family IDs Summary: Column family IDs should be unique, even if column family is dropped. To achieve this, we save max column family in manifest. Note that the diff is still not ready. I'm only using differential to move the patch to my Mac machine. Test Plan: added a test to column_family_test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16581	2014-03-05 12:13:44 -08:00
Igor Canadi	9d0577a6be	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/transaction_log_impl.cc db/transaction_log_impl.h include/rocksdb/options.h util/env.cc util/options.cc	2014-03-03 18:29:03 -08:00
Igor Canadi	5142b37000	Fix a group commit bug in LogAndApply Summary: EncodeTo(&record) does not overwrite, it appends to it. This means that group commit log and apply will look something like: record1 record1record2 record1record2record3 I'm surprised this didn't show up in production, but I think the reason is that MANIFEST group commit almost never happens. This bug turned up in column family work, where opening a database failed with "adding a same column family twice". Test Plan: Tested the change in column family branch and observed that the problem is gone (with db_stress) Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D16461	2014-03-03 17:10:43 -08:00
kailiu	bf86af5174	Remove the terrible hack in for flush_block_policy_factory Summary: Previous code is too convoluted and I must be drunk for letting such code to be written without a second thought. Thanks to the discussion with @sdong, I added the `Options` when generating the flusher, thus avoiding the tricks. Just FYI: I resisted to add Options in flush_block_policy.h since I wanted to avoid cyclic dependencies: FlushBlockPolicy dpends on Options and Options also depends FlushBlockPolicy... While I appreciate my effort to prevent it, the old design turns out creating more troubles than it tried to avoid. Test Plan: ran ./table_test Reviewers: sdong Reviewed By: sdong CC: sdong, leveldb Differential Revision: https://reviews.facebook.net/D16503	2014-02-28 16:39:27 -08:00
Igor Canadi	8ea21a778b	[CF] Rething LogAndApply for column families Summary: I though I might get away with as little changes to LogAndApply() as possible. It turns out this is not the case. This diff introduces different behavior of LogAndApply() for three cases: 1. column family add 2. column family drop 3. no-column family manipulation (1) and (2) don't support group commit yet. There were a lot of problems with old version od LogAndApply, detected by db_stress. The biggest was non-atomicity of manifest writes and metadata changes (i.e. if column family add is in manifest, it also has to be in in-memory data structure). Test Plan: db_stress Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16491	2014-02-28 14:46:48 -08:00
Igor Canadi	58ca641d53	Make Log::Reader more robust Summary: This diff does two things: (1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8# (2) Turn off mmap writes for all writes to log and manifest files (2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing. Test Plan: Added unit tests from LevelDB Actually recovered a "corrupted" MANIFEST file. Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16119	2014-02-28 13:19:47 -08:00
Igor Canadi	12966ec1bb	Fix LogAndApply() group commit	2014-02-28 12:22:45 -08:00
Igor Canadi	f6a257b6a1	Set dropped column family before persisting in the manifest	2014-02-28 11:49:32 -08:00
Igor Canadi	670f3ba212	[CF] Small refactor of Recover() and DumpManifest()	2014-02-28 11:25:38 -08:00
Igor Canadi	099ad94306	Set log number for column family	2014-02-28 11:08:24 -08:00
Igor Canadi	510f84b686	[CF] CreateColumnFamily fix Summary: This fixes few bugs with CreateColumnFamily * We first have to LogAndApply and then call VersionSet::CreateColumnFamily. Otherwise, WriteSnapshot might be invoked, writing out column family add inside of LogAndApply, even though it's not really committed * Fix LogAndApplyHelper() to not apply log number to column_family_data, which is in case of column family add, just a dummy (default) column family * Create SuperVerion when creating column family Test Plan: column_family_test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16443	2014-02-28 10:40:52 -08:00
Igor Canadi	492c9f71c6	[CF] Column family support for LDB tool Summary: Added list_column_family command and also updated dump_manifest Test Plan: no Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16419	2014-02-27 16:39:23 -08:00
Igor Canadi	6e7cae7711	[CF] More tests Summary: New unit tests for column families Test Plan: this is a test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16359	2014-02-26 14:16:23 -08:00
Igor Canadi	9bce2b2a84	[CF] Fix lint errors in CF code Summary: Big CF diff uncovered some lint errors. This diff fixes some of them. Not much to see here Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16347	2014-02-26 10:10:00 -08:00
Igor Canadi	422bb09cb0	Fix table properties Summary: Adapt table properties to column family world Test Plan: make check Reviewers: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D16161	2014-02-14 17:13:10 -08:00
Igor Canadi	76c048183c	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_test.cc include/rocksdb/db.h	2014-02-14 16:46:03 -08:00
kailiu	63690625cd	Expose the table properties to application Summary: Provide a public API for users to access the table properties for each SSTable. Test Plan: Added a unit tests to test the function correctness under differnet conditions. Reviewers: haobo, dhruba, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16083	2014-02-13 16:28:21 -08:00
Igor Canadi	ccdb93e775	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/memtable_list.cc db/memtable_list.h db/version_set.cc db/version_set.h	2014-02-12 14:01:30 -08:00
Igor Canadi	b06840aa7d	[CF] Rethinking ColumnFamilyHandle and fix to dropping column families Summary: The change to the public behavior: * When opening a DB or creating new column family client gets a ColumnFamilyHandle. * As long as column family handle is alive, client can do whatever he wants with it, even drop it * Dropped column family can still be read from (using the column family handle) * Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB * As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls) Internally: * Ref-counting ColumnFamilyData * New thread-safety for ColumnFamilySet * Dropped column families are now completely dropped and their memory cleaned-up Test Plan: added some tests to column_family_test Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16101	2014-02-12 13:47:09 -08:00
Lei Jin	5fbf2ef42d	preload table handle on Recover() when max_open_files == -1 Summary: This covers existing table files before DB open happens and avoids contention on table cache Test Plan: db_test Reviewers: haobo, sdong, igor, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16089	2014-02-12 10:43:27 -08:00
Igor Canadi	0143abdbb0	Merge branch 'master' into columnfamilies Conflicts: HISTORY.md db/db_impl.cc db/db_impl.h db/db_iter.cc db/db_test.cc db/dbformat.h db/memtable.cc db/memtable_list.cc db/memtable_list.h db/table_cache.cc db/table_cache.h db/version_edit.h db/version_set.cc db/version_set.h db/write_batch.cc db/write_batch_test.cc include/rocksdb/options.h util/options.cc	2014-02-06 15:58:20 -08:00
kailiu	84f8185fc0	Merge branch 'master' into performance Conflicts: HISTORY.md db/db_impl.cc db/memtable.cc	2014-02-05 21:21:00 -08:00
Igor Canadi	f276e0e59d	[CF] Options -> DBOptions Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15933	2014-02-05 14:56:09 -08:00
Igor Canadi	c24d8c4e90	[CF] Rethink table cache Summary: Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory. To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects. This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15915	2014-02-05 11:55:30 -08:00
Igor Canadi	7b9f134959	[CF] Move InternalStats to ColumnFamilyData Summary: InternalStats is a messy thing, keeping both DB data and column family data. However, it's better off living in ColumnFamilyData than in DBImpl. For now, at least. Test Plan: make check Reviewers: dhruba, kailiu, haobo, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15879	2014-02-04 17:59:50 -08:00
Igor Canadi	73f62255c1	[CF] Split SanitizeOptions into two Summary: There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two) I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example. Test Plan: make check doesn't complain Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15873	2014-02-04 17:26:51 -08:00
Igor Canadi	29bacb2eb6	VersionSet cleanup Summary: Removed icmp_ from VersionSet (since it's per-column-family, not per-DB-instance) Unfriended VersionSet and ColumnFamilyData (yay!) Removed VersionSet::NumberLevels() Cleaned up DBImpl Test Plan: make check Reviewers: dhruba, haobo, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15819	2014-02-03 13:10:47 -08:00
Siying Dong	d169b67680	[Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte. Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases. Test Plan: make all check Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D15489	2014-02-03 12:19:30 -08:00
kailiu	4f6cb17bdb	First phase API clean up Summary: Addressed all the issues in https://reviews.facebook.net/D15447. Now most table-related modules are hidden from user land. Test Plan: make check Reviewers: sdong, haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15525	2014-02-03 00:30:43 -08:00
Igor Canadi	27a8856c23	Compacting column families Summary: This diff enables non-default column families to get compacted both automatically and also by calling CompactRange() Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15813	2014-01-31 19:54:03 -08:00
Igor Canadi	5661ed8b80	Fix reduce_levels_test	2014-01-31 19:44:48 -08:00
Igor Canadi	f7489123e2	Move compaction picker and internal key comparator to ColumnFamilyData Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15801	2014-01-31 16:06:55 -08:00
Igor Canadi	5693db2a02	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc	2014-01-31 14:45:15 -08:00
Igor Canadi	3615f534d1	Enable flushing memtables from arbitrary column families Summary: Removed default_cfd_ from all flush code paths. This means we can now flush memtables from arbitrary column families! Test Plan: Added a new unit test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15789	2014-01-31 14:42:52 -08:00
Dhruba Borthakur	abd70ecc2b	The default settings enable checksum verification on every read. Summary: The default settings enable checksum verification on every read. Test Plan: make check Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15591	2014-01-30 19:14:03 -08:00
Igor Canadi	6973bb1722	MakeRoomForWrite() support for column families Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one. Test Plan: make check does not complain Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15597	2014-01-30 16:12:08 -08:00
Igor Canadi	ac92420fc5	Merge branch 'master' into performance Conflicts: db/db_impl.h	2014-01-30 10:09:23 -08:00
Igor Canadi	514e42c7cc	Fix some lint warnings	2014-01-29 15:27:27 -08:00
Igor Canadi	fa99d53e55	Change ColumnFamilyData from struct to class Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15579	2014-01-29 15:18:36 -08:00
Igor Canadi	f24a3ee52d	Read from and write to different column families Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber() Test Plan: Added a unit test in column_family_test Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15537	2014-01-29 11:38:16 -08:00
Igor Canadi	c1071ed95c	Merge branch 'master' into columnfamilies	2014-01-28 16:04:00 -08:00
Igor Canadi	5d2c62822e	Only get the manifest file size if there is no error Summary: I came across this while working on column families. CorruptionTest::RecoverWriteError threw a SIGSEG because the descriptor_log_->file() was nullptr. I'm not sure why it doesn't happen in master, but better safe than sorry. @kailiu, can we get this in release, too? Test Plan: make check Reviewers: kailiu, dhruba, haobo Reviewed By: haobo CC: leveldb, kailiu Differential Revision: https://reviews.facebook.net/D15513	2014-01-28 16:02:51 -08:00
kailiu	a5e220f5ef	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_test.cc db/memtable_list.cc db/memtable_list.h table/block_based_table_reader.cc table/table_test.cc util/cache.cc util/coding.cc	2014-01-28 10:35:55 -08:00
Igor Canadi	4bf25357ae	[column families] Removing VersionSet::current() Summary: Instead of VersionSet::current(), DBImpl uses default_cfd_->current directly. Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15483	2014-01-27 17:04:46 -08:00
Igor Canadi	511b03a5b5	LogAndApply to take ColumnFamilyData Summary: This removes the default implementation of LogAndApply that applied the changed to the default column family by default. It is mostly simple reformatting. Test Plan: make check Reviewers: dhruba, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15465	2014-01-27 13:57:58 -08:00
Igor Canadi	eb055609e4	[column families] Move memtable and immutable memtable list to column family data Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family. Test Plan: make check Reviewers: dhruba, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15459	2014-01-27 13:37:14 -08:00
Igor Canadi	ae16606f98	Merge branch 'master' into columnfamilies Conflicts: db/version_set.cc db/version_set.h	2014-01-27 11:11:51 -08:00
Igor Canadi	832158e7f7	Fsync directory after we create a new file Summary: @dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder. Should I also add FsyncDir() to new files that get created by a compaction? Test Plan: Confirmed that FsyncDir is returning Status::OK() Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D14751	2014-01-27 11:02:21 -08:00
Igor Canadi	cf783c674a	Merge branch 'master' into columnfamilies Conflicts: db/version_set.h	2014-01-27 10:00:24 -08:00
Igor Canadi	6c2ca1d3e6	Move NeedsCompaction() from VersionSet to Version Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet. Test Plan: make check Reviewers: kailiu, haobo, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15333	2014-01-27 09:59:00 -08:00
Igor Canadi	5356b2a680	Merge branch 'master' into columnfamilies	2014-01-24 18:34:48 -08:00
Igor Canadi	04afa32134	Fix reduce levels ReduceNumberOfLevels had segmentation fault in WriteSnapshot() since we didn't change the number of levels in VersionSet (we consider them immutable from now on). This fixes the problem.	2014-01-24 18:32:38 -08:00
Igor Canadi	1423e7c9de	Merge branch 'master' into columnfamilies Conflicts: db/version_set.cc db/version_set_reduce_num_levels.cc util/ldb_cmd.cc	2014-01-24 15:03:54 -08:00
Igor Canadi	677fee27c6	Make VersionSet::ReduceNumberOfLevels() static Summary: A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method. Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file. Test Plan: reduce_levels_test Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15303	2014-01-24 14:57:04 -08:00
Kai Liu	054c5dda8c	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_test.cc db/memtable.cc db/version_set.cc include/rocksdb/statistics.h util/statistics_imp.h	2014-01-23 16:32:49 -08:00
Igor Canadi	7c5e583a27	ColumnFamilySet Summary: I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner. Let me know if you have any comments. I will commit tomorrow. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15357	2014-01-23 14:03:38 -08:00
Igor Canadi	92a022ad07	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/db_impl_readonly.cc db/version_set.cc	2014-01-22 10:59:07 -08:00
Igor Canadi	fb01755aa4	Unfriending classes Summary: In this diff I made some effort to reduce usage of friending. To do that, I had to expose Compaction::inputs_ through a method inputs(). Not sure if this is a good idea, there is a trade-off. I think it's less confusing than having lots of friends. I also thought about other friendship relationships, but they are too much tangled at this point. Once you friend two classes, it's very hard to unfriend them :) Test Plan: make check Reviewers: haobo, kailiu, sdong, dhruba Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15267	2014-01-22 10:55:16 -08:00
Igor Canadi	23f6791c9e	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl_readonly.cc db/db_test.cc db/version_edit.cc db/version_edit.h db/version_set.cc db/version_set.h db/version_set_reduce_num_levels.cc	2014-01-21 17:01:52 -08:00
Kai Liu	d4f65f1683	Merge branch 'master' into performance This patch merges master's changes on build_tools/format-diff.sh. Conflicts: db/version_edit.cc	2014-01-16 14:31:18 -08:00
Igor Canadi	6d6fb70960	Remove compaction pointers Summary: The only thing we do with compaction pointers is set them to some values, we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15225	2014-01-16 14:06:53 -08:00
Igor Canadi	c699c84af4	CompactionPicker Summary: This is a big one. This diff moves all the code related to picking compactions from VersionSet to new class CompactionPicker. Column families' compactions will be completely separate processes, so we need to have multiple CompactionPickers. To make this easier to review, most of the code change is just copy/paste. There is also a small change not to use VersionSet::current_, but rather to take `Version* version` as a parameter. Most of the other code is exactly the same. In future diffs, I will also make some improvements to CompactionPickers. I think the most important part will be encapsulating it better. Currently Version, VersionSet, Compaction and CompactionPicker are all friend classes, which makes it harder to change the implementation. This diff depends on D15171, D15183, D15189 and D15201 Test Plan: `make check` Reviewers: kailiu, sdong, dhruba, haobo Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15207	2014-01-16 13:03:52 -08:00
kailiu	1304d8c8ce	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_impl.h db/db_test.cc db/memtable.cc db/memtable.h db/version_edit.h db/version_set.cc include/rocksdb/options.h util/hash_skiplist_rep.cc util/options.cc	2014-01-15 23:12:31 -08:00
kailiu	eae1804f29	Remove the unnecessary use of shared_ptr Summary: shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers). In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr. According to igor's previous work, we are likely to make quite some performance gain from this diff. Test Plan: make check Reviewers: dhruba, igor, sdong, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15213	2014-01-15 18:22:01 -08:00
Igor Canadi	787f11bb3b	Move more functions from VersionSet to Version Summary: This moves functions: * VersionSet::Finalize() -> Version::UpdateCompactionStats() * VersionSet::UpdateFilesBySize() -> Version::UpdateFilesBySize() The diff depends on D15189, D15183 and D15171 Test Plan: make check Reviewers: kailiu, sdong, haobo, dhruba Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15201	2014-01-15 16:23:36 -08:00
Igor Canadi	615d1ea2f4	Moving Compaction class to separate header file Summary: I'm sure we'll all agree that version_set.cc needs simplifying. This diff moves Compaction class to a separate file. The diff depends on D15171 and D15183 Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15189	2014-01-15 16:22:34 -08:00
Igor Canadi	2f4eda7890	Move functions from VersionSet to Version Summary: There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier. The functions moved are: * NumLevelBytes * LevelSummary * LevelFileSummary * MaxNextLevelOverlappingBytes * AddLiveFiles (previously AddLiveFilesCurrentVersion()) * NeedSlowdownForNumLevel0Files The diff continues on (and depends on) D15171 Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong, emayanke Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15183	2014-01-15 16:18:04 -08:00
Igor Canadi	65a8a52b54	Decrease reliance on VersionSet::NumberLevels() Summary: With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels() This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet. I have also slightly changed how VersionSet keeps track of manifest size. This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor. In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_ Test Plan: make check Reviewers: haobo, dhruba, kailiu, sdong Reviewed By: sdong Differential Revision: https://reviews.facebook.net/D15171	2014-01-15 16:15:43 -08:00
Igor Canadi	d9cd7a063f	Fix CompactRange to apply filter to every key Summary: When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`. This patch fixed the unit test. Test Plan: Added a failing unit test and a fix, so it's not failing anymore. Reviewers: dhruba, haobo, sdong Reviewed By: haobo CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D14421	2014-01-14 16:19:09 -08:00
Igor Canadi	055e6df45b	VersionEdit not to take NumLevels() Summary: I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible. Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them. This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels. Test Plan: make check Reviewers: dhruba, haobo, sdong, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15159	2014-01-14 15:27:09 -08:00
Siying Dong	fbbf0d1456	Pre-calculate whether to slow down for too many level 0 files Summary: Currently in DBImpl::MakeRoomForWrite(), we do "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time. Test Plan: make all check Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files. Reviewers: haobo, kailiu, igor Reviewed By: kailiu CC: nkg-, leveldb Differential Revision: https://reviews.facebook.net/D15141	2014-01-14 11:23:02 -08:00
Igor Canadi	a107691711	[column families] Implement refcounting ColumnFamilyData Summary: We don't want to delete ColumnFamilyData object if somebody has references to it. Test Plan: `make check` for now, but will need to implement bigger column family test case Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15111	2014-01-13 09:22:44 -08:00
Igor Canadi	d076cef347	[column families] Get rid of VersionSet::current_ and keep current Version for each column family Summary: The biggest change here is getting rid of current_ Version and adding a column_family_data->current Version to each column family. I have also fixed some smaller things in VersionSet that made it easier to implement Column family support. Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15105	2014-01-13 08:59:15 -08:00
Siying Dong	aa0ef6602d	[Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator() Summary: In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator(). We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency. Test Plan: add a test case in db_test Reviewers: haobo, kailiu Reviewed By: haobo CC: dhruba, igor, leveldb Differential Revision: https://reviews.facebook.net/D15039	2014-01-10 15:57:49 -08:00
Igor Canadi	19e3ee64ac	Add column family information to WAL Summary: I have added three new value types: * kTypeColumnFamilyDeletion * kTypeColumnFamilyValue * kTypeColumnFamilyMerge which include column family Varint32 before the data (value, deletion and merge). These values are used only in WAL (not in memtables yet). This endeavour required changing some WriteBatch internals. Test Plan: Added a unittest Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15045	2014-01-08 12:53:33 -08:00
Igor Canadi	72918efffe	[column families] Implement DB::OpenWithColumnFamilies() Summary: In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes: * Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings * Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&) * Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily() I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed. Test Plan: Added a simple unit test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15033	2014-01-07 11:05:50 -08:00
Igor Canadi	ef6ad1708d	[column families] Support to create and drop column families Summary: This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733 It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly. Test Plan: Added unit test Reviewers: dhruba, haobo, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15021	2014-01-03 01:12:16 -08:00
kailiu	476416c27c	Some minor refactoring on the code Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features. Test Plan: ran `make check` Reviewers: haobo, dhruba, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15009	2014-01-02 16:32:31 -08:00
Igor Canadi	7535443083	[RocksDB] Support for column families in manifest Summary: <This diff is for Column Family branch> Added fields in manifest file to support adding and deleting column families. Pretty simple change, each version edit record can be: 1. add column family 2. drop column family 3. add and delete N files from a single column family (compactions and flushes will generate such records) Test Plan: make check works, the code is backward compatible Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14733	2014-01-02 04:18:28 -08:00
kailiu	f1cec73a76	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_test.cc db/memtable.cc db/version_set.cc include/rocksdb/statistics.h	2013-12-27 12:23:17 -08:00
Siying Dong	18df47b79a	Avoid malloc in NotFound key status if no message is given. Summary: In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case. The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary. Test Plan: make all check Reviewers: dhruba, haobo, igor Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14691	2013-12-26 16:23:10 -08:00
Siying Dong	14995a8ff3	Move level0 sorting logic from Version::SaveTo() to Version::Finalize() Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex. Test Plan: make all check Reviewers: dhruba, haobo, kailiu Reviewed By: dhruba CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D14607	2013-12-17 18:06:58 -08:00
Siying Dong	aaf9c6203c	[RocksDB][Performance Branch]Iterator Cleanup method only tries to find obsolete files if it has the last reference to a version Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version. Test Plan: make all check Reviewers: haobo, igor, dhruba, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14595	2013-12-11 13:59:43 -08:00
Siying Dong	a8029fdc75	Introduce MergeContext to Lazily Initialize merge operand list Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases Test Plan: make all check Reviewers: haobo, kailiu, dhruba Reviewed By: haobo CC: igor, nkg-, leveldb Differential Revision: https://reviews.facebook.net/D14415 Conflicts: db/db_impl.cc db/memtable.cc	2013-12-11 11:37:28 -08:00
Siying Dong	bc5dd19b14	[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo() Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get(). Test Plan: make all check Reviewers: haobo, kailiu, dhruba Reviewed By: dhruba CC: nkg-, igor, leveldb Differential Revision: https://reviews.facebook.net/D14409	2013-12-11 10:50:09 -08:00
Siying Dong	41349d9ef1	[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo() Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get(). Test Plan: make all check Reviewers: haobo, kailiu, dhruba Reviewed By: dhruba CC: nkg-, igor, leveldb Differential Revision: https://reviews.facebook.net/D14409	2013-12-11 10:49:49 -08:00
Siying Dong	ef2211a9ca	[RocksDB Performance Branch] Introduce MergeContext to Lazily Initialize merge operand list Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases Test Plan: make all check Reviewers: haobo, kailiu, dhruba Reviewed By: haobo CC: igor, nkg-, leveldb Differential Revision: https://reviews.facebook.net/D14415	2013-12-06 10:28:59 -08:00
Kai Liu	1966b63137	Merge branch 'master' into perf	2013-11-27 11:47:40 -08:00
Dhruba Borthakur	8478f380a0	During benchmarking, I see excessive use of vector.reserve(). Summary: This code path can potentially accumulate multiple important_files for level 0. But for other levels, it should have only one file in the important_files, so it is ok not to reserve excessive space, is it not? Test Plan: make check Reviewers: haobo Reviewed By: haobo CC: reconnect.grayhat, leveldb Differential Revision: https://reviews.facebook.net/D14349	2013-11-26 07:47:08 -08:00
Haobo Xu	5b825d6964	[RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient. Test Plan: make check Reviewers: dhruba, MarkCallaghan, igor Reviewed By: dhruba CC: leveldb, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14289	2013-11-25 10:38:15 -08:00
Siying Dong	3e35aa6412	Revert "Allow users to profile a query and see bottleneck of the query" This reverts commit `3d8ac31d71`.	2013-11-21 17:40:39 -08:00
Siying Dong	b135d01e7b	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001 Conflicts: table/merger.cc	2013-11-21 17:39:19 -08:00
Siying Dong	3d8ac31d71	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001	2013-11-21 16:29:57 -08:00
Kai Liu	35460ccb53	Fix the string format issue Summary: mac and our dev server has totally differnt definition of uint64_t, therefore fixing the warning in mac has actually made code in linux uncompileable. Test Plan: make clean && make -j32	2013-11-12 21:05:39 -08:00
kailiu	21587760b9	Fixing the warning messages captured under mac os # Consider using `git commit -m 'One line title' && arc diff`. # You will save time by running lint and unit in the background. Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os.. Test Plan: ran make in mac os Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14049	2013-11-12 20:05:28 -08:00
Igor Canadi	9bc4a26f56	Small changes in Deleting obsolete files Summary: @haobo's suggestions from https://reviews.facebook.net/D13827 Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop. I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though. Test Plan: Ran deletefile_test Reviewers: haobo Reviewed By: haobo CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D14025	2013-11-12 11:53:26 -08:00
Igor Canadi	1510339e52	Speed up FindObsoleteFiles Summary: Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file. It might speed things up a bit, but makes code uglier. I don't really like it. Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though. Let's discuss offline today. Test Plan: Ran ./db_stress, verified that files are getting deleted Reviewers: dhruba, haobo, kailiu, emayanke Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D13827	2013-11-08 15:23:46 -08:00
Igor Canadi	444cf88a56	Flush the log outside of lock Summary: Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held. We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem. Test Plan: db_test Reviewers: dhruba, haobo, kailiu, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13935	2013-11-07 11:31:56 -08:00
Igor Canadi	beeb74be6f	Move I/O outside of lock Summary: I'm figuring out how Version[Set, Edit, ] classes work and I stumbled on this. It doesn't seem that the comment is accurate anymore. What I read is when the manifest grows too big, create a new file (and not only when we call LogAndApply for the first time). Test Plan: make check (currently running) Reviewers: dhruba, haobo, kailiu, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13839	2013-11-01 12:32:27 -07:00
Haobo Xu	8cbe5bb56b	[RocksDB] Add OnCompactionStart to CompactionFilter class Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently. Test Plan: make check Reviewers: dhruba, kailiu, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13683	2013-10-31 13:36:43 -07:00
Siying Dong	f03b2df010	Follow-up Cleaning-up After D13521 Summary: This patch is to address @haobo's comments on D13521: 1. rename Table to be TableReader and make its factory function to be GetTableReader 2. move the compression type selection logic out of TableBuilder but to compaction logic 3. more accurate comments 4. Move stat name constants into BlockBasedTable implementation. 5. remove some uncleaned codes in simple_table_db_test Test Plan: pass test suites. Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13785	2013-10-30 10:52:33 -07:00
Siying Dong	d4eec30ed0	Make "Table" pluggable Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder. Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented. Reviewers: dhruba, haobo, kailiu, emayanke, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13521	2013-10-28 17:54:09 -07:00
Siying Dong	9edda37027	Universal Compaction to Have a Size Percentage Threshold To Decide Whether to Compress Summary: This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases. Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later. Test Plan: add test DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12 Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13467	2013-10-17 13:33:39 -07:00
Dhruba Borthakur	9cd221094c	Add appropriate LICENSE and Copyright message. Summary: Add appropriate LICENSE and Copyright message. Test Plan: make check Reviewers: CC: Task ID: # Blame Rev:	2013-10-16 17:48:41 -07:00
Siying Dong	073cbfc8f0	Enable background flush thread by default and fix issues related to it Summary: Enable background flush thread in this patch and fix unit tests with: (1) After background flush, schedule a background compaction if condition satisfied; (2) Fix a bug that if universal compaction is enabled and number of levels are set to be 0, compaction will not be automatically triggered (3) Fix unit tests to wait for compaction to finish instead of flush, before checking the compaction results. Test Plan: pass all unit tests Reviewers: haobo, xjin, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13461	2013-10-16 13:32:53 -07:00
Mayank Agarwal	da2fd001a6	Fix rocksdb->levledb BytewiseComparator and inverted order of error in db/version_set.cc Summary: This is needed to make existing dbs be able to open and also because BytewiseComparator was not changed since leveldb. The inverted order in the error message caused confusion prebiously Test Plan: make; open existing db Reviewers: leveldb, dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D13449	2013-10-14 18:16:54 -07:00
Dhruba Borthakur	a143ef9b38	Change namespace from leveldb to rocksdb Summary: Change namespace from leveldb to rocksdb. This allows a single application to link in open-source leveldb code as well as rocksdb code into the same process. Test Plan: compile rocksdb Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13287	2013-10-04 11:59:26 -07:00
Haobo Xu	1d8c57db23	[RocksDB] Universal compaction trigger condition minor fix Summary: Currently, when total number of files reaches level0_file_num_compaction_trigger, universal compaction will schedule a compaction job, but the job will not honor the compaction until the total number of files is level0_file_num_compaction_trigger+1. Fixed the condition for consistent behavior (start compaction on reaching level0_file_num_compaction_trigger). Test Plan: make check; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12945	2013-09-15 22:35:59 -07:00
Dhruba Borthakur	4012ca1c7b	Added a parameter to limit the maximum space amplification for universal compaction. Summary: Added a new field called max_size_amplification_ratio in the CompactionOptionsUniversal structure. This determines the maximum percentage overhead of space amplification. The size amplification is defined to be the ratio between the size of the oldest file to the sum of the sizes of all other files. If the size amplification exceeds the specified value, then min_merge_width and max_merge_width are ignored and a full compaction of all files is done. A value of 10 means that the size a database that stores 100 bytes of user data could occupy 110 bytes of physical storage. Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added. Reviewers: haobo, emayanke, xjin Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12825	2013-09-13 16:27:18 -07:00
Mayank Agarwal	ab5c5c28fe	Fix build caused by DeleteFile not tolerating / at the beginning Summary: db->DeleteFile calls ParseFileName to check name that was returned for sst file. Now, sst filename is returned using TableFileName which uses MakeFileName. This puts a / at the front of the name and ParseFileName doesn't like that. Changed ParseFileName to tolerate /s at the beginning. The test delet_file_test used to pass earlier because this behaviour of MakeFileName had been changed a while back to not return a / during which delete_file_test was checked in. But MakeFileName had to be reverted to add / at the front because GetLiveFiles used at many places outside rocksdb used the previous behaviour of MakeFileName. Test Plan: make;./delete_filetest;make all check Reviewers: dhruba, haobo, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12663	2013-09-01 17:59:13 -07:00
Dhruba Borthakur	59de2dbad7	Cleanup DeleteFile API Summary: The DeleteFile API was removing files inside the db-lock. This is now changed to remove files outside the db-lock. The GetLiveFilesMetadata() returns the smallest and largest seqnuence number of each file as well. Test Plan: deletefile_test Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Maniphest Tasks: T63 Differential Revision: https://reviews.facebook.net/D12567	2013-08-28 21:18:58 -07:00
Dhruba Borthakur	fc0c399d2e	Introduced a new flag non_blocking_io in ReadOptions. Summary: If ReadOptions.non_blocking_io is set to true, then KeyMayExists and Iterators will return data that is cached in RAM. If the Iterator needs to do IO from storage to serve the data, then the Iterator.status() will return Status::IsRetry(). Test Plan: Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs issues from storage. Added DBTest.NonBlockingIteration to verify nonblocking Iterations. Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Maniphest Tasks: T63 Differential Revision: https://reviews.facebook.net/D12531	2013-08-28 10:49:14 -07:00
Mayank Agarwal	b1074ac24f	Use initializer list for VersionSet Summary: initialiszer list is fasteri/preferable because it can straightaway call the constructor for this object, otherwise it will be created first and then again initialized. Although gain may not be much in this case because files_ is just a pointer and not a complex object, this is recommended practice. Test Plan: make all check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12519	2013-08-24 18:16:01 -07:00
Tyler Harter	4504c99030	Internal/user key bug fix. Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set. Test Plan: ./filter_block_test Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12501	2013-08-23 14:49:57 -07:00
Dhruba Borthakur	1186192ed1	Replace include/leveldb with include/rocksdb. Summary: Replace include/leveldb with include/rocksdb. Test Plan: make clean; make check make clean; make release Differential Revision: https://reviews.facebook.net/D12489	2013-08-23 10:51:00 -07:00
Tyler Harter	94cf218720	Revert "Prefix scan: db_bench and bug fixes" This reverts commit `c2bd8f4824`.	2013-08-22 18:01:11 -07:00
Tyler Harter	c2bd8f4824	Prefix scan: db_bench and bug fixes Summary: If use_prefix_filters is set and read_range>1, then the random seeks will set a the prefix filter to be the prefix of the key which was randomly selected as the target. Still need to add statistics (perhaps in a separate diff). Test Plan: ./db_bench --benchmarks=fillseq,prefixscanrandom --num=10000000 --statistics=1 --use_prefix_blooms=1 --use_prefix_api=1 --bloom_bits=10 Reviewers: dhruba Reviewed By: dhruba CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D12273	2013-08-22 16:06:50 -07:00
Simha Venkataramaiah	60bf2b7d4a	Add APIs to query SST file metadata and to delete specific SST files Summary: An api to query the level, key ranges, size etc for each SST file and an api to delete a specific file from the db and all associated state in the bookkeeping datastructures. Notes: Editing the manifest version does not release the obsolete files right away. However deleting the file directly will mess up the iterator. We may need a more aggressive/timely file deletion api. I have used std::unique_ptr - will switch to boost:: since this is external. thoughts? Unit test is fragile right now as it expects the compaction at certain levels. Test Plan: unittest Reviewers: dhruba, vamsi, emayanke CC: zshao, leveldb, haobo Task ID: # Blame Rev:	2013-08-22 15:27:19 -07:00
Haobo Xu	f9e2decf7c	[RocksDB] Minor iterator cleanup Summary: Was going through the iterator related code, did some cleanup along the way. Basically replaced array with vector and adopted range based loop where applicable. Test Plan: make check; make valgrind_check Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D12435	2013-08-21 16:54:48 -07:00
Deon Nicholas	b87dcae1a3	Made merge_oprator a shared_ptr; and added TTL unit tests Test Plan: - make all check; - make release; - make stringappend_test; ./stringappend_test Reviewers: haobo, emayanke Reviewed By: haobo CC: leveldb, kailiu Differential Revision: https://reviews.facebook.net/D12381	2013-08-20 13:35:28 -07:00
Deon Nicholas	e1346968d8	Merge operator fixes part 1. Summary: -Added null checks and revisions to DBIter::MergeValuesNewToOld() -Added DBIter test to stringappend_test -Major fix with Merge and TTL More plans for fixes later. Test Plan: -make clean; make stringappend_test -j 32; ./stringappend_test -make all check; Reviewers: haobo, emayanke, vamsi, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12315	2013-08-19 11:42:47 -07:00
Jim Paton	0307c5fe3a	Implement log blobs Summary: This patch adds the ability for the user to add sequences of arbitrary data (blobs) to write batches. These blobs are saved to the log along with everything else in the write batch. You can add multiple blobs per WriteBatch and the ordering of blobs, puts, merges, and deletes are preserved. Blobs are not saves to SST files. RocksDB ignores blobs in every way except for writing them to the log. Before committing this patch, I need to add some test code. But I'm submitting it now so people can comment on the API. Test Plan: make -j32 check Reviewers: dhruba, haobo, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12195	2013-08-14 16:32:46 -07:00
Mayank Agarwal	f1bf169484	Counter for merge failure Summary: With Merge returning bool, it can keep failing silently(eg. While faling to fetch timestamp in TTL). We need to detect this through a rocksdb counter which can get bumped whenever Merge returns false. This will also be super-useful for the mcrocksdb-counter service where Merge may fail. Added a counter NUMBER_MERGE_FAILURES and appropriately updated db/merge_helper.cc I felt that it would be better to directly add counter-bumping in Merge as a default function of MergeOperator class but user should not be aware of this, so this approach seems better to me. Test Plan: make all check Reviewers: dnicholas, haobo, dhruba, vamsi CC: leveldb Differential Revision: https://reviews.facebook.net/D12129	2013-08-13 14:25:42 -07:00
Dhruba Borthakur	03bd4461ad	Merge branch 'performance' of github.com:facebook/rocksdb into performance	2013-08-12 10:31:21 -07:00
Dhruba Borthakur	f3967a5132	Merge remote-tracking branch 'origin' into performance	2013-08-12 09:58:50 -07:00
Dhruba Borthakur	93d77a27d2	Universal Compaction should keep DeleteMarkers unless it is the earliest file. Summary: The pre-existing code was purging a DeleteMarker if thay key did not exist in deeper levels. But in the Universal Compaction Style, all files are in Level0. For compaction runs that did not include the earliest file, we were erroneously purging the DeleteMarkers. The fix is to purge DeleteMarkers only if the compaction includes the earlist file. Test Plan: DBTest.Randomized triggers this code path. Differential Revision: https://reviews.facebook.net/D12081	2013-08-09 14:03:57 -07:00
Haobo Xu	3a3b1c3e6c	[RocksDB] Improve manifest dump to print internal keys in hex for version edits. Summary: Currently, VersionEdit::DebugString always display internal keys in the original ascii format. This could cause manifest dump to be truncated if internal keys contain special charactors (like null). Also added an option --input_key_hex for ldb idump to indicate that the passed in user keys are in hex. Test Plan: run ldb manifest_dump Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D12111	2013-08-08 16:19:01 -07:00
Xing Jin	17b8f786a3	Fix unit tests/bugs for universal compaction (first step) Summary: This is the first step to fix unit tests and bugs for universal compactiion. I added universal compaction option to ChangeOptions(), and fixed all unit tests calling ChangeOptions(). Some of these tests obviously assume more than 1 level and check file number/values in level 1 or above levels. I set kSkipUniversalCompaction for these tests. The major bug I found is manual compaction with universal compaction never stops. I have put a fix for it. I have also set universal compaction as the default compaction and found at least 20+ unit tests failing. I haven't looked into the details. The next step is to check all unit tests without calling ChangeOptions(). Test Plan: make all check Reviewers: dhruba, haobo Differential Revision: https://reviews.facebook.net/D12051	2013-08-07 14:05:44 -07:00
Dhruba Borthakur	f5fa26b6a9	Merge branch 'performance' of github.com:facebook/rocksdb into performance Conflicts: db/builder.cc db/db_impl.cc db/version_set.cc include/leveldb/statistics.h	2013-08-07 11:58:06 -07:00
Deon Nicholas	c2d7826ced	[RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences. Summary: Here are the major changes to the Merge Interface. It has been expanded to handle cases where the MergeOperator is not associative. It does so by stacking up merge operations while scanning through the key history (i.e.: during Get() or Compaction), until a valid Put/Delete/end-of-history is encountered; it then applies all of the merge operations in the correct sequence starting with the base/sentinel value. I have also introduced an "AssociativeMerge" function which allows the user to take advantage of associative merge operations (such as in the case of counters). The implementation will always attempt to merge the operations/operands themselves together when they are encountered, and will resort to the "stacking" method if and only if the "associative-merge" fails. This implementation is conjectured to allow MergeOperator to handle the general case, while still providing the user with the ability to take advantage of certain efficiencies in their own merge-operator / data-structure. NOTE: This is a preliminary diff. This must still go through a lot of review, revision, and testing. Feedback welcome! Test Plan: -This is a preliminary diff. I have only just begun testing/debugging it. -I will be testing this with the existing MergeOperator use-cases and unit-tests (counters, string-append, and redis-lists) -I will be "desk-checking" and walking through the code with the help gdb. -I will find a way of stress-testing the new interface / implementation using db_bench, db_test, merge_test, and/or db_stress. -I will ensure that my tests cover all cases: Get-Memtable, Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0, Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found, end-of-history, end-of-file, etc. -A lot of feedback from the reviewers. Reviewers: haobo, dhruba, zshao, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11499	2013-08-05 20:14:32 -07:00
Dhruba Borthakur	711a30cb30	Merge branch 'master' into performance Conflicts: include/leveldb/options.h include/leveldb/statistics.h util/options.cc	2013-08-02 10:22:08 -07:00
Mayank Agarwal	59d0b02f8b	Expand KeyMayExist to return the proper value if it can be found in memory and also check block_cache Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test Test Plan: make all check;db_stress for 1 hour Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11853	2013-08-01 09:07:46 -07:00
Dhruba Borthakur	a91fdf1b99	The target file size for L0 files was incorrectly set to LLONG_MAX. Summary: The target file size should be valid value. Only if UniversalCompactionStyle is enabled then set max file size to be LLONG_MAX. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-07-24 14:28:00 -07:00
Mayank Agarwal	bf66c10b13	Use KeyMayExist for WriteBatch-Deletes Summary: Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete. Added code to skip getting Table from disk if not already present in table_cache. Some renaming of variables. Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch. Changed KeyMayExist to not be pure virtual and provided a default implementation. Expanded unit-tests in db_test to check appropriately. Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1. Test Plan: db_stress;make check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D11745	2013-07-23 13:36:50 -07:00
Dhruba Borthakur	4a745a5666	Merge branch 'master' into performance Conflicts: db/version_set.cc include/leveldb/options.h util/options.cc	2013-07-17 15:05:57 -07:00
Dhruba Borthakur	76a4923307	Make file-sizes and grandparentoverlap to be unsigned to avoid bad comparisions. Summary: The maxGrandParentOverlapBytes_ was signed which was causing an erroneous comparision between signed and unsigned longs. This, in turn, was causing compaction-created-output-files to be very small in size. Test Plan: make check Differential Revision: https://reviews.facebook.net/D11727	2013-07-17 14:26:06 -07:00
Mayank Agarwal	2a986919d6	Make rocksdb-deletes faster using bloom filter Summary: Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving: 1. Put of delete type 2. Space in the db,and 3. Compaction time Test Plan: make all check; will run db_stress and db_bench and enhance unit-test once the basic design gets approved Reviewers: dhruba, haobo, vamsi Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11607	2013-07-11 12:11:11 -07:00
Xing Jin	8a5341ec7d	Newbie code question Summary: This diff is more about my question when reading compaction codes, instead of a normal diff. I don't quite understand the logic here. Test Plan: I didn't do any test. If this is a bug, I will continue doing some test. Reviewers: haobo, dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11661	2013-07-11 09:03:40 -07:00
Dhruba Borthakur	289efe9922	Update statistics only if needed. Summary: Update statistics only if needed. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-07-09 16:17:00 -07:00
Dhruba Borthakur	7cb8d462d5	Rename PickCompactionHybrid to PickCompactionUniversal. Summary: Rename PickCompactionHybrid to PickCompactionUniversal. Changed a few LOG message from "Hybrid:" to "Universal:". Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-07-09 16:08:54 -07:00
Dhruba Borthakur	116ec527f2	Renamed 'hybrid_compaction' tp be "Universal Compaction'. Summary: All the universal compaction parameters are encapsulated in a new file universal_compaction.h Test Plan: make check	2013-07-03 15:47:53 -07:00
Dhruba Borthakur	47c4191fe8	Reduce write amplification by merging files in L0 back into L0 Summary: There is a new option called hybrid_mode which, when switched on, causes HBase style compactions. Files from L0 are compacted back into L0. This meat of this compaction algorithm is in PickCompactionHybrid(). All files reside in L0. That means all files have overlapping keys. Each file has a time-bound, i.e. each file contains a range of keys that were inserted around the same time. The start-seqno and the end-seqno refers to the timeframe when these keys were inserted. Files that have contiguous seqno are compacted together into a larger file. All files are ordered from most recent to the oldest. The current compaction algorithm starts to look for candidate files starting from the most recent file. It continues to add more files to the same compaction run as long as the sum of the files chosen till now is smaller than the next candidate file size. This logic needs to be debated and validated. The above logic should reduce write amplification to a large extent... will publish numbers shortly. Test Plan: dbstress runs for 6 hours with no data corruption (tested so far). Differential Revision: https://reviews.facebook.net/D11289	2013-06-30 20:07:04 -07:00
Abhishek Kona	00124683de	[rocksdb] do not trim range for level0 in manual compaction Summary: https://code.google.com/p/leveldb/issues/detail?can=1&q=178&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&id=178 Ported the solution as is to RocksDB. Test Plan: moved the unit test as manual_compaction_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11331	2013-06-17 13:58:17 -07:00
Haobo Xu	bdf1085944	[RocksDB] cleanup EnvOptions Summary: This diff simplifies EnvOptions by treating it as POD, similar to Options. - virtual functions are removed and member fields are accessed directly. - StorageOptions is removed. - Options.allow_readahead and Options.allow_readahead_compactions are deprecated. - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite Test Plan: make check; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11175	2013-06-12 11:17:19 -07:00
Dhruba Borthakur	e673d5d26d	Do not submit multiple simultaneous seek-compaction requests. Summary: The code was such that if multi-threaded-compactions as well as seek compaction are enabled then it submits multiple compaction request for the same range of keys. This causes extraneous sst-files to accumulate at various levels. Test Plan: I am not able to write a very good unit test for this one but can easily reproduce this bug with 'dbstress' with the following options. batch=1;maxk=100000000;ops=100000000;ro=0;fm=2;bpl=10485760;of=500000; wbn=3; mbc=20; mb=2097152; wbs=4194304; dds=1; sync=0; t=32; bs=16384; cs=1048576; of=500000; ./db_stress --disable_seek_compaction=0 --mmap_read=0 --threads=$t --block_size=$bs --cache_size=$cs --open_files=$of --verify_checksum=1 --db=/data/mysql/leveldb/dbstress.dir --sync=$sync --disable_wal=1 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --target_file_size_multiplier=$fm --max_write_buffer_number=$wbn --max_background_compactions=$mbc --max_bytes_for_level_base=$bpl --reopen=$ro --ops_per_thread=$ops --max_key=$maxk --test_batches_snapshots=$batch Reviewers: leveldb, emayanke Reviewed By: emayanke Differential Revision: https://reviews.facebook.net/D11055	2013-06-10 15:49:19 -07:00
Abhishek Kona	d91b42ee27	[Rocksdb] Measure all FSYNC/SYNC times Summary: Add stop watches around all sync calls. Test Plan: db_bench check if respective histograms are printed Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11073	2013-06-05 11:06:21 -07:00
Haobo Xu	ab8d2f6ab2	[RocksDB] [Performance] Allow different posix advice to be applied to the same table file Summary: Current posix advice implementation ties up the access pattern hint with the creation of a file. It is not possible to apply different advice for different access (random get vs compaction read), without keeping two open files for the same table. This patch extended the RandomeAccessFile interface to accept new access hint at anytime. Particularly, we are able to set different access hint on the same table file based on when/how the file is used. Two options are added to set the access hint, after the file is first opened and after the file is being compacted. Test Plan: make check; db_stress; db_bench Reviewers: dhruba Reviewed By: dhruba CC: MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D10905	2013-05-30 19:08:44 -07:00
Dhruba Borthakur	d1aaaf718c	Ability to set different size fanout multipliers for every level. Summary: There is an existing field Options.max_bytes_for_level_multiplier that sets the multiplier for the size of each level in the database. This patch introduces the ability to set different multipliers for every level in the database. The size of a level is determined by using both max_bytes_for_level_multiplier as well as the per-level fanout. size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier * fanout[i-1] The default value of fanout is 1, so that it is backward compatible. Test Plan: make check Reviewers: haobo, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D10863	2013-05-21 13:50:20 -07:00
Dhruba Borthakur	a8d3aa2c26	Assertion failure for L0-L1 compactions. Summary: For level-0 compactions, we try to find if can include more L0 files in the same compaction run. This causes the 'smallest' and 'largest' key to get extended to a larger range. But the suceeding call to ParentRangeInCompaction() was still using the earlier values of 'smallest' and 'largest', Because of this bug, a file in L1 can be part of two concurrent compactions: one L0-L1 compaction and the other L1-L2 compaction. This should not cause any data loss, but will cause an assertion failure with debug builds. Test Plan: make check Differential Revision: https://reviews.facebook.net/D10677	2013-05-08 17:10:11 -07:00
Haobo Xu	05e8854085	[Rocksdb] Support Merge operation in rocksdb Summary: This diff introduces a new Merge operation into rocksdb. The purpose of this review is mostly getting feedback from the team (everyone please) on the design. Please focus on the four files under include/leveldb/, as they spell the client visible interface change. include/leveldb/db.h include/leveldb/merge_operator.h include/leveldb/options.h include/leveldb/write_batch.h Please go over local/my_test.cc carefully, as it is a concerete use case. Please also review the impelmentation files to see if the straw man implementation makes sense. Note that, the diff does pass all make check and truly supports forward iterator over db and a version of Get that's based on iterator. Future work: - Integration with compaction - A raw Get implementation I am working on a wiki that explains the design and implementation choices, but coding comes just naturally and I think it might be a good idea to share the code earlier. The code is heavily commented. Test Plan: run all local tests Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb, zshao, sheki, emayanke, MarkCallaghan Differential Revision: https://reviews.facebook.net/D9651	2013-05-03 16:59:02 -07:00
Dhruba Borthakur	9b81d3c406	Simplified level_ptrs by using a std:vector Summary: Simplified level_ptrs by using a std:vector Test Plan: make check Reviewers: sheki, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D10245	2013-04-15 13:52:51 -07:00
Haobo Xu	013e9ebbf1	[RocksDB] [Performance] Speed up FindObsoleteFiles Summary: FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior. Didn't profile anything, but several things could be improved: 1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree). You also don't know how many dynamic allocations occur just for building up this tree. switched to std::vector, also added logic to pre-calculate total size and do just one allocation 2. Don't see why env_->GetChildren() needs to be mutex proteced, moved to PurgeObsoleteFiles where mutex could be unlocked. 3. switched std::set to std:unordered_set, the conversion from vector is also inside PurgeObsoleteFiles I have a feeling this should pretty much fix it. Test Plan: make check; db_stress Reviewers: dhruba, heyongqiang, MarkCallaghan Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D10197	2013-04-12 11:29:27 -07:00
Dhruba Borthakur	ad96563b79	Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. Summary: This patch allows an application to specify whether to use bufferedio, reads-via-mmaps and writes-via-mmaps per database. Earlier, there was a global static variable that was used to configure this functionality. The default setting remains the same (and is backward compatible): 1. use bufferedio 2. do not use mmaps for reads 3. use mmap for writes 4. use readaheads for reads needed for compaction I also added a parameter to db_bench to be able to explicitly specify whether to do readaheads for compactions or not. Test Plan: make check Reviewers: sheki, heyongqiang, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9429	2013-03-20 23:14:03 -07:00
Mayank Agarwal	487168cdcf	Fixed sign-comparison in rocksdb code-base and fixed Makefile Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base Test Plan: make Reviewers: chip, dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D9531	2013-03-19 14:35:23 -07:00
Abhishek Kona	7b9db9c98e	DO not report level size as zero when there are no files in L0 Summary: Instead of checking for number of files in L0. Check for number of files in the requested level. Bug introduced in D4929 (diff trying to do too many things). Test Plan: db_test. Reviewers: dhruba, MarkCallaghan Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D9483	2013-03-18 12:04:38 -07:00
Dhruba Borthakur	ebf16f57c9	Prevent segfault because SizeUnderCompaction was called without any locks. Summary: SizeBeingCompacted was called without any lock protection. This causes crashes, especially when running db_bench with value_size=128K. The fix is to compute SizeUnderCompaction while holding the mutex and passing in these values into the call to Finalize. (gdb) where #4 leveldb::VersionSet::SizeBeingCompacted (this=this@entry=0x7f0b490931c0, level=level@entry=4) at db/version_set.cc:1827 #5 0x000000000043a3c8 in leveldb::VersionSet::Finalize (this=this@entry=0x7f0b490931c0, v=v@entry=0x7f0b3b86b480) at db/version_set.cc:1420 #6 0x00000000004418d1 in leveldb::VersionSet::LogAndApply (this=0x7f0b490931c0, edit=0x7f0b3dc8c200, mu=0x7f0b490835b0, new_descriptor_log=<optimized out>) at db/version_set.cc:1016 #7 0x00000000004222b2 in leveldb::DBImpl::InstallCompactionResults (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1473 #8 0x0000000000426027 in leveldb::DBImpl::DoCompactionWork (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1757 #9 0x0000000000426690 in leveldb::DBImpl::BackgroundCompaction (this=this@entry=0x7f0b49083400, madeProgress=madeProgress@entry=0x7f0b41bf2d1e, deletion_state=...) at db/db_impl.cc:1268 #10 0x0000000000428f42 in leveldb::DBImpl::BackgroundCall (this=0x7f0b49083400) at db/db_impl.cc:1170 #11 0x000000000045348e in BGThread (this=0x7f0b49023100) at util/env_posix.cc:941 #12 leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper (arg=0x7f0b49023100) at util/env_posix.cc:874 #13 0x00007f0b4a7cf10d in start_thread (arg=0x7f0b41bf3700) at pthread_create.c:301 #14 0x00007f0b49b4b11d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Test Plan: make check I am running db_bench with a value size of 128K to see if the segfault is fixed. Reviewers: MarkCallaghan, sheki, emayanke Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9279	2013-03-11 14:09:01 -07:00
Dhruba Borthakur	6d812b6afb	A mechanism to detect manifest file write errors and put db in readonly mode. Summary: If there is an error while writing an edit to the manifest file, the manifest file is closed and reopened to check if the edit made it in. However, if the re-opening of the manifest is unsuccessful and options.paranoid_checks is set t true, then the db refuses to accept new puts, effectively putting the db in readonly mode. In a future diff, I would like to make the default value of paranoid_check to true. Test Plan: make check Reviewers: sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9201	2013-03-07 09:45:49 -08:00
Mark Callaghan	993543d1be	Add rate_delay_limit_milliseconds Summary: This adds the rate_delay_limit_milliseconds option to make the delay configurable in MakeRoomForWrite when the max compaction score is too high. This delay is called the Ln slowdown. This change also counts the Ln slowdown per level to make it possible to see where the stalls occur. From IO-bound performance testing, the Level N stalls occur: * with compression -> at the largest uncompressed level. This makes sense because compaction for compressed levels is much slower. When Lx is uncompressed and Lx+1 is compressed then files pile up at Lx because the (Lx,Lx+1)->Lx+1 compaction process is the first to be slowed by compression. * without compression -> at level 1 Task ID: #1832108 Blame Rev: Test Plan: run with real data, added test Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D9045	2013-03-04 07:41:15 -08:00
Abhishek Kona	c41f1e995c	Codemod NULL to nullptr Summary: scripted NULL to nullptr in * include/leveldb/ * db/ * table/ * util/ Test Plan: make all check Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D9003	2013-02-28 18:04:58 -08:00
Dhruba Borthakur	fd367e677e	Fix unit test failure in db_filename.cc Summary: c_test: db/filename.cc:74: std::string leveldb::DescriptorFileName(const string&,.... Test Plan: this is a failure in a unit test Differential Revision: https://reviews.facebook.net/D8667	2013-02-18 21:53:56 -08:00
Chip Turner	2fdf91a4f8	Fix a number of object lifetime/ownership issues Summary: Replace manual memory management with std::unique_ptr in a number of places; not exhaustive, but this fixes a few leaks with file handles as well as clarifies semantics of the ownership of file handles with log classes. Test Plan: db_stress, make check Reviewers: dhruba Reviewed By: dhruba CC: zshao, leveldb, heyongqiang Differential Revision: https://reviews.facebook.net/D8043	2013-01-23 16:54:11 -08:00
Abhishek Kona	7d5a4383bb	rollover manifest file. Summary: Check in LogAndApply if the file size is more than the limit set in Options. Things to consider : will this be expensive? Test Plan: make all check. Inputs on a new unit test? Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7701	2013-01-16 12:09:44 -08:00
Chip Turner	a2dcd79c1e	Add optional clang compile mode Summary: clang is an alternate compiler based on llvm. It produces nicer error messages and finds some bugs that gcc doesn't, such as the size_t change in this file (which caused some write return values to be misinterpreted!) Clang isn't the default; to try it, do "USE_CLANG=1 make" or "export USE_CLANG=1" then make as normal Test Plan: "make check" and "USE_CLANG=1 make check" Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7899	2013-01-15 18:48:37 -08:00
Chip Turner	9bbcab57a9	Fix broken build Summary: Mis-merged from HEAD, had a duplicate declaration. Test Plan: make -j32 OPT=-g Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7911	2013-01-15 14:05:49 -08:00
Kosie van der Merwe	28fe86c48a	Fixed bug with seek compactions on Level 0 Summary: Due to how the code handled compactions in Level 0 in `PickCompaction()` it could be the case that two compactions on level 0 ran that produced tables in level 1 that overlap. However, this case seems like it would only occur on a seek compaction which is unlikely on level 0. Furthermore, level 0 and level 1 had to have a certain arrangement of files. Test Plan: make check Reviewers: dhruba, vamsi Reviewed By: dhruba CC: leveldb, sheki Differential Revision: https://reviews.facebook.net/D7923	2013-01-15 12:43:09 -08:00
Chip Turner	c0cb289d57	Various build cleanups/improvements Summary: Specific changes: 1) Turn on -Werror so all warnings are errors 2) Fix some warnings the above now complains about 3) Add proper dependency support so changing a .h file forces a .c file to rebuild 4) Automatically use fbcode gcc on any internal machine rather than whatever system compiler is laying around 5) Fix jemalloc to once again be used in the builds (seemed like it wasn't being?) 6) Fix issue where 'git' would fail in build_detect_version because of LD_LIBRARY_PATH being set in the third-party build system Test Plan: make, make check, make clean, touch a header file, make sure rebuild is expected Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7887	2013-01-14 18:40:22 -08:00
Mark Callaghan	2ba125faf6	fix warning for unused variable Test Plan: compile Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7857	2013-01-11 15:00:47 -08:00
Abhishek Kona	85ad13be1a	Port fix for Leveldb manifest writing bug from Open-Source Summary: Pretty much a blind copy of the patch in open source. Hope to get this in before we make a release Test Plan: make clean check Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7809	2013-01-10 12:06:03 -08:00
Dhruba Borthakur	d7d43ae21a	ExtendOverlappingInputs too slow for large databases. Summary: There was a bug in the ExtendOverlappingInputs method so that the terminating condition for the backward search was incorrect. Test Plan: make clean check Reviewers: sheki, emayanke, MarkCallaghan Reviewed By: MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D7725	2013-01-02 13:19:06 -08:00
Zheng Shao	c28097538a	manifest_dump: Add --hex=1 option Summary: Without this option, manifest_dump does not print binary keys for files in a human-readable way. Test Plan: ./manifest_dump --hex=1 --verbose=0 --file=/data/users/zshao/fdb_comparison/leveldb/fbobj.apprequest-0_0_original/MANIFEST-000002 manifest_file_number 589 next_file_number 590 last_sequence 2311567 log_number 543 prev_log_number 0 --- level 0 --- version# 0 --- 532:1300357['0000455BABE20000' @ 2183973 : 1 .. 'FFFCA5D7ADE20000' @ 2184254 : 1] 536:1308170['000198C75CE30000' @ 2203313 : 1 .. 'FFFCF94A79E30000' @ 2206463 : 1] 542:1321644['0002931AA5E50000' @ 2267055 : 1 .. 'FFF77B31C5E50000' @ 2270754 : 1] 544:1286390['000410A309E60000' @ 2278592 : 1 .. 'FFFE470A73E60000' @ 2289221 : 1] 538:1298778['0006BCF4D8E30000' @ 2217050 : 1 .. 'FFFD77DAF7E30000' @ 2220489 : 1] 540:1282353['00090D5356E40000' @ 2231156 : 1 .. 'FFFFF4625CE40000' @ 2231969 : 1] --- level 1 --- version# 0 --- 510:2112325['000007F9C2D40000' @ 1782099 : 1 .. '146F5B67B8D80000' @ 1905458 : 1] 511:2121742['146F8A3023D60000' @ 1824388 : 1 .. '28BC8FBB9CD40000' @ 1777993 : 1] 512:801631['28BCD396F1DE0000' @ 2080191 : 1 .. '3082DBE9ADDB0000' @ 1989927 : 1] Reviewers: dhruba, sheki, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7425	2012-12-16 08:58:28 -08:00
Dhruba Borthakur	c847a31727	Print compaction score for every compaction run. Summary: A compaction is picked based on its score. It is useful to print the compaction score in the LOG because it aids in debugging. If one looks at the logs, one can find out why a compaction was preferred over another. Test Plan: make clean check Differential Revision: https://reviews.facebook.net/D7137	2012-12-04 10:03:47 -08:00
Abhishek Kona	d29f181923	Fix all the lint errors. Summary: Scripted and removed all trailing spaces and converted all tabs to spaces. Also fixed other lint errors. All lint errors from this point of time should be taken seriously. Test Plan: make all check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7059	2012-11-28 17:18:41 -08:00
Dhruba Borthakur	2a39699900	Assertion failure while running with unit tests with OPT=-g Summary: When we expand the range of keys for a level 0 compaction, we need to invoke ParentFilesInCompaction() only once for the entire range of keys that is being compacted. We were invoking it for each file that was being compacted, but this triggers an assertion because each file's range were contiguous but non-overlapping. I renamed ParentFilesInCompaction to ParentRangeInCompaction to adequately represent that it is the range-of-keys and not individual files that we compact in a single compaction run. Here is the assertion that is fixed by this patch. db_test: db/version_set.cc:585: void leveldb::Version::ExtendOverlappingInputs(int, const leveldb::Slice&, const leveldb::Slice&, std::vector<leveldb::FileMetaData, std::allocator<leveldb::FileMetaData> >*, int): Assertion `user_cmp->Compare(flimit, user_begin) >= 0' failed. Test Plan: make clean check OPT=-g Reviewers: sheki Reviewed By: sheki CC: MarkCallaghan, emayanke, leveldb Differential Revision: https://reviews.facebook.net/D6963	2012-11-26 14:00:39 -08:00
Dhruba Borthakur	7632fdb5cb	Support taking a configurable number of files from the same level to compact in a single compaction run. Summary: The compaction process takes some files from LevelK and merges it into LevelK+1. The number of files it picks from LevelK was capped such a way that the total amount of data picked does not exceed the maxfilesize of that level. This essentially meant that only one file from LevelK is picked for a single compaction. For bulkloads, we would like to take many many file from LevelK and compact them using a single compaction run. This patch introduces a option called the 'source_compaction_factor' (similar to expanded_compaction_factor). It is a multiplier that is multiplied by the maxfilesize of that level to arrive at the limit that is used to throttle the number of source files from LevelK. For bulk loads, set source_compaction_factor to a very high number so that multiple files from the same level are picked for compaction in a single compaction. The default value of source_compaction_factor is 1, so that we can keep backward compatibilty with existing compaction semantics. Test Plan: make clean check Reviewers: emayanke, sheki Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D6867	2012-11-21 08:37:03 -08:00
Dhruba Borthakur	3754f2f4ff	A major bug that was not considering the compaction score of the n-1 level. Summary: The method Finalize() recomputes the compaction score of each level and then sorts these score from largest to smallest. The idea is that the level with the largest compaction score will be a better candidate for compaction. There are usually very few levels, and a bubble sort code was used to sort these compaction scores. There existed a bug in the sorting code that skipped looking at the score for the n-1 level. This meant that even if the compaction score of the n-1 level is large, it will not be picked for compaction. This patch fixes the bug and also introduces "asserts" in the code to detect any possible inconsistencies caused by future bugs. This bug existed in the very first code change that introduced multi-threaded compaction to the leveldb code. That version of code was committed on Oct 19th via `1ca0584345` Test Plan: make clean check OPT=-g Reviewers: emayanke, sheki, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D6837	2012-11-20 15:44:21 -08:00
Dhruba Borthakur	dde70898a1	Fix asserts Summary: make check OPT=-g fails with the following assert. ==== Test DBTest.ApproximateSizes db_test: db/version_set.cc:765: void leveldb::VersionSet::Builder::CheckConsistencyForDeletes(leveldb::VersionEdit, int, int): Assertion `found' failed. The assertion was that file #7 that was being deleted did not preexists, but actualy it did pre-exist as shown in the manifest dump shows below. The bug was that we did not check for file existance at the same level. **********************Edit[0] = VersionEdit { Comparator: leveldb.BytewiseComparator } *********************Edit[1] = VersionEdit { LogNumber: 8 PrevLogNumber: 0 NextFile: 9 LastSeq: 80 AddFile: 0 7 8005319 'key000000' @ 1 : 1 .. 'key000079' @ 80 : 1 } ***********************Edit[2] = VersionEdit { LogNumber: 8 PrevLogNumber: 0 NextFile: 13 LastSeq: 80 CompactPointer: 0 'key000079' @ 80 : 1 DeleteFile: 0 7 AddFile: 1 9 2101425 'key000000' @ 1 : 1 .. 'key000020' @ 21 : 1 AddFile: 1 10 2101425 'key000021' @ 22 : 1 .. 'key000041' @ 42 : 1 AddFile: 1 11 2101425 'key000042' @ 43 : 1 .. 'key000062' @ 63 : 1 AddFile: 1 12 1701165 'key000063' @ 64 : 1 .. 'key000079' @ 80 : 1 } Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2012-11-19 14:51:22 -08:00
Dhruba Borthakur	48dafb2c59	Fix compilation error introduced by previous commit `7889e09455` Summary: Fix compilation error introduced by previous commit `7889e09455` Test Plan: make clean check	2012-11-19 12:16:45 -08:00
Dhruba Borthakur	7889e09455	Enhance manifest_dump to print each individual edit. Summary: The manifest file contains a series of edits. If the verbose option is switched on, then print each individual edit in the manifest file. This helps in debugging. Test Plan: make clean manifest_dump Reviewers: emayanke, sheki Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D6807	2012-11-19 12:04:35 -08:00
Dhruba Borthakur	43d9a8225a	Fix asserts so that "make check OPT=-g" works on performance branch Summary: Compilation used to fail with the error: db/version_set.cc:1773: error: ‘number_of_files_to_sort_’ is not a member of ‘leveldb::VersionSet’ I created a new method called CheckConsistencyForDeletes() so that all the high cost checking is done only when OPT=-g is specified. I also fixed a bug in PickCompactionBySize that was triggered when OPT=-g was switched on. The base_index in the compaction record was not set correctly. Test Plan: make check OPT=-g Differential Revision: https://reviews.facebook.net/D6687	2012-11-13 10:40:52 -08:00
Dhruba Borthakur	95dda37858	Move filesize-based-sorting to outside the Mutex Summary: When a new version is created, we sort all the files at every level based on their size. This is necessary because we want to compact the largest file first. The sorting takes quite a bit of CPU. Moved the sorting code to be outside the mutex. Also, the earlier code was sorting files at all levels but we do not need to sort the highest-number level because those files are never the cause of any compaction. To reduce sorting costs, we sort only the first few files in each level because it is likely that those are the only files in that level that will be picked for compaction. At steady state, I have seen that this patch increase throughout from 1500 writes/sec to 1700 writes/sec at the end of a 72 hour run. The cpu saving by not sorting the last level was not distinctive in this test run because there were only 100K files in the highest numbered level. I expect the cpu saving to be significant when the number of files is much higher. This is mostly an early preview and not ready for rigorous review. With this patch, the writs/sec is now bottlenecked not by the sorting code but by GetOverlappingInputs. I am working on a patch to optimize GetOverlappingInputs. Test Plan: make check Reviewers: MarkCallaghan, heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D6411	2012-11-07 15:39:44 -08:00
Dhruba Borthakur	8143062edd	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/version_set.cc util/options.cc	2012-11-07 15:11:37 -08:00
Dhruba Borthakur	9b87a2bae8	Avoid doing a exhaustive search when looking for overlapping files. Summary: The Version::GetOverlappingInputs() is called multiple times in the compaction code path. Eack invocation does a binary search for overlapping files in the specified key range. This patch remembers the offset of an overlapped file when GetOverlappingInputs() is called the first time within a compaction run. Suceeding calls to GetOverlappingInputs() uses the remembered index to avoid the binary search. I measured that 1000 iterations of GetOverlappingInputs takes around 4500 microseconds without this patch. If I use this patch with the hint on every invocation, then 1000 iterations take about 3900 microsecond. Test Plan: make check OPT=-g Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan, emayanke, sheki Differential Revision: https://reviews.facebook.net/D6513	2012-11-07 11:47:17 -08:00
Dhruba Borthakur	aa42c66814	Fix all warnings generated by -Wall option to the compiler. Summary: The default compilation process now uses "-Wall" to compile. Fix all compilation error generated by gcc. Test Plan: make all check Reviewers: heyongqiang, emayanke, sheki Reviewed By: heyongqiang CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6525	2012-11-06 14:07:31 -08:00
Dhruba Borthakur	5f91868cee	Merge branch 'master' into performance Conflicts: db/version_set.cc util/options.cc	2012-11-05 16:51:55 -08:00
Dhruba Borthakur	cb7a00227f	The method GetOverlappingInputs should use binary search. Summary: The method Version::GetOverlappingInputs used a sequential search to map a kay-range to a set of files. But the files are arranged in ascending order of key, so a biary search is more effective. This patch implements Version::GetOverlappingInputsBinarySearch that finds one file that corresponds to the specified key range and then iterates backwards and forwards to find all overlapping files. This patch is critical for making compactions efficient, especially when there are thousands of files in a single level. I measured that 1000 iterations of TEST_MaxNextLevelOverlappingBytes takes 16000 microseconds without this patch. With this patch, the same method takes about 4600 microseconds. Test Plan: Almost all unit tests in db_test uses this method to lookup keys. Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan, emayanke, sheki Differential Revision: https://reviews.facebook.net/D6465	2012-11-05 16:08:01 -08:00
heyongqiang	f1a7c735b5	fix complie error Summary: as subject Test Plan:n/a	2012-11-05 10:30:19 -08:00
heyongqiang	d55c2ba305	Add a tool to change number of levels Summary: as subject. Test Plan: manually test it, will add a testcase Reviewers: dhruba, MarkCallaghan Differential Revision: https://reviews.facebook.net/D6345	2012-11-05 10:17:39 -08:00
Dhruba Borthakur	53e04311b1	Merge branch 'master' into performance Conflicts: db/db_bench.cc util/options.cc	2012-10-29 14:18:00 -07:00
Dhruba Borthakur	5b0fe6c73b	Greedy algorithm for picking files to compact. Summary: It is best if we pick the largest file to compact in a level. This reduces the write amplification factor for compactions. Each level has an auxiliary data structure called files_by_size_ that sorts all files by their size. This data structure is updated when a new version is created. Test Plan: make check Differential Revision: https://reviews.facebook.net/D6195	2012-10-25 18:27:53 -07:00
Dhruba Borthakur	8fb5f40468	firstIndex fix for multi-threaded compaction code. Summary: Prior to multi-threaded compaction, wrap-around would be done by using current_->files_[level[0]. With this change we should be using the first file for which f->being_compacted is not true. `1ca0584345 (commitcomment-2041516)` Test Plan: make check Differential Revision: https://reviews.facebook.net/D6165	2012-10-25 08:44:47 -07:00
Dhruba Borthakur	3b06f94fa2	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_impl.h db/version_set.cc	2012-10-23 22:30:07 -07:00
heyongqiang	5010daa7a8	add "seek_compaction" to log for better debug Summary: Summary: as subject Test Plan: compile Reviewers: dhruba Reviewed By: dhruba CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6117	2012-10-22 10:00:25 -07:00
Dhruba Borthakur	1ca0584345	This is the mega-patch multi-threaded compaction published in https://reviews.facebook.net/D5997. Summary: This patch allows compaction to occur in multiple background threads concurrently. If a manual compaction is issued, the system falls back to a single-compaction-thread model. This is done to ensure correctess and simplicity of code. When the manual compaction is finished, the system resumes its concurrent-compaction mode automatically. The updates to the manifest are done via group-commit approach. Test Plan: run db_bench	2012-10-19 14:00:53 -07:00
Dhruba Borthakur	c1bb32e1ba	Trigger read compaction only if seeks to storage are incurred. Summary: In the current code, a Get() call can trigger compaction if it has to look at more than one file. This causes unnecessary compaction because looking at more than one file is a penalty only if the file is not yet in the cache. Also, th current code counts these files before the bloom filter check is applied. This patch counts a 'seek' only if the file fails the bloom filter check and has to read in data block(s) from the storage. This patch also counts a 'seek' if a file is not present in the file-cache, because opening a file means that its index blocks need to be read into cache. Test Plan: unit test attached. I will probably add one more unti tests. Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D5709	2012-09-28 11:10:52 -07:00
Dhruba Borthakur	24eea931ef	If ReadCompaction is switched off, then it is better to not even submit background compaction jobs. Summary: If ReadCompaction is switched off, then it is better to not even submit background compaction jobs. I see about 3% increase in read-throughput on a pure memory database. Test Plan: run db_bench Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5673	2012-09-25 11:07:01 -07:00
Dhruba Borthakur	ae36e509f8	The BackupAPI should also list the length of the manifest file. Summary: The GetLiveFiles() api lists the set of sst files and the current MANIFEST file. But the database continues to append new data to the MANIFEST file even when the application is backing it up to the backup location. This means that the database-version that is stored in the MANIFEST FILE in the backup location does not correspond to the sst files returned by GetLiveFiles. This API adds a new parameter to GetLiveFiles. This new parmeter returns the current size of the MANIFEST file. Test Plan: Unit test attached. Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5631	2012-09-25 03:13:25 -07:00
Dhruba Borthakur	bb2dcd2457	Segfault in DoCompactionWork caused by buffer overflow Summary: The code was allocating 200 bytes on the stack but it writes 256 bytes into the array. x8a8ea5 std::_Rb_tree<>::erase() @ 0x7f134bee7eb0 (unknown) @ 0x8a8ea5 std::_Rb_tree<>::erase() @ 0x8a35d6 leveldb::DBImpl::CleanupCompaction() @ 0x8a7810 leveldb::DBImpl::BackgroundCompaction() @ 0x8a804d leveldb::DBImpl::BackgroundCall() @ 0x8c4eff leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper() @ 0x7f134b3c010d start_thread @ 0x7f134bf9f10d clone Test Plan: run db_bench with overwrite option Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5595	2012-09-21 10:55:38 -07:00
heyongqiang	a8464ed820	add an option to disable seek compaction Summary: as subject. This diff should be good for benchmarking. will send another diff to make it better in the case the seek compaction is enable. In that coming diff, will not count a seek if the bloomfilter filters. Test Plan: build Reviewers: dhruba, MarkCallaghan Reviewed By: MarkCallaghan Differential Revision: https://reviews.facebook.net/D5481	2012-09-17 13:59:57 -07:00
Dhruba Borthakur	ba55d77b5d	Ability to take a file-lvel snapshot from leveldb. Summary: A set of apis that allows an application to backup data from the leveldb database based on a set of files. Test Plan: unint test attached. more coming soon. Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5439	2012-09-17 09:14:50 -07:00
Dhruba Borthakur	fe93631678	Clean up compiler warnings generated by -Wall option. Summary: Clean up compiler warnings generated by -Wall option. make clean all OPT=-Wall This is a pre-requisite before making a new release. Test Plan: compile and run unit tests Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5019	2012-08-29 14:24:51 -07:00
heyongqiang	a4f9b8b49e	merge 1.5 Summary: as subject Test Plan: db_test table_test Reviewers: dhruba	2012-08-28 11:43:33 -07:00
heyongqiang	d3759ca121	fix db_test error with scribe logger turned on Summary: as subject Test Plan: db_test Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D4929	2012-08-28 11:22:58 -07:00
Dhruba Borthakur	fc20273e73	Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync). Summary: Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync). This is needed for data durability when running on ext3 filesystems. Added options to the benchmark db_bench to generate performance numbers with either fsync or fdatasync enabled. Cleaned up Makefile to build leveldb_shell only when building the thrift leveldb server. Test Plan: build and run benchmark Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D4911	2012-08-27 21:24:17 -07:00
heyongqiang	1de83cc2ac	add more logs Summary: as subject add a tool to read sst file as subject. ./sst_reader --command=check --file= ./sst_reader --command=scan --file= Test Plan: db_test run this command Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D4881	2012-08-24 15:20:49 -07:00
heyongqiang	1c99b0a6b3	add more logs Summary: as subject Test Plan:db_test Reviewers: dhruba	2012-08-24 15:18:43 -07:00
Dhruba Borthakur	f3ee54526f	Utility to dump manifest contents. Summary: ./manifest_dump --file=/tmp/dbbench/MANIFEST-000002 Output looks like manifest_file_number 30 next_file_number 31 last_sequence 388082 log_number 28 prev_log_number 0 --- level 0 --- --- level 1 --- --- level 2 --- 5:3244155['0000000000000000' @ 1 : 1 .. '0000000000028220' @ 28221 : 1] 7:3244177['0000000000028221' @ 28222 : 1 .. '0000000000056441' @ 56442 : 1] 9:3244156['0000000000056442' @ 56443 : 1 .. '0000000000084662' @ 84663 : 1] 11:3244178['0000000000084663' @ 84664 : 1 .. '0000000000112883' @ 112884 : 1] 13:3244158['0000000000112884' @ 112885 : 1 .. '0000000000141104' @ 141105 : 1] 15:3244176['0000000000141105' @ 141106 : 1 .. '0000000000169325' @ 169326 : 1] 17:3244156['0000000000169326' @ 169327 : 1 .. '0000000000197546' @ 197547 : 1] 19:3244178['0000000000197547' @ 197548 : 1 .. '0000000000225767' @ 225768 : 1] 21:3244155['0000000000225768' @ 225769 : 1 .. '0000000000253988' @ 253989 : 1] 23:3244179['0000000000253989' @ 253990 : 1 .. '0000000000282209' @ 282210 : 1] 25:3244157['0000000000282210' @ 282211 : 1 .. '0000000000310430' @ 310431 : 1] 27:3244176['0000000000310431' @ 310432 : 1 .. '0000000000338651' @ 338652 : 1] 29:3244156['0000000000338652' @ 338653 : 1 .. '0000000000366872' @ 366873 : 1] --- level 3 --- --- level 4 --- --- level 5 --- --- level 6 --- Test Plan: run on test directory created by dbbench Reviewers: heyongqiang Reviewed By: heyongqiang CC: hustliubo Differential Revision: https://reviews.facebook.net/D4743	2012-08-24 15:17:09 -07:00
heyongqiang	af6fa308b0	regression for trigger compaction logic Summary: as subject Test Plan: manually run db_bench confirmed Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D4809	2012-08-22 11:41:33 -07:00
heyongqiang	6ba1f17789	adding a scribe logger in leveldb to log leveldb deploy stats Summary: as subject. A new log is written to scribe via thrift client when a new db is opened and when there is a compaction. a new option var scribe_log_db_stats is added. Test Plan: manually checked using command "ptail -time 0 leveldb_deploy_stats" Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4659	2012-08-21 11:43:22 -07:00
Dhruba Borthakur	2aa514ec8c	Utility to dump manifest contents. Summary: ./manifest_dump --file=/tmp/dbbench/MANIFEST-000002 Output looks like manifest_file_number 30 next_file_number 31 last_sequence 388082 log_number 28 prev_log_number 0 --- level 0 --- --- level 1 --- --- level 2 --- 5:3244155['0000000000000000' @ 1 : 1 .. '0000000000028220' @ 28221 : 1] 7:3244177['0000000000028221' @ 28222 : 1 .. '0000000000056441' @ 56442 : 1] 9:3244156['0000000000056442' @ 56443 : 1 .. '0000000000084662' @ 84663 : 1] 11:3244178['0000000000084663' @ 84664 : 1 .. '0000000000112883' @ 112884 : 1] 13:3244158['0000000000112884' @ 112885 : 1 .. '0000000000141104' @ 141105 : 1] 15:3244176['0000000000141105' @ 141106 : 1 .. '0000000000169325' @ 169326 : 1] 17:3244156['0000000000169326' @ 169327 : 1 .. '0000000000197546' @ 197547 : 1] 19:3244178['0000000000197547' @ 197548 : 1 .. '0000000000225767' @ 225768 : 1] 21:3244155['0000000000225768' @ 225769 : 1 .. '0000000000253988' @ 253989 : 1] 23:3244179['0000000000253989' @ 253990 : 1 .. '0000000000282209' @ 282210 : 1] 25:3244157['0000000000282210' @ 282211 : 1 .. '0000000000310430' @ 310431 : 1] 27:3244176['0000000000310431' @ 310432 : 1 .. '0000000000338651' @ 338652 : 1] 29:3244156['0000000000338652' @ 338653 : 1 .. '0000000000366872' @ 366873 : 1] --- level 3 --- --- level 4 --- --- level 5 --- --- level 6 --- Test Plan: run on test directory created by dbbench Reviewers: heyongqiang Reviewed By: heyongqiang CC: hustliubo Differential Revision: https://reviews.facebook.net/D4743	2012-08-17 22:36:59 -07:00
heyongqiang	680e571c4c	add compaction log Summary: Summary: add compaction summary to log log looks like: 2012/08/17-18:18:32.557334 7fdcaa2bb700 Compaction summary: Base level 0, input file:[11 9 7 ],[] Test Plan: tested via db_test Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D4749	2012-08-17 19:29:39 -07:00
heyongqiang	4e4b6812ff	Make some variables configurable for each db instance Summary: Make configurable 'targetFileSize', 'targetFileSizeMultiplier', 'maxBytesForLevelBase', 'maxBytesForLevelMultiplier', 'expandedCompactionFactor', 'maxGrandParentOverlapFactor' Test Plan: N/A Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D3801	2012-06-27 14:36:31 -07:00
Sanjay Ghemawat	85584d497e	Added bloom filter support. In particular, we add a new FilterPolicy class. An instance of this class can be supplied in Options when opening a database. If supplied, the instance is used to generate summaries of keys (e.g., a bloom filter) which are placed in sstables. These summaries are consulted by DB::Get() so we can avoid reading sstable blocks that are guaranteed to not contain the key we are looking for. This change provides one implementation of FilterPolicy based on bloom filters. Other changes: - Updated version number to 1.4. - Some build tweaks. - C binding for CompactRange. - A few more benchmarks: deleteseq, deleterandom, readmissing, seekrandom. - Minor .gitignore update.	2012-04-17 08:36:46 -07:00
Sanjay Ghemawat	239ac9d2de	avoid very large compactions; fix build on Linux	2012-02-02 09:34:14 -08:00
Hans Wennborg	36a5f8ed7f	A number of fixes: - Replace raw slice comparison with a call to user comparator. Added test for custom comparators. - Fix end of namespace comments. - Fixed bug in picking inputs for a level-0 compaction. When finding overlapping files, the covered range may expand as files are added to the input set. We now correctly expand the range when this happens instead of continuing to use the old range. For example, suppose L0 contains files with the following ranges: F1: a .. d F2: c .. g F3: f .. j and the initial compaction target is F3. We used to search for range f..j which yielded {F2,F3}. However we now expand the range as soon as another file is added. In this case, when F2 is added, we expand the range to c..j and restart the search. That picks up file F1 as well. This change fixes a bug related to deleted keys showing up incorrectly after a compaction as described in Issue 44. (Sync with upstream @25072954)	2011-10-31 17:22:06 +00:00
Gabor Cselle	299ccedfec	A number of bugfixes: - Added DB::CompactRange() method. Changed manual compaction code so it breaks up compactions of big ranges into smaller compactions. Changed the code that pushes the output of memtable compactions to higher levels to obey the grandparent constraint: i.e., we must never have a single file in level L that overlaps too much data in level L+1 (to avoid very expensive L-1 compactions). Added code to pretty-print internal keys. - Fixed bug where we would not detect overlap with files in level-0 because we were incorrectly using binary search on an array of files with overlapping ranges. Added "leveldb.sstables" property that can be used to dump all of the sstables and ranges that make up the db state. - Removing post_write_snapshot support. Email to leveldb mailing list brought up no users, just confusion from one person about what it meant. - Fixing static_cast char to unsigned on BIG_ENDIAN platforms. Fixes Issue 35 and Issue 36. - Comment clarification to address leveldb Issue 37. - Change license in posix_logger.h to match other files. - A build problem where uint32 was used instead of uint32_t. Sync with upstream @24408625	2011-10-05 16:30:28 -07:00
gabor@google.com	7263023651	Bugfixes: for Get(), don't hold mutex while writing log. - Fix bug in Get: when it triggers a compaction, it could sometimes mark the compaction with the wrong level (if there was a gap in the set of levels examined for the Get). - Do not hold mutex while writing to the log file or to the MANIFEST file. Added a new benchmark that runs a writer thread concurrently with reader threads. Percentiles ------------------------------ micros/op: avg median 99 99.9 99.99 99.999 max ------------------------------------------------------ before: 42 38 110 225 32000 42000 48000 after: 24 20 55 65 130 1100 7000 - Fixed race in optimized Get. It should have been using the pinned memtables, not the current memtables. git-svn-id: https://leveldb.googlecode.com/svn/trunk@50 62dab493-f737-651d-591e-8d6aee1b9529	2011-09-01 19:08:02 +00:00
dgrogan@chromium.org	a05525d13b	@23023120 git-svn-id: https://leveldb.googlecode.com/svn/trunk@47 62dab493-f737-651d-591e-8d6aee1b9529	2011-08-06 00:19:37 +00:00
gabor@google.com	60bd8015f2	Speed up Snappy uncompression, new Logger interface. - Removed one copy of an uncompressed block contents changing the signature of Snappy_Uncompress() so it uncompresses into a flat array instead of a std::string. Speeds up readrandom ~10%. - Instead of a combination of Env/WritableFile, we now have a Logger interface that can be easily overridden applications that want to supply their own logging. - Separated out the gcc and Sun Studio parts of atomic_pointer.h so we can use 'asm', 'volatile' keywords for Sun Studio. git-svn-id: https://leveldb.googlecode.com/svn/trunk@39 62dab493-f737-651d-591e-8d6aee1b9529	2011-07-21 02:40:18 +00:00
gabor@google.com	6872ace901	Sun Studio support, and fix for test related memory fixes. - LevelDB patch for Sun Studio Based on a patch submitted by Theo Schlossnagle - thanks! This fixes Issue 17. - Fix a couple of test related memory leaks. git-svn-id: https://leveldb.googlecode.com/svn/trunk@38 62dab493-f737-651d-591e-8d6aee1b9529	2011-07-19 23:36:47 +00:00
gabor@google.com	6699c7ebe6	Small tweaks and bugfixes for Issue 18 and 19. Slight tweak to the no-overlap optimization: only push to level 2 to reduce the amount of wasted space when the same small key range is being repeatedly overwritten. Fix for Issue 18: Avoid failure on Windows by avoiding deletion of lock file until the end of DestroyDB(). Fix for Issue 19: Disregard sequence numbers when checking for overlap in sstable ranges. This fixes issue 19: when writing the same key over and over again, we would generate a sequence of sstables that were never merged together since their sequence numbers were disjoint. Don't ignore map/unmap error checks. Miscellaneous fixes for small problems Sanjay found while diagnosing issue/9 and issue/16 (corruption_testr failures). - log::Reader reports the record type when it finds an unexpected type. - log::Reader no longer reports an error when it encounters an expected zero record regardless of the setting of the "checksum" flag. - Added a missing forward declaration. - Documented a side-effects of larger write buffer sizes (longer recovery time). git-svn-id: https://leveldb.googlecode.com/svn/trunk@37 62dab493-f737-651d-591e-8d6aee1b9529	2011-07-15 00:20:57 +00:00
gabor@google.com	ccf0fcd5c2	A number of smaller fixes and performance improvements: - Implemented Get() directly instead of building on top of a full merging iterator stack. This speeds up the "readrandom" benchmark by up to 15-30%. - Fixed an opensource compilation problem. Added --db=<name> flag to control where the database is placed. - Automatically compact a file when we have done enough overlapping seeks to that file. - Fixed a performance bug where we would read from at least one file in a level even if none of the files overlapped the key being read. - Makefile fix for Mac OSX installations that have XCode 4 without XCode 3. - Unified the two occurrences of binary search in a file-list into one routine. - Found and fixed a bug where we would unnecessarily search the last file when looking for a key larger than all data in the level. - A fix to avoid the need for trivial move compactions and therefore gets rid of two out of five syncs in "fillseq". - Removed the MANIFEST file write when switching to a new memtable/log-file for a 10-20% improvement on fill speed on ext4. - Adding a SNAPPY setting in the Makefile for folks who have Snappy installed. Snappy compresses values and speeds up writes. git-svn-id: https://leveldb.googlecode.com/svn/trunk@32 62dab493-f737-651d-591e-8d6aee1b9529	2011-06-22 02:36:45 +00:00
dgrogan@chromium.org	da79909507	sync with upstream @ 21409451 Check the NEWS file for details of what changed. git-svn-id: https://leveldb.googlecode.com/svn/trunk@28 62dab493-f737-651d-591e-8d6aee1b9529	2011-05-21 02:17:43 +00:00
dgrogan@chromium.org	ba6dac0e80	@20776309 * env_chromium.cc should not export symbols. * Fix MSVC warnings. * Removed large value support. * Fix broken reference to documentation file git-svn-id: https://leveldb.googlecode.com/svn/trunk@24 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-20 22:48:11 +00:00
dgrogan@chromium.org	69c6d38342	reverting disastrous MOE commit, returning to r21 git-svn-id: https://leveldb.googlecode.com/svn/trunk@23 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-19 23:11:15 +00:00
dgrogan@chromium.org	b743906eea	Revision created by MOE tool push_codebase. MOE_MIGRATION= git-svn-id: https://leveldb.googlecode.com/svn/trunk@22 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-19 23:01:25 +00:00
dgrogan@chromium.org	b409afe968	chmod a-x git-svn-id: https://leveldb.googlecode.com/svn/trunk@21 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-18 23:15:58 +00:00
dgrogan@chromium.org	f779e7a5d8	@20602303. Default file permission is now 755. git-svn-id: https://leveldb.googlecode.com/svn/trunk@20 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-12 19:38:58 +00:00
jorlow@chromium.org	4671a695fc	Move include files into a leveldb subdir. git-svn-id: https://leveldb.googlecode.com/svn/trunk@18 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-30 18:35:40 +00:00
jorlow@chromium.org	e2da744e12	Upstream changes. git-svn-id: https://leveldb.googlecode.com/svn/trunk@16 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-28 20:43:44 +00:00
jorlow@chromium.org	8303bb1b33	Pull from upstream. git-svn-id: https://leveldb.googlecode.com/svn/trunk@14 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-22 23:24:02 +00:00

... 6 7 8 9 10 ...

752 Commits