rocksdb

Author	SHA1	Message	Date
Ari Ekmekji	b6def58f73	Changed 'num_subcompactions' to the more accurate 'max_subcompactions' Summary: Up until this point we had DbOptions.num_subcompactions, but it is semantically more correct to call this max_subcompactions since we will schedule up to DbOptions.max_subcompactions smaller compactions at a time during a compaction job. I also added a --subcompactions option to db_bench Test Plan: make all make check Reviewers: sdong, igor, anthony, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D45069	2015-08-21 14:25:34 -07:00
Islam AbdelRahman	3bd9db420e	[Cleanup] Remove RandomRWFile Summary: RandomRWFile is not used anywhere in out code base, this patch remove RandomRWFile Test Plan: make check -j64 USE_CLANG=1 make all -j64 OPT=-DROCKSDB_LITE make release -j64 Reviewers: sdong, yhchiang, anthony, kradhakrishnan, rven, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D44091	2015-08-12 10:18:59 -07:00
agiardullo	c2f2cb0214	Pessimistic Transactions Summary: Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it. MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues. Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable. Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing. Test Plan: Unit tests, db_bench parallel testing. Reviewers: igor, rven, sdong, yhchiang, yoshinorim Reviewed By: sdong Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40869	2015-08-11 17:52:23 -07:00
sdong	254c4fb88f	In HISTORY.md Switch unreleased notes to 3.13 Summary: Prepare release notes for 3.13. Test Plan: Text only Reviewers: yhchiang, IslamAbdelRahman, rven, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D43665	2015-08-06 14:02:44 -07:00
sdong	c7742452eb	Add Statistics.getHistogramString() to print more detailed outputs of a histogram Summary: Provide a way for users to know more detailed ditribution of a histogram metrics. Example outputs: Manually add statement fprintf(stdout, "%s\n", dbstats->getHistogramString(SST_READ_MICROS).c_str()); Will print out something like: Count: 989151 Average: 1.7659 StdDev: 1.52 Min: 0.0000 Median: 1.2071 Max: 860.0000 Percentiles: P50: 1.21 P75: 1.70 P99: 5.12 P99.9: 13.67 P99.99: 21.70 ------------------------------------------------------ [ 0, 1 ) 390839 39.513% 39.513% ######## [ 1, 2 ) 500918 50.641% 90.154% ########## [ 2, 3 ) 79358 8.023% 98.177% ## [ 3, 4 ) 6297 0.637% 98.813% [ 4, 5 ) 1712 0.173% 98.986% [ 5, 6 ) 1134 0.115% 99.101% [ 6, 7 ) 1222 0.124% 99.224% [ 7, 8 ) 1529 0.155% 99.379% [ 8, 9 ) 1264 0.128% 99.507% [ 9, 10 ) 988 0.100% 99.607% [ 10, 12 ) 1378 0.139% 99.746% [ 12, 14 ) 1828 0.185% 99.931% [ 14, 16 ) 410 0.041% 99.972% [ 16, 18 ) 72 0.007% 99.980% [ 18, 20 ) 67 0.007% 99.986% [ 20, 25 ) 106 0.011% 99.997% [ 25, 30 ) 24 0.002% 99.999% [ 30, 35 ) 1 0.000% 100.000% [ 250, 300 ) 2 0.000% 100.000% [ 300, 350 ) 1 0.000% 100.000% [ 800, 900 ) 1 0.000% 100.000% Test Plan: Manually add a print in db_bench and make sure it prints out as expected. Will add some codes to cover the function Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D43611	2015-08-05 20:05:56 -07:00
Mike Kolupaev	e06cf1a098	[wal changes 3/3] method in DB to sync WAL without blocking writers Summary: Subj. We really need this feature. Previous diff D40899 has most of the changes to make this possible, this diff just adds the method. Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind. Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong Reviewed By: sdong Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba Differential Revision: https://reviews.facebook.net/D40905	2015-08-05 06:06:39 -07:00
Islam AbdelRahman	c45a57b41e	Support delete rate limiting Summary: Introduce DeleteScheduler that allow enforcing a rate limit on file deletion Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed. I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile Test Plan: added delete_scheduler_test existing unit tests Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43221	2015-08-04 20:45:27 -07:00
Yueh-Hsuan Chiang	3424eeb1ed	Polish HISTORY.md Summary: Polish HISTORY.md Test Plan: no code change. Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43527	2015-08-04 15:04:19 -07:00
Yueh-Hsuan Chiang	24daff6d7a	Fix a typo and update HISTORY.md for NewCompactOnDeletionCollectorFactory(). Summary: Fix a typo and update HISTORY.md for NewCompactOnDeletionCollectorFactory(). Test Plan: no code change. Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43521	2015-08-04 13:51:05 -07:00
Yueh-Hsuan Chiang	14d0bfa429	Add DBOptions::skip_sats_update_on_db_open Summary: UpdateAccumulatedStats() is used to optimize compaction decision esp. when the number of deletion entries are high, but this function can slowdown DBOpen esp. in disk environment. This patch adds DBOptions::skip_sats_update_on_db_open, which skips UpdateAccumulatedStats() in DB::Open() time when it's set to true. Test Plan: Add DBCompactionTest.SkipStatsUpdateTest Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: tnovak, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42843	2015-08-04 13:48:16 -07:00
agiardullo	8161bdb5a0	WriteBatch Save Points Summary: Support RollbackToSavePoint() in WriteBatch and WriteBatchWithIndex. Support for partial transaction rollback is needed for MyRocks. An alternate implementation of Transaction::RollbackToSavePoint() exists in D40869. However, the other implementation is messier because it is implemented outside of WriteBatch. This implementation is much cleaner and also exposes a potentially useful feature to WriteBatch. Test Plan: Added unit tests Reviewers: IslamAbdelRahman, kradhakrishnan, maykov, yoshinorim, hermanlee4, spetrunia, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42723	2015-07-29 16:54:23 -07:00
agiardullo	064294081b	Improved FileExists API Summary: Add new CheckFileExists method. Considered changing the FileExists api but didn't want to break anyone's builds. Test Plan: unit tests Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42003	2015-07-20 17:20:40 -07:00
Igor Canadi	a96fcd09b7	Deprecate CompactionFilterV2 Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature. Test Plan: make check Reviewers: noetzli, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42405	2015-07-17 18:59:11 +02:00
Igor Canadi	8a9fca2619	Better error handling in BackupEngine Summary: Couple of changes here: * NewBackupEngine() and NewReadOnlyBackupEngine() are now removed. They were deprecated since RocksDB 3.8. Changing these to new functions should be pretty straight-forward. As a followup, I'll fix all fbcode callsights * Instead of initializing backup engine in the constructor, we initialize it in a separate function now. That way, we can catch all errors and return appropriate status code. * We catch all errors during initializations and return them to the client properly. * Added new tests to backupable_db_test, to make sure that we can't open BackupEngine when there are Env errors. * Transitioned backupable_db_test to use BackupEngine rather than BackupableDB. From the two available APIs, judging by the current use-cases, it looks like BackupEngine API won. It's much more flexible since it doesn't require StackableDB. Test Plan: Added a new unit test to backupable_db_test Reviewers: yhchiang, sdong, AaronFeldman Reviewed By: AaronFeldman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41925	2015-07-14 20:51:36 +02:00
Igor Canadi	a9c5109515	Deprecate purge_redundant_kvs_while_flush Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode. Test Plan: none Reviewers: sdong, yhchiang, anthony, dhruba Reviewed By: dhruba Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42063	2015-07-14 13:07:02 +02:00
Igor Canadi	5aea98ddd8	Deprecate WriteOptions::timeout_hint_us Summary: In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments: * Less code == better icache hit rate, smaller builds, simpler code * The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future. * Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users. Test Plan: make check Reviewers: anthony, kradhakrishnan, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41937	2015-07-14 09:35:48 +02:00
Poornima Chozhiyath Raman	c0b23dd5b0	Enabling trivial move in universal compaction Summary: This change enables trivial move if all the input files are non onverlapping while doing Universal Compaction. Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases. Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40875	2015-07-07 14:18:55 -07:00
Yueh-Hsuan Chiang	59b50dcef9	Update HISTORY.md for Listener Summary: Update HISTORY.md for Listener Test Plan: no code change Reviewers: igor, sdong, IslamAbdelRahman, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41325	2015-07-07 12:39:36 -07:00
agiardullo	4159f5b87b	Prepare 3.12 Summary: About to cut release Test Plan: none Reviewers: igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41061	2015-07-02 12:20:36 -07:00
Aaron Feldman	a69bc91e37	Multithreaded backup and restore in BackupEngineImpl Summary: Add a new field: BackupableDBOptions.max_background_copies. CreateNewBackup() and RestoreDBFromBackup() will use this number of threads to perform copies. If there is a backup rate limit, then max_background_copies must be 1. Update backupable_db_test.cc to test multi-threaded backup and restore. Update backupable_db_test.cc to test backups when the backup environment is not the same as the database environment. Test Plan: Run ./backupable_db_test Run valgrind ./backupable_db_test Run with TSAN and ASAN Reviewers: yhchiang, rven, anthony, sdong, igor Reviewed By: igor Subscribers: yhchiang, anthony, sdong, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40725	2015-07-02 11:35:51 -07:00
Igor Canadi	0a019d74a0	Use malloc_usable_size() for accounting block cache size Summary: Currently, when we insert something into block cache, we say that the block cache capacity decreased by the size of the block. However, size of the block might be less than the actual memory used by this object. For example, 4.5KB block will actually use 8KB of memory. So even if we configure block cache to 10GB, our actually memory usage of block cache will be 20GB! This problem showed up a lot in testing and just recently also showed up in MongoRocks production where we were using 30GB more memory than expected. This diff will fix the problem. Instead of counting the block size, we will count memory used by the block. That way, a block cache configured to be 10GB will actually use only 10GB of memory. I'm using non-portable function and I couldn't find info on portability on Google. However, it seems to work on Linux, which will cover majority of our use-cases. Test Plan: 1. fill up mongo instance with 80GB of data 2. restart mongo with block cache size configured to 10GB 3. do a table scan in mongo 4. memory usage before the diff: 12GB. memory usage after the diff: 10.5GB Reviewers: sdong, MarkCallaghan, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40635	2015-06-26 11:48:09 -07:00
Islam AbdelRahman	674b1181cf	Bottommost level compaction option Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction. Test Plan: make check Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40527	2015-06-23 13:32:40 -07:00
Giuseppe Ottaviano	782a1590f9	Implement a table-level row cache Summary: Implementation of a table-level row cache. It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache. Supports snapshots and merge operations. Test Plan: Ran `make valgrind_check commit-prereq` Reviewers: igor, philipp, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39849	2015-06-23 10:25:45 -07:00
Igor Canadi	760e9a94de	Fail DB::Open() when the requested compression is not available Summary: Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users. This patch fails DB::Open() if we don't support the compression that is specified in the options. Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails. Reviewers: sdong, MarkCallaghan, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39687	2015-06-18 14:55:05 -07:00
Aaron Feldman	69bb210d58	Add Cache.GetPinnedUsageUsage() Summary: Add the funcion Cache.GetPinnedUsage() to return the memory size of entries that are in use by the system (that is, all the entries not in the LRU list). Test Plan: Run ./cache_test and examine PinnedUsageTest. Reviewers: tnovak, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40305	2015-06-18 13:56:31 -07:00
Islam AbdelRahman	4eabbdb7ec	Skip bottommost level compaction if possible Summary: This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level Changes in this patch - Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction - Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way Test Plan: make check adding new tests Reviewers: igor, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40059	2015-06-18 11:03:31 -07:00
Islam AbdelRahman	12e030a992	Use CompactRangeOptions for CompactRange Summary: This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters Old CompactRange is still available but deprecated Test Plan: make all check make rocksdbjava USE_CLANG=1 make all OPT=-DROCKSDB_LITE make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40209	2015-06-17 14:36:14 -07:00
sdong	40f562e747	Allow GetApproximateSize() to include mem table size if it is skip list memtable Summary: Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables. To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>. Test Plan: Add a test case Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40119	2015-06-16 18:13:23 -07:00
Igor Canadi	d59d90bb1f	db_bench periodically writes QPS to CSV file Summary: This is part of an effort to better understand and optimize RocksDB stalls under high load. I added a feature to db_bench to periodically write QPS to CSV files. That way we can nicely see how our QPS changes in time (especially when DB is stalled) and can do a better job of evaluating our stall system (i.e. we want the QPS to be as constant as possible, as opposed to having bunch of stalls) Cool part of CSV files is that we can easily graph them -- there are a bunch of tools available. Test Plan: Ran ./db_bench --report_interval_seconds=10 --benchmarks=fillrandom --num=10000000 and observed this in report.csv: secs_elapsed,interval_qps 10,2725860 20,1980480 30,1863456 40,1454359 50,1460389 Reviewers: sdong, MarkCallaghan, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40047	2015-06-12 14:31:53 -07:00
sdong	7842920be5	Slow down writes by bytes written Summary: We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch. The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work hard_rate_limit is deprecated. options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up. Test Plan: Add new unit tests in db_test Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor Reviewed By: igor Subscribers: ikabiljo, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36351	2015-06-11 20:42:18 -07:00
Igor Canadi	821cff114e	Re-generate WriteEntry on WBWIIterator::Entry() Summary: [This is the resubmit of D39813. Tests were failing, so I reverted the diff. I found the bug and I'm now resubmitting] If we don't do this, any calls to Entry() after WBWI mutation will result in undefined behavior. We need to re-fetch the offset from the skip list and regenerate the new pointer (because string's base pointer can change while mutating). Test Plan: COMPILE_WITH_ASAN=1 make write_batch_with_index_test && ./write_batch_with_index_test Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39897	2015-06-10 12:57:38 -07:00
Igor Canadi	75222d130e	Revert "Fix compile" This reverts commit `51440f83ec`. Revert "Re-generate WriteEntry on WBWIIterator::Entry()" This reverts commit `4949ef08db`.	2015-06-10 11:05:27 -07:00
Igor Canadi	4949ef08db	Re-generate WriteEntry on WBWIIterator::Entry() Summary: If we don't do this, any calls to Entry() after WBWI mutation will result in undefined behavior. We need to re-fetch the offset from the skip list and regenerate the new pointer (because string's base pointer can change while mutating). Test Plan: COMPILE_WITH_ASAN=1 make write_batch_with_index_test && ./write_batch_with_index_test Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39813	2015-06-10 10:35:19 -07:00
Venkatesh Radhakrishnan	406a5682eb	Fix hang when closing a DB after doing loads with WAL disabled. Summary: There is a hang during DB close in the following scenario: a) a load with WAL disabled was done, b) CancelAllBackgroundWork was called, c) DB Close was called This was because in that we will wait for a flush but we cannot do a background flush because we have called CancelAllBackgroundWork which marks the DB as shutting downn. Test Plan: Added DBTest FlushOnDestroy Reviewers: sdong Reviewed By: sdong Subscribers: yoshinorim, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39747	2015-06-09 10:39:49 -07:00
Islam AbdelRahman	643bbbf081	Use nullptr for default compaction_filter_factory Summary: Replacing the default value for compaction_filter_factory and compaction_filter_factory_v2 to be nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2 The reason for this is to be able to determine easily if we have compaction filter factory or not without depending on RTTI Test Plan: make check Reviewers: yoshinorim, ott, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39693	2015-06-08 16:34:26 -07:00
Yueh-Hsuan Chiang	2e764f06ea	[API Change] Improve EventListener::OnFlushCompleted interface Summary: EventListener::OnFlushCompleted() now passes a structure instead of a list of parameters. This minimizes the API change in the future. Test Plan: listener_test compact_files_test example/compact_files_example Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor Reviewed By: rven, igor Subscribers: IslamAbdelRahman, rven, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39543	2015-06-05 12:28:51 -07:00
sdong	4266d4fd90	Allow users to migrate to options.level_compaction_dynamic_level_bytes=true using CompactRange() Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true. Test Plan: Add a unit test for it. Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39099	2015-06-01 18:21:14 -07:00
agiardullo	dc9d70de65	Optimistic Transactions Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported. Test Plan: Added a new test, but still need to look into stress testing. Reviewers: yhchiang, igor, rven, sdong Reviewed By: sdong Subscribers: adamretter, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D33435	2015-05-29 14:36:35 -07:00
agiardullo	c815351038	Support saving history in memtable_list Summary: For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to someone keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit. After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much. This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit). However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions. Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here. Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37443	2015-05-28 16:34:24 -07:00
Yueh-Hsuan Chiang	672dda9b3b	[API Change] Move listeners from ColumnFamilyOptions to DBOptions Summary: Move listeners from ColumnFamilyOptions to DBOptions Test Plan: listener_test compact_files_test Reviewers: rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39087	2015-05-28 13:21:39 -07:00
Yueh-Hsuan Chiang	e2c1d4b57f	[Public API Change] Make DB::GetDbIdentity() be const function. Summary: Make DB::GetDbIdentity() be const function. Test Plan: make db_test Reviewers: igor, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38745	2015-05-21 11:01:48 -07:00
Karthikeyan Radhakrishnan	eaf61ba9f3	Minor text correction New features title was repeated twice. Fixed it.	2015-05-21 10:55:58 -07:00
Yueh-Hsuan Chiang	b588505a7f	Update HISTORY.md for GetThreadList() update. Summary: Update HISTORY.md for GetThreadList() update. Test Plan: no code change Reviewers: sdong, rven, anthony, krishnanm86, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38685	2015-05-19 18:41:57 -07:00
Karthikeyan Radhakrishnan	d5de04d20e	Update history for 3.11 Flipped the unreleased section to 3.11	2015-05-19 14:19:11 -07:00
Igor Canadi	4a855c0799	Add an option wal_bytes_per_sync to control sync_file_range for WAL files Summary: sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html Some users don't want us to call sync_file_range on WALs. Some other do. Thus, I'm adding a separate option wal_bytes_per_sync to control calling sync_file_range on WAL files. bytes_per_sync will apply only to table files now. Test Plan: no more sync_file_range for WAL as evidenced by strace Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38253	2015-05-18 17:03:59 -07:00
Aashish Pant	794ccfde89	Task 6532943: Rocksdb - SetCapacity() can dynamically change cache capacity if feasible Summary: When new capacity is larger than existing capacity, simply update the capacity to the new valie When new capacity is less than existing capacity, but more than the usage, simply update the capacity to new value When new capacity is less than the existing capacity and existing usage both, try to purge entries in LRU if feasible to make usage < capacity Test Plan: Created unit tests in cache_test.cc Reviewers: sdong, rven, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D37527	2015-04-24 14:12:58 -07:00
sdong	953a885ebf	A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge Summary: Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it. Also refactor the codes so that (1) make table property collector and internal table property collector two separate data structures with the later one now exposed (2) table builders only receive internal table properties Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector. Reviewers: yhchiang, igor.sugak, rven, igor Reviewed By: rven, igor Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D35373	2015-04-06 10:27:21 -07:00
sdong	b23bbaa82a	Universal Compactions with Small Files Summary: With this change, we use L1 and up to store compaction outputs in universal compaction. The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible. If options.num_levels=1, it behaves all the same as now. Test Plan: 1) convert most of existing unit tests for universal comapaction to include the option of one level and multiple levels. 2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels 3) add unit test to migrate from multiple level setting back to one level setting 4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results. Reviewers: rven, kradhakrishnan, yhchiang, igor Reviewed By: igor Subscribers: meyering, leveldb, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D34539	2015-03-30 15:12:02 -07:00
Igor Canadi	d61cb0b9de	db_bench can now disable flashcache for background threads Summary: Most of the approach is copied from WebSQL's MySQL branch. It's nice that we can do this without touching core RocksDB code. Test Plan: Compiles and runs. Didn't test flashback code, as I don't have flashback device and most if it is c/p Reviewers: MarkCallaghan, sdong Reviewed By: sdong Subscribers: rven, lgalanis, kradhakrishnan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35391	2015-03-30 09:51:11 -07:00
Yueh-Hsuan Chiang	39d508e34c	Add a missing section title in HISTORY.md Summary: Add a missing section title in HISTORY.md Test Plan: no code change	2015-03-25 14:14:26 -07:00
Yueh-Hsuan Chiang	2d417e52df	Update HISTORY.md for 3.10.0 Summary: Update HISTORY.md for 3.10.0 Test Plan: no code chagne. Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35871	2015-03-24 16:39:39 -07:00
Igor Canadi	b088c83e6e	Don't delete files when column family is dropped Summary: To understand the bug read t5943287 and check out the new test in column_family_test (ReadDroppedColumnFamily), iter 0. RocksDB contract allowes you to read a drop column family as long as there is a live reference. However, since our iteration ignores dropped column families, AddLiveFiles() didn't mark files of a dropped column families as live. So we deleted them. In this patch I no longer ignore dropped column families in the iteration. I think this behavior was confusing and it also led to this bug. Now if an iterator client wants to ignore dropped column families, he needs to do it explicitly. Test Plan: Added a new unit test that is failing on master. Unit test succeeds now. Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32535	2015-03-19 17:04:29 -07:00
agiardullo	81345b90f9	Create an abstract interface for write batches Summary: WriteBatch and WriteBatchWithIndex now both inherit from a common abstract base class. This makes it easier to write code that is agnostic toward the implementation of the particular write batch. In particular, I plan on utilizing this abstraction to allow transactions to support using either implementation of a write batch. Test Plan: modified existing WriteBatchWithIndex tests to test new functions. Running all tests. Reviewers: igor, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34017	2015-03-17 19:23:08 -07:00
Igor Canadi	c88ff4ca76	Deprecate removeScanCountLimit in NewLRUCache Summary: It is no longer used by the implementation, so we should also remove it from the public API. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34971	2015-03-17 15:04:37 -07:00
Igor Canadi	db03739340	options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically. Summary: When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases. In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier. Test Plan: New unit tests and pass tests suites including valgrind. Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo Reviewed By: ikabiljo Subscribers: yoshinorim, ikabiljo, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31437	2015-03-02 22:40:41 -08:00
Igor Canadi	b9ff6b050d	Fix a bug in ReadOnlyBackupEngine Summary: This diff fixes a bug introduced by D28521. Read-only backup engine can delete a backup that is later than the latest -- we never check the condition. I also added a bunch of logging that will help with debugging cases like this in the future. See more discussion at t6218248. Test Plan: Added a unit test that was failing before the change. Also, see new LOG file contents: https://phabricator.fb.com/P19738984 Reviewers: benj, sanketh, sumeet, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33897	2015-02-27 14:03:56 -08:00
sdong	68af7811ea	Remember whole key/prefix filtering on/off in SST file Summary: Remember whole key or prefix filtering on/off in SST files. If user opens the DB with a different setting that cannot be satisfied while reading the SST file, ignore the bloom filter. Test Plan: Add a unit test for it Reviewers: yhchiang, igor, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D32889	2015-02-11 11:20:04 -08:00
fyrz	cfe8837e43	Switch logv with loglevel to virtual	2015-02-09 20:59:29 +01:00
sdong	e63140d52b	Get() to use prefix bloom filter when filter is not block based Summary: Get() now doesn't make use of bloom filter if it is prefix based. Add the check. Didn't touch block based bloom filter. I can't fully reason whether it is correct to do that. But it's straight-forward to for full bloom filter. Test Plan: make all check Add a test case in DBTest Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D31941	2015-02-04 15:15:41 -08:00
sdong	5917de0bae	CappedFixTransform: return fixed length prefix, or full key if key is shorter than the fixed length Summary: Add CappedFixTransform, which is the same as fixed length prefix extractor, except that when slice is shorter than the fixed length, it will use the full key. Test Plan: Add a test case for db_test options_test and a new test Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D31887	2015-01-30 16:04:30 -08:00
Igor Canadi	2fd8f750ab	Compile MemEnv with standard RocksDB library Summary: This was a feature request by osquery. See task t5617758 Test Plan: compiles and memenv_test runs Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32115	2015-01-29 16:33:11 -08:00
Yueh-Hsuan Chiang	cd4c071973	Update HISTORY.md for GetThreadStatus() support on compaction.	2015-01-22 15:46:56 -08:00
Igor Canadi	9ab5adfc59	New BlockBasedTable version -- better compressed block format Summary: This diff adds BlockBasedTable format_version = 2. New format version brings better compressed block format for these compressions: 1) Zlib -- encode decompressed size in compressed block header 2) BZip2 -- encode decompressed size in compressed block header 3) LZ4 and LZ4HC -- instead of doing memcpy of size_t encode size as varint32. memcpy is very bad because the DB is not portable accross big/little endian machines or even platforms where size_t might be 8 or 4 bytes. It does not affect format for snappy. If you write a new database with format_version = 2, it will not be readable by RocksDB versions before 3.10. DB::Open() will return corruption in that case. Test Plan: Added a new test in db_test. I will also run db_bench and verify VSIZE when block_cache == 1GB Reviewers: yhchiang, rven, MarkCallaghan, dhruba, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31461	2015-01-14 16:24:24 -08:00
Igor Canadi	62ad0a9b19	Deprecating skip_log_error_on_recovery Summary: Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery. The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not only just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I have digged up old task t2416297 which confirms my findings. Test Plan: There was actually no tests that verified correct behavior of skip_log_error_on_recovery. Reviewers: yhchiang, rven, dhruba, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30603	2015-01-05 13:35:56 -08:00
Lei Jin	5045c43944	add support for nested BlockBasedTableOptions in config string Summary: Add support to allow nested config for block-based table factory. The format looks like this: "write_buffer_size=1024;block_based_table_factory={block_size=4k};max_write_buffer_num=2" Test Plan: unit test Reviewers: yhchiang, rven, igor, ljin, jonahcohen Reviewed By: jonahcohen Subscribers: jonahcohen, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29223	2014-12-22 16:34:21 -08:00
Igor Canadi	0acc738810	Speed up FindObsoleteFiles() Summary: There are two versions of FindObsoleteFiles(): * full scan, which is executed every 6 hours (and it's terribly slow) * no full scan, which is executed every time a background process finishes and iterator is deleted This diff is optimizing the second case (no full scan). Here's what we do before the diff: * Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live. * Get the list of live files to avoid deleting files that are live. * Delete files that are in obsolete_files and not in live_files. After this diff: * The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files. * Get the list of obsolete files (which exclude moved files). * No need to get the list of live files, since all files in obsolete_files need to be deleted. I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123 This depends on D30123. P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff. Test Plan: One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version. make check Big number of compactions and flushes: ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30249	2014-12-22 12:04:45 +01:00
Venkatesh Radhakrishnan	7661e5a76e	Move the file copy out of the mutex. Summary: We now release the mutex before copying the files in the case of the trivial move. This path does not use the compaction job. Test Plan: DBTest.LevelCompactionThirdPath Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30381	2014-12-16 16:57:22 -08:00
Venkatesh Radhakrishnan	153f4f0719	RocksDB: Allow Level-Style Compaction to Place Files in Different Paths Summary: Allow Level-style compaction to place files in different paths This diff provides the code for task 4854591. We now support level-compaction to place files in different paths by specifying them in db_paths along with the minimum level for files to store in that path. Test Plan: ManualLevelCompactionOutputPathId in db_test.cc Reviewers: yhchiang, MarkCallaghan, dhruba, yoshinorim, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29799	2014-12-15 21:48:16 -08:00
Igor Canadi	06eed650a0	Optimize default compile to compilation platform by default Summary: This diff changes compile to optimize for native platform by default. This will automatically turn on crc32 optimizations for modern processors, which greatly improves rocksdb's performance. I also did some more changes to compilation documentation. Test Plan: compile with `make`, observe -march=native compile with `PORTABLE=1 make`, observe no -march=native Reviewers: sdong, rven, yhchiang, MarkCallaghan Reviewed By: MarkCallaghan Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30225	2014-12-15 11:29:41 +01:00
Alexey Maykov	ee95cae9a4	Modifed the LRU cache eviction code so that it doesn't evict blocks which have exteranl references Summary: Currently, blocks which have more than one reference (ie referenced by something other than cache itself) are evicted from cache. This doesn't make much sense: - blocks are still in RAM, so the RAM usage reported by the cache is incorrect - if the same block is needed by another iterator, it will be loaded and decompressed again This diff changes the reference counting scheme a bit. Previously, if the cache contained the block, this was accounted for in its refcount. After this change, the refcount is only used to track external references. There is a boolean flag which indicates whether or not the block is contained in the cache. This diff also changes how LRU list is used. Previously, both hashtable and the LRU list contained all blocks. After this change, the LRU list contains blocks with the refcount==0, ie those which can be evicted from the cache. Note that this change still allows for cache to grow beyond its capacity. This happens when all blocks are pinned (ie refcount>0). This is consistent with the current behavior. The cache's insert function never fails. I spent lots of time trying to make table_reader and other places work with the insert which might failed. It turned out to be pretty hard. It might really destabilize some customers, so finally, I decided against doing this. table_cache_remove_scan_count_limit option will be unneeded after this change, but I will remove it in the following diff, if this one gets approved Test Plan: Ran tests, made sure they pass Reviewers: sdong, ljin Differential Revision: https://reviews.facebook.net/D25503	2014-12-10 22:28:53 -08:00
Yueh-Hsuan Chiang	1b7fbb9e82	Update HISTORY.md for release 3.9	2014-12-08 15:19:48 -08:00
Jonah Cohen	a14b7873ee	Enforce write buffer memory limit across column families Summary: Introduces a new class for managing write buffer memory across column families. We supplement ColumnFamilyOptions::write_buffer_size with ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer instance that enforces memory limits before flushing out to disk. Test Plan: Added SharedWriteBuffer unit test to db_test.cc Reviewers: sdong, rven, ljin, igor Reviewed By: igor Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim Differential Revision: https://reviews.facebook.net/D22581	2014-12-02 12:09:20 -08:00
Venkatesh Radhakrishnan	004f416b77	Moved checkpoint to utilities Summary: Moved checkpoint to utilities. Addressed comments by Igor, Siying, Dhruba Test Plan: db_test/SnapshotLink Reviewers: dhruba, igor, sdong Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29079	2014-11-20 15:54:47 -08:00
Yueh-Hsuan Chiang	d0c5f28a5c	Introduce GetThreadList API Summary: Add GetThreadList API, which allows developer to track the status of each process. Currently, calling GetThreadList will only get the list of background threads in RocksDB with their thread-id and thread-type (priority) set. Will add more support on this in the later diffs. ThreadStatus currently has the following properties: // An unique ID for the thread. const uint64_t thread_id; // The type of the thread, it could be ROCKSDB_HIGH_PRIORITY, // ROCKSDB_LOW_PRIORITY, and USER_THREAD const ThreadType thread_type; // The name of the DB instance where the thread is currently // involved with. It would be set to empty string if the thread // does not involve in any DB operation. const std::string db_name; // The name of the column family where the thread is currently // It would be set to empty string if the thread does not involve // in any column family. const std::string cf_name; // The event that the current thread is involved. // It would be set to empty string if the information about event // is not currently available. Test Plan: ./thread_list_test export ROCKSDB_TESTS=GetThreadList ./db_test Reviewers: rven, igor, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D25047	2014-11-20 10:49:32 -08:00
Yueh-Hsuan Chiang	7fe247080f	Update HISTORY.md for RocksJava	2014-11-14 11:23:00 -08:00
Igor Canadi	94fa542f82	Update HISTROY.md for 3.8 release	2014-11-14 10:43:12 -08:00
Hasnain Lakhani	31b02dc21d	Improve Backup Engine. Summary: Improve the backup engine by not deleting the corrupted backup when it is detected; instead leaving it to the client to delete the corrupted backup. Also add a BackupEngine::Open() call. Test Plan: Add check to CorruptionTest inside backupable_db_test to check that the corrupt backups are not deleted. The previous version of the code failed this test as backups were deleted, but after the changes in this commit, this test passes. Run make check to ensure that no other tests fail. Reviewers: sdong, benj, sanketh, sumeet, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28521	2014-11-13 09:51:41 -08:00
Lei Jin	c02338a698	update HISOTRY.md for new release Summary: as title Test Plan: n/a	2014-11-06 11:02:30 -08:00
Lei Jin	29a9161f34	Note dynamic options in options.h Summary: as title Test Plan: n/a Reviewers: igor, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28287	2014-11-04 16:23:45 -08:00
sdong	09899f0b51	DB::Open() to automatically increase thread pool size if it is smaller than max number of parallel compactions or flushes Summary: With the patch, thread pool size will be automatically increased if DB's options ask for more parallelism of compactions or flushes. Too many users have been confused by the API. Change it to make it harder for users to make mistakes Test Plan: Add two unit tests to cover the function. Reviewers: yhchiang, rven, igor, MarkCallaghan, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27555	2014-11-03 17:22:34 -08:00
sdong	f441b273ae	WriteBatchWithIndex to support an option to overwrite rows when operating the same key Summary: With a new option, when accepting a new key, WriteBatchWithIndex will find an existing index of the same key, and replace the content of it. Test Plan: Add a unit test case. Reviewers: ljin, yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D24753	2014-10-10 15:19:21 -07:00
Lei Jin	cd0d581ff5	convert Options from string Summary: Allow accepting Options as a string of key/value pairs Test Plan: unit test Reviewers: yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D24597	2014-10-10 10:00:12 -07:00
Igor Canadi	1d525891bd	Update HISTORY for 3.6	2014-10-07 11:59:30 -07:00
Igor Canadi	21ddcf6e4f	Remove allow_thread_local Summary: See https://reviews.facebook.net/D19365 Test Plan: compiles Reviewers: sdong, yhchiang, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23907	2014-09-24 13:12:16 -07:00
Lei Jin	57a32f147f	change target_file_size_base to uint64_t Summary: It contrains the file size to be 4G max with int Test Plan: tried to grep instance and made sure other related variables are also uint64 Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23697	2014-09-22 11:15:03 -07:00
Lei Jin	51af7c326c	CuckooTable: add one option to allow identity function for the first hash function Summary: MurmurHash becomes expensive when we do millions Get() a second in one thread. Add this option to allow the first hash function to use identity function as hash function. It results in QPS increase from 3.7M/s to ~4.3M/s. I did not observe improvement for end to end RocksDB performance. This may be caused by other bottlenecks that I will address in a separate diff. Test Plan: ``` [ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=0 ==== Test CuckooReaderTest.WhenKeyExists ==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator ==== Test CuckooReaderTest.CheckIterator ==== Test CuckooReaderTest.CheckIteratorUint64 ==== Test CuckooReaderTest.WhenKeyNotFound ==== Test CuckooReaderTest.TestReadPerformance With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.272us (3.7 Mqps) with batch size of 0, # of found keys 125829120 With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.138us (7.2 Mqps) with batch size of 10, # of found keys 125829120 With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.142us (7.1 Mqps) with batch size of 25, # of found keys 125829120 With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.142us (7.0 Mqps) with batch size of 50, # of found keys 125829120 With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.144us (6.9 Mqps) with batch size of 100, # of found keys 125829120 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.201us (5.0 Mqps) with batch size of 0, # of found keys 104857600 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.121us (8.3 Mqps) with batch size of 10, # of found keys 104857600 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.123us (8.1 Mqps) with batch size of 25, # of found keys 104857600 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.121us (8.3 Mqps) with batch size of 50, # of found keys 104857600 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.112us (8.9 Mqps) with batch size of 100, # of found keys 104857600 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.251us (4.0 Mqps) with batch size of 0, # of found keys 83886080 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.107us (9.4 Mqps) with batch size of 10, # of found keys 83886080 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.099us (10.1 Mqps) with batch size of 25, # of found keys 83886080 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.100us (10.0 Mqps) with batch size of 50, # of found keys 83886080 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.116us (8.6 Mqps) with batch size of 100, # of found keys 83886080 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.189us (5.3 Mqps) with batch size of 0, # of found keys 73400320 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.095us (10.5 Mqps) with batch size of 10, # of found keys 73400320 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.096us (10.4 Mqps) with batch size of 25, # of found keys 73400320 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.098us (10.2 Mqps) with batch size of 50, # of found keys 73400320 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.105us (9.5 Mqps) with batch size of 100, # of found keys 73400320 [ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=1 ==== Test CuckooReaderTest.WhenKeyExists ==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator ==== Test CuckooReaderTest.CheckIterator ==== Test CuckooReaderTest.CheckIteratorUint64 ==== Test CuckooReaderTest.WhenKeyNotFound ==== Test CuckooReaderTest.TestReadPerformance With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.230us (4.3 Mqps) with batch size of 0, # of found keys 125829120 With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.086us (11.7 Mqps) with batch size of 10, # of found keys 125829120 With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.088us (11.3 Mqps) with batch size of 25, # of found keys 125829120 With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.083us (12.1 Mqps) with batch size of 50, # of found keys 125829120 With 125829120 items, utilization is 93.75%, number of hash functions: 2. Time taken per op is 0.083us (12.1 Mqps) with batch size of 100, # of found keys 125829120 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.159us (6.3 Mqps) with batch size of 0, # of found keys 104857600 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.078us (12.8 Mqps) with batch size of 10, # of found keys 104857600 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.080us (12.6 Mqps) with batch size of 25, # of found keys 104857600 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.080us (12.5 Mqps) with batch size of 50, # of found keys 104857600 With 104857600 items, utilization is 78.12%, number of hash functions: 2. Time taken per op is 0.082us (12.2 Mqps) with batch size of 100, # of found keys 104857600 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.154us (6.5 Mqps) with batch size of 0, # of found keys 83886080 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.077us (13.0 Mqps) with batch size of 10, # of found keys 83886080 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.077us (12.9 Mqps) with batch size of 25, # of found keys 83886080 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.078us (12.8 Mqps) with batch size of 50, # of found keys 83886080 With 83886080 items, utilization is 62.50%, number of hash functions: 2. Time taken per op is 0.079us (12.6 Mqps) with batch size of 100, # of found keys 83886080 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.218us (4.6 Mqps) with batch size of 0, # of found keys 73400320 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.083us (12.0 Mqps) with batch size of 10, # of found keys 73400320 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.085us (11.7 Mqps) with batch size of 25, # of found keys 73400320 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.086us (11.6 Mqps) with batch size of 50, # of found keys 73400320 With 73400320 items, utilization is 54.69%, number of hash functions: 2. Time taken per op is 0.078us (12.8 Mqps) with batch size of 100, # of found keys 73400320 ``` Reviewers: sdong, igor, yhchiang Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23451	2014-09-18 11:00:48 -07:00
Igor Canadi	4a27a2f193	Don't sync manifest when disableDataSync = true Summary: As we discussed offline Test Plan: compiles Reviewers: yhchiang, sdong, ljin, dhruba Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22989	2014-09-15 11:32:01 -07:00
Igor Canadi	88841bd007	Explicitly cast char to signed char in Hash() Summary: The compilers we use treat char as signed. However, this is not guarantee of C standard and some compilers (for ARM platform for example), treat char as unsigned. Code that assumes that char is either signed or unsigned is wrong. This change explicitly casts the char to signed version. This will not break any of our use cases on x86, which, I believe are all of them. In case somebody out there is using RocksDB on ARM AND using bloom filters, they're going to have a bad time. However, it is very unlikely that this is the case. Test Plan: sanity test with previous commit (with new sanity test) Reviewers: yhchiang, ljin, sdong Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D22767	2014-09-08 18:57:40 -07:00
Igor Canadi	a2bb7c3c33	Push- instead of pull-model for managing Write stalls Summary: Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either: * proceed with all writes without delay * delay all writes by fixed time * stop all writes The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case). When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal. This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write. Test Plan: make check for now. I'll add some unit tests later. Also, perf test. Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22791	2014-09-08 11:20:25 -07:00
Igor Canadi	8de151bb99	Add db_bench with lots of column families to regression tests Summary: That way we can see when this graph goes up and be happy. Couple of changes: 1. title 2. fix db_bench to delete column families before deleting the DB. this was asserting when compiled in debug mode 3. don't sync manifest when disableDataSync. We discussed this offline. I can move it to separate diff if you'd like Test Plan: ran it Reviewers: sdong, yhchiang, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22815	2014-09-05 14:20:18 -07:00
Raghav Pisolkar	e0b99d4f5d	created a new ReadOptions parameter 'iterate_upper_bound'	2014-09-04 11:00:16 -07:00
Lei Jin	1b1d9619ff	update HISTORY.md Summary: as title Test Plan: no Reviewers: sdong, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22761	2014-09-03 17:03:30 -07:00
Lei Jin	1755581f19	improve OptimizeForPointLookup() Summary: also fix HISTORY.md Test Plan: make all check Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22437	2014-08-26 14:15:00 -07:00
Lei Jin	384400128f	move block based table related options BlockBasedTableOptions Summary: I will move compression related options in a separate diff since this diff is already pretty lengthy. I guess I will also need to change JNI accordingly :( Test Plan: make all check Reviewers: yhchiang, igor, sdong Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21915	2014-08-25 14:22:05 -07:00
sdong	28b5c76004	WriteBatchWithIndex: a wrapper of WriteBatch, with a searchable index Summary: Add WriteBatchWithIndex so that a user can query data out of a WriteBatch, to support MongoDB's read-its-own-write. WriteBatchWithIndex uses a skiplist to store the binary index. The index stores the offset of the entry in the write batch. When searching for a key, the key for the entry is read by read the entry from the write batch from the offset. Define a new iterator class for querying data out of WriteBatchWithIndex. A user can create an iterator of the write batch for one column family, seek to a key and keep calling Next() to see next entries. I will add more unit tests if people are OK about this API. Test Plan: make all check Add unit tests. Reviewers: yhchiang, igor, MarkCallaghan, ljin Reviewed By: ljin Subscribers: dhruba, leveldb, xjin Differential Revision: https://reviews.facebook.net/D21381	2014-08-18 16:37:38 -07:00
sdong	5585e00279	Update release note of 3.4 Summary: N/A Test Plan: N/A Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: dhruba, leveldb, igor Differential Revision: https://reviews.facebook.net/D22053	2014-08-18 16:37:03 -07:00
Spencer Kimball	bcefedeb6d	Update HISTORY.md	2014-08-13 09:53:32 -04:00
Igor Canadi	e4c3673923	Never CompactRange to level 0 in level compaction Summary: I was bit by this when developing SpatialDB. In case all files are at level 0, CompactRange() will output the compacted files to level 0. This is not ideal, since read amp. is much better at level 1 and higher. Test Plan: Compacted data in SpatialDB, read manifest using ldb, verified that files are now at level 1 instead of 0. Reviewers: sdong, ljin, yhchiang, dhruba Reviewed By: dhruba Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D20901	2014-08-01 06:41:48 -07:00
sdong	f04356e660	Add DB::GetIntProperty() to return integer properties to be returned as integers Summary: We have quite some properties that are integers and we are adding more. Add a function to directly return them as an integer, instead of a string Test Plan: Add several unit test checks Reviewers: yhchiang, igor, dhruba, haobo, ljin Reviewed By: ljin Subscribers: yoshinorim, leveldb Differential Revision: https://reviews.facebook.net/D20637	2014-07-28 16:55:57 -07:00
sdong	f6784766db	Add DB property estimated number of keys Summary: Add a DB property of estimated number of live keys, by adding number of entries of all mem tables and all files, subtracted by all deletions in all files. Test Plan: Add the case in unit tests Reviewers: hobbymanyp, ljin Reviewed By: ljin Subscribers: MarkCallaghan, yoshinorim, leveldb, igor, dhruba Differential Revision: https://reviews.facebook.net/D20631	2014-07-28 15:27:58 -07:00

1 2 3 4 5

210 Commits