rocksdb

Author	SHA1	Message	Date
Pratik Dhandharia	1b4c104a67	replace some reinterpret_cast with static_cast_with_check (#5740 ) Summary: This PR focuses on replacing some of the reinterpret_cast<DBImpl*> to static_cast_with_check<DBImpl, DB>. Files impacted: ./db/db_impl/db_impl_compaction_flush.cc ./db/write_batch.cc ./utilities/blob_db/blob_db_impl.cc ./utilities/transactions/pessimistic_transaction_db.cc ./utilities/transactions/transaction_base.cc ./utilities/transactions/write_prepared_txn_db.cc ./utilities/transactions/write_unprepared_txn_db.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/5740 Differential Revision: D17055691 Pulled By: pdhandharia fbshipit-source-id: 0f8034d1b32eade56e37d59c04b7bf236a81d8e8	2019-08-27 10:59:11 -07:00
Zhongyi Xie	2f41ecfe75	Refactor trimming logic for immutable memtables (#5022 ) Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5	2019-08-23 13:55:34 -07:00
sdong	e0515607bc	Blacklist TransactionTest.GetWithoutSnapshot from valgrind_test (#5715 ) Summary: In valgrind_test, TransactionTest.GetWithoutSnapshot ran 2 hours and still didn't finish. Black list from valgrind_test to prevent timeout. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5715 Test Plan: run "make valgrind_test" and see whether the test is still generated. Differential Revision: D16866009 fbshipit-source-id: 92c78049b0bc1c2b9a0dfc1b7c8a9206b36f02f0	2019-08-16 15:36:49 -07:00
Manuel Ung	7785f61132	WriteUnPrepared: Fix bug in savepoints (#5703 ) Summary: Fix a bug in write unprepared savepoints. When flushing the write batch according to savepoint boundaries, we were forgetting to flush the last write batch after the last savepoint, meaning that some data was not written to DB. Also, add a small optimization where we avoid flushing empty batches. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5703 Differential Revision: D16811996 Pulled By: lth fbshipit-source-id: 600c7e0e520ad7a8fad32d77e11d932453e68e3f	2019-08-14 16:15:46 -07:00
Manuel Ung	4c70cb7306	WriteUnPrepared: support iterating while writing to transaction (#5699 ) Summary: In MyRocks, there are cases where we write while iterating through keys. This currently breaks WBWIIterator, because if a write batch flushes during iteration, the delta iterator would point to invalid memory. For now, fix by disallowing flush if there are active iterators. In the future, we will loop through all the iterators on a transaction, and refresh the iterators when a write batch is flushed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5699 Differential Revision: D16794157 Pulled By: lth fbshipit-source-id: 5d5bf70688bd68fe58e8a766475ae88fd1be3190	2019-08-14 14:28:53 -07:00
Zhongyi Xie	90cd6c2bb1	Fix double deletion in transaction_test (#5700 ) Summary: Fix the following clang analyze failures: ``` In file included from utilities/transactions/transaction_test.cc:8: ./utilities/transactions/transaction_test.h:174:14: warning: Attempt to delete released memory delete root_db; ^ ``` The destructor of StackableDB already deletes the root db and there is no need to delete the db separately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5700 Test Plan: USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j24 analyze Differential Revision: D16800579 Pulled By: maysamyabandeh fbshipit-source-id: 64c2d70f23e07e6a15242add97c744902ea33be5	2019-08-13 21:54:55 -07:00
Manuel Ung	8a678a50ba	WriteUnPrepared: Relax restriction on iterators and writes with no snapshot (#5697 ) Summary: Currently, if a write is done without a snapshot, then `largest_validated_seq_` is set to `kMaxSequenceNumber`. This is too aggressive, because an iterator with a snapshot created after this write should be valid. Set `largest_validated_seq_` to `GetLastPublishedSequence` instead. The variable means that no keys in the current tracked key set has changed by other transactions since `largest_validated_seq_`. Also, do some extra cleanup in Clear() for safety. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5697 Differential Revision: D16788613 Pulled By: lth fbshipit-source-id: f2aa40b8b12e0c0cf9e38c940fecc8f1cc0d2385	2019-08-13 13:11:51 -07:00
Maysam Yabandeh	64855979ae	WriteUnPrepared: Pass snap_released to the callback (#5691 ) Summary: With changes made in https://github.com/facebook/rocksdb/pull/5664 we meant to pass snap_released parameter of ::IsInSnapshot from the read callbacks. Although the variable was defined, passing it to the callback in WritePreparedTxnReadCallback was missing, which is fixed in this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5691 Differential Revision: D16767310 Pulled By: maysamyabandeh fbshipit-source-id: 3bf53f5964a2756a66ceef7c8f6b3ac75f102f48	2019-08-12 12:20:46 -07:00
Manuel Ung	6f0f82de87	WriteUnPrepared: increase test coverage in transaction_test (#5658 ) Summary: The changes transaction_test to set `txn_db_options.default_write_batch_flush_threshold = 1` in order to give better test coverage for WriteUnprepared. As part of the change, some tests had to be updated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5658 Differential Revision: D16740468 Pulled By: lth fbshipit-source-id: 3821eec20baf13917c8c1fab444332f75a509de9	2019-08-12 12:16:04 -07:00
Maysam Yabandeh	12eaacb71d	WritePrepared: Fix SmallestUnCommittedSeq bug (#5683 ) Summary: SmallestUnCommittedSeq reads two data structures, prepared_txns_ and delayed_prepared_. These two are updated in CheckPreparedAgainstMax when max_evicted_seq_ advances some prepared entires. To avoid the cost of acquiring a mutex, the read from them in SmallestUnCommittedSeq is not atomic. This creates a potential race condition. The fix is to read the two data structures in the reverse order of their update. CheckPreparedAgainstMax copies the prepared entry to delayed_prepared_ before removing it from prepared_txns_ and SmallestUnCommittedSeq looks into prepared_txns_ before reading delayed_prepared_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5683 Differential Revision: D16744699 Pulled By: maysamyabandeh fbshipit-source-id: b1bdb134018beb0b9de58827f512662bea35cad0	2019-08-09 16:40:00 -07:00
Vijay Nadimpalli	d150e01474	New API to get all merge operands for a Key (#5604 ) Summary: This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases: 1. Update subset of columns and read subset of columns - Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU. 2. Updating very few attributes in a value which is a JSON-like document - Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge. ---------------------------------------------------------------------------------------------------- API : Status GetMergeOperands( const ReadOptions& options, ColumnFamilyHandle* column_family, const Slice& key, PinnableSlice* merge_operands, GetMergeOperandsOptions* get_merge_operands_options, int* number_of_operands) Example usage : int size = 100; int number_of_operands = 0; std::vector<PinnableSlice> values(size); GetMergeOperandsOptions merge_operands_info; db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands); Description : Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion. merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604 Test Plan: Added unit test and perf test in db_bench that can be run using the command: ./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist Differential Revision: D16657366 Pulled By: vjnadimpalli fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf	2019-08-06 14:26:44 -07:00
Maysam Yabandeh	208556ee13	WritePrepared: fix Get without snapshot (#5664 ) Summary: if read_options.snapshot is not set, ::Get will take the last sequence number after taking a super-version and uses that as the sequence number. Theoretically max_eviceted_seq_ could advance this sequence number. This could lead ::IsInSnapshot that will be invoked by the ReadCallback to notice the absence of the snapshot. In this case, the ReadCallback should have passed a non-value to snap_released so that it could be set by the ::IsInSnapshot. The patch does that, and adds a unit test to verify it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5664 Differential Revision: D16614033 Pulled By: maysamyabandeh fbshipit-source-id: 06fb3fd4aacd75806ed1a1acec7961f5d02486f2	2019-08-05 13:41:21 -07:00
Maysam Yabandeh	e579e32eaa	Disable ReadYourOwnWriteStress when run under Valgrind (#5671 ) Summary: It sometimes times out when run under valgrind taking around 20m. The patch skips the test under Valgrind. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5671 Differential Revision: D16652382 Pulled By: maysamyabandeh fbshipit-source-id: 0f6f4f76d37337d56226b689e01b14523dd07aae	2019-08-05 13:35:39 -07:00
Manuel Ung	f622ca2c7c	WriteUnPrepared: savepoint support (#5627 ) Summary: Add savepoint support when the current transaction has flushed unprepared batches. Rolling back to savepoint is similar to rolling back a transaction. It requires the set of keys that have changed since the savepoint, re-reading the keys at the snapshot at that savepoint, and the restoring the old keys by writing out another unprepared batch. For this strategy to work though, we must be capable of reading keys at a savepoint. This does not work if keys were written out using the same sequence number before and after a savepoint. Therefore, when we flush out unprepared batches, we must split the batch by savepoint if any savepoints exist. eg. If we have the following: ``` Put(A) Put(B) Put(C) SetSavePoint() Put(D) Put(E) SetSavePoint() Put(F) ``` Then we will write out 3 separate unprepared batches: ``` Put(A) 1 Put(B) 1 Put(C) 1 Put(D) 2 Put(E) 2 Put(F) 3 ``` This is so that when we rollback to eg. the first savepoint, we can just read keys at snapshot_seq = 1. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5627 Differential Revision: D16584130 Pulled By: lth fbshipit-source-id: 6d100dd548fb20c4b76661bd0f8a2647e64477fa	2019-07-31 13:39:39 -07:00
Manuel Ung	d599135a03	WriteUnPrepared: use WriteUnpreparedTxnReadCallback for ValidateSnapshot (#5657 ) Summary: In DeferSnapshotSavePointTest, writes were failing with snapshot validation error because the key with the latest sequence number was an unprepared key from the current transaction. Fix this by passing down the correct read callback. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5657 Differential Revision: D16582466 Pulled By: lth fbshipit-source-id: 11645dac0e7c1374d917ef5fdf757d13c1d1108d	2019-07-31 10:44:56 -07:00
Manuel Ung	399f477818	WriteUnPrepared: Use WriteUnpreparedTxnReadCallback for MultiGet (#5634 ) Summary: The `TransactionTest.MultiGetBatchedTest` were failing with unprepared batches because we were not using the correct callbacks. Override MultiGet to pass down the correct ReadCallback. A similar problem is also fixed in WritePrepared. This PR also fixes an issue similar to (https://github.com/facebook/rocksdb/pull/5147), but for MultiGet instead of Get. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5634 Differential Revision: D16552674 Pulled By: lth fbshipit-source-id: 736eaf8e919c6b13d5f5655b1c0d36b57ad04804	2019-07-29 17:56:13 -07:00
Manuel Ung	80d7067cb2	Use int64_t instead of ssize_t (#5638 ) Summary: The ssize_t type was introduced in https://github.com/facebook/rocksdb/pull/5633, but it seems like it's a POSIX specific type. I just need a signed type to represent number of bytes, so use int64_t instead. It seems like we have a typedef from SSIZE_T for Windows, but it doesn't seem like we ever include "port/port.h" in our public header files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5638 Differential Revision: D16526269 Pulled By: lth fbshipit-source-id: 8d3a5c41003951b74b29bc5f1d949b2b22da0cee	2019-07-26 16:36:49 -07:00
Manuel Ung	41df734830	WriteUnPrepared: Add new variable write_batch_flush_threshold (#5633 ) Summary: Instead of reusing `TransactionOptions::max_write_batch_size` for determining when to flush a write batch for write unprepared, add a new variable called `write_batch_flush_threshold` for this use case instead. Also add `TransactionDBOptions::default_write_batch_flush_threshold` which sets the default value if `TransactionOptions::write_batch_flush_threshold` is unspecified. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5633 Differential Revision: D16520364 Pulled By: lth fbshipit-source-id: d75ae5a2141ce7708982d5069dc3f0b58d250e8c	2019-07-26 12:56:26 -07:00
Manuel Ung	230b909da8	Fix PopSavePoint to merge info into the previous savepoint (#5628 ) Summary: Transaction::RollbackToSavePoint undos the modification made since the SavePoint beginning, and also unlocks the corresponding keys, which are tracked in the last SavePoint. Currently ::PopSavePoint simply discard these tracked keys, leaving them locked in the lock manager. This breaks a subsequent ::RollbackToSavePoint behavior as it loses track of such keys, and thus cannot unlock them. The patch fixes ::PopSavePoint by passing on the track key information to the previous SavePoint. Fixes https://github.com/facebook/rocksdb/issues/5618 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5628 Differential Revision: D16505325 Pulled By: lth fbshipit-source-id: 2bc3b30963ab4d36d996d1f66543c93abf358980	2019-07-26 11:39:30 -07:00
Manuel Ung	66b524a911	Simplify WriteUnpreparedTxnReadCallback and fix some comments (#5621 ) Summary: Simplify WriteUnpreparedTxnReadCallback so we just have one function `CalcMaxVisibleSeq`. Also, there's no need for the read callback to hold onto the transaction any more, so just hold the set of unprep_seqs, reducing about of indirection in `IsVisibleFullCheck`. Also, some comments about using transaction snapshot were out of date, so remove them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5621 Differential Revision: D16459883 Pulled By: lth fbshipit-source-id: cd581323fd18982e817d99af57b6eaba59e599bb	2019-07-24 10:25:26 -07:00
Manuel Ung	eae832740b	WriteUnPrepared: improve read your own write functionality (#5573 ) Summary: There are a number of fixes in this PR (with most bugs found via the added stress tests): 1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`. 2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snaphsot validation would not be enforced, breaking some assumptions in Prev() iteration. 3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead. 4. Add stress test to test reading own writes. 5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_. 6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value. 7. Add some more comments everywhere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573 Differential Revision: D16276089 Pulled By: lth fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb	2019-07-23 08:08:19 -07:00
Manuel Ung	0acaa1a846	WriteUnPrepared: use tracked_keys_ to track keys needed for rollback (#5562 ) Summary: Currently, we are tracking keys we need to rollback via a separate structure specific to WriteUnprepared in write_set_keys_. We already have a data structure called tracked_keys_ used to track which keys to unlock on transaction termination. This is exactly what we want, since we should only rollback keys that we have locked anyway. Save some memory by reusing that data structure instead of making our own. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5562 Differential Revision: D16206484 Pulled By: lth fbshipit-source-id: 5894d2b824a4b19062d84adbd6e6e86f00047488	2019-07-16 15:24:56 -07:00
Maysam Yabandeh	60f3ec2ca5	Fix appveyor compliant about passing const to thread (#5447 ) Summary: CLANG would complain if we pass const to lambda function and appveyor complains if we don't (https://github.com/facebook/rocksdb/pull/5443). The patch fixes that by using the default capture mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5447 Differential Revision: D15788722 Pulled By: maysamyabandeh fbshipit-source-id: 47e7f49264afe31fdafe42cb8bf93da126abfca9	2019-06-12 15:06:22 -07:00
Maysam Yabandeh	4a285d0dd3	Remove passing const variable to thread (#5443 ) Summary: CLANG complains that passing const to thread is not necessary. The patch removes it form PreparedHeap::Concurrent test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5443 Differential Revision: D15781598 Pulled By: maysamyabandeh fbshipit-source-id: 3aceb05d96182fa4726d6d37eed45fd3aac4c016	2019-06-12 09:45:57 -07:00
Maysam Yabandeh	773f914a40	WritePrepared: switch PreparedHeap from priority_queue to deque (#5436 ) Summary: Internally PreparedHeap is currently using a priority_queue. The rationale was the in the initial design PreparedHeap::AddPrepared could be called in arbitrary order. With the recent optimizations, we call ::AddPrepared only from the main write queue, which results into in-order insertion into PreparedHeap. The patch thus replaces the underlying priority_queue with a more efficient deque implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5436 Differential Revision: D15752147 Pulled By: maysamyabandeh fbshipit-source-id: e6960f2b2097e13137dded1ceeff3b10b03b0aeb	2019-06-11 19:55:14 -07:00
Manuel Ung	ca1aee2a19	WriteUnprepared: commit only from the 2nd queue (#5439 ) Summary: This is a port of this PR into WriteUnprepared: https://github.com/facebook/rocksdb/pull/5014 This also reverts this test change to restore some flaky write unprepared tests: https://github.com/facebook/rocksdb/pull/5315 Tested with: $ gtest-parallel ./transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 --repeat=128 [128/128] MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 (18250 ms) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5439 Differential Revision: D15761405 Pulled By: lth fbshipit-source-id: ae2581fd942d8a5b3f9278fd6bc3c1ac0b2c964c	2019-06-11 18:01:39 -07:00
sdong	58c4aee42e	TransactionUtil::CheckKey() to skip unnecessary history (#4941 ) Summary: If a memtable definitely covers a key, there isn't a need to check older memtables. We can skip them by checking the earliest sequence number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941 Differential Revision: D13932666 fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b	2019-06-11 11:46:42 -07:00
Maysam Yabandeh	c292dc8540	WritePrepared: reduce prepared_mutex_ overhead (#5420 ) Summary: The patch reduces the contention over prepared_mutex_ using these techniques: 1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues. 2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop() 3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420 Differential Revision: D15741985 Pulled By: maysamyabandeh fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735	2019-06-10 11:53:31 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
Maysam Yabandeh	ae05a83e19	Call ValidateOptions from SetOptions (#5368 ) Summary: Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368 Differential Revision: D15540101 Pulled By: maysamyabandeh fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4	2019-06-03 19:49:57 -07:00
Siying Dong	000b9ec217	Move some logging related files to logging/ (#5387 ) Summary: Many logging related source files are under util/. It will be more structured if they are together. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387 Differential Revision: D15579036 Pulled By: siying fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f	2019-05-31 17:23:59 -07:00
Vijay Nadimpalli	cae22c53fb	Make format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5395 Differential Revision: D15581698 Pulled By: vjnadimpalli fbshipit-source-id: f415972f16e784b1361714c202b97defcab46767	2019-05-31 15:24:43 -07:00
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	2019-05-31 11:57:01 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo ve them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Maysam Yabandeh	eab4f49a2c	WritePrepared: skip_concurrency_control option (#5330 ) Summary: This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330 Differential Revision: D15525932 Pulled By: maysamyabandeh fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b	2019-05-28 16:29:45 -07:00
Maysam Yabandeh	f5576c3317	WritePrepared: disableWAL in commit without prepare (#5327 ) Summary: When committing a transaction without prepare, WritePrepared simply writes the batch to db and add the commit entry to CommitCache. When two_write_queues=true, following the rule of committing only from 2nd write queue, the first write, writes the batch and the only thing the 2nd write does is to write the commit entry to CommitCache. Currently the write batch in 2nd write is set to an empty LogData entry, while the write to the WAL could simply be entirely disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5327 Differential Revision: D15424546 Pulled By: maysamyabandeh fbshipit-source-id: 3d9ea3922d5196984c584d62a3ed57e1f7ca7b9f	2019-05-28 14:21:52 -07:00
Maysam Yabandeh	5c0e304170	WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313 ) Summary: WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes. The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313 Differential Revision: D15379977 Pulled By: maysamyabandeh fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00	2019-05-20 07:49:20 -07:00
Maysam Yabandeh	c71f5bb9aa	Disable WriteUnPrepared stress tests (#5315 ) Summary: They are kind of flaky at the moment. Will re-enable it when flakiness is fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5315 Differential Revision: D15382744 Pulled By: maysamyabandeh fbshipit-source-id: 8b2f9d81a4bb34bfd51481727a682d5cd063c5e3	2019-05-16 15:39:33 -07:00
Maysam Yabandeh	f0e8216197	WritePrepared: Fix deadlock in WriteRecoverableState (#5306 ) Summary: The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306 Differential Revision: D15341458 Pulled By: maysamyabandeh fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323	2019-05-15 13:53:54 -07:00
Thomas Fersch	a42757607d	Use pre-increment instead of post-increment for iterators (#5296 ) Summary: Google C++ style guide indicates pre-increment should be used for iterators: https://google.github.io/styleguide/cppguide.html#Preincrement_and_Predecrement. Replaced all instances of ' it++' by ' ++it' (where type is iterator). So this covers the cases where iterators are named 'it'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5296 Differential Revision: D15301256 Pulled By: tfersch fbshipit-source-id: 2803483c1392504ad3b281d21db615429c71114b	2019-05-15 13:19:15 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
anand76	1c8cbf315f	Extend MultiGet batching to Transactions (#5210 ) Summary: MultiGet batching was implemented in #5011 in order to reduce CPU utilization when looking up multiple keys at once. This PR implements corresponding ```MultiGet``` and ```MultiGetSingleCFForUpdate``` in ```rocksdb::Transaction``` that call the underlying batching implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5210 Differential Revision: D15048164 Pulled By: anand1976 fbshipit-source-id: c52f6043102ab0cbc723f4cba2a7b7d1767f6f52	2019-04-23 14:11:26 -07:00
jsteemann	de76909464	refactor SavePoints (#5192 ) Summary: Savepoints are assumed to be used in a stack-wise fashion (only the top element should be used), so they were stored by `WriteBatch` in a member variable `save_points` using an std::stack. Conceptually this is fine, but the implementation had a few issues: - the `save_points_` instance variable was a plain pointer to a heap- allocated `SavePoints` struct. The destructor of `WriteBatch` simply deletes this pointer. However, the copy constructor of WriteBatch just copied that pointer, meaning that copying a WriteBatch with active savepoints will very likely have crashed before. Now a proper copy of the savepoints is made in the copy constructor, and not just a copy of the pointer - `save_points_` was an std::stack, which defaults to `std::deque` for the underlying container. A deque is a bit over the top here, as we only need access to the most recent savepoint (i.e. stack.top()) but never any elements at the front. std::deque is rather expensive to initialize in common environments. For example, the STL implementation shipped with GNU g++ will perform a heap allocation of more than 500 bytes to create an empty deque object. Although the `save_points_` container is created lazily by RocksDB, moving from a deque to a plain `std::vector` is much more memory-efficient. So `save_points_` is now a vector. - `save_points_` was changed from a plain pointer to an `std::unique_ptr`, making ownership more explicit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192 Differential Revision: D15024074 Pulled By: maysamyabandeh fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7	2019-04-19 20:33:04 -07:00
jsteemann	8295d364e2	Improve transaction lock details (#5193 ) Summary: This branch contains two small improvements: * Create `LockMap` entries using `std::make_shared`. This saves one heap allocation per LockMap entry but also locates the control block and the LockMap object closely together in memory, which can help with caching * Reorder the members of `TrackedTrxInfo`, so that the resulting struct uses less memory (at least on 64bit systems) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5193 Differential Revision: D14934536 Pulled By: maysamyabandeh fbshipit-source-id: f7b49812bb4b6029eef9d131e7cd56260df5b28e	2019-04-15 10:44:03 -07:00
Manuel Ung	d655a3aab7	Remove extraneous call to TrackKey (#5173 ) Summary: In `PessimisticTransaction::TryLock`, we were calling `TrackKey` even when assume_tracked=true, which defeats the purpose of assume_tracked. Remove this. For keys that are already tracked, TrackKey will actually bump some counters (num_reads/num_writes) which are consumed in `TransactionBaseImpl::GetTrackedKeysSinceSavePoint`, and this is used to determine which keys were tracked since the last savepoint. I believe this functionality should still work, since I think the user should not call GetForUpdate/Put(assume_tracked=true) across savepoints, and if they do, they should not expect the Put(assume_tracked=true) to show up as a tracked key in the second savepoint. This is another 2-3% cpu improvement. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5173 Differential Revision: D14883809 Pulled By: lth fbshipit-source-id: 7d09f0772da422384af0519773e310c22b0cbca3	2019-04-12 16:37:12 -07:00
Maysam Yabandeh	fe642cbee6	WritePrepared: fix race condition in reading batch with duplicate keys (#5147 ) Summary: When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete. The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache. A unit test is added to reproduce the race condition with duplicate keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147 Differential Revision: D14758815 Pulled By: maysamyabandeh fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719	2019-04-12 14:40:41 -07:00
Manuel Ung	ef0fc1b461	Reduce copies of LockInfo (#5172 ) Summary: The LockInfo struct is not easy to copy because it contains std::vector. Reduce copies by using move constructor and `unordered_map::emplace`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5172 Differential Revision: D14882053 Pulled By: lth fbshipit-source-id: 93999ec6ab1a5841fb5115abb764b6c1831a6de1	2019-04-10 15:58:58 -07:00
Siying Dong	0bb555630f	Consolidate hash function used for non-persistent data in a new function (#5155 ) Summary: Create new function NPHash64() and GetSliceNPHash64(), which are currently implemented using murmurhash. Replace the current direct call of murmurhash() to use the new functions if the hash results are not used in on-disk format. This will make it easier to try out or switch to alternative functions in the uses where data format compatibility doesn't need to be considered. This part shouldn't have any performance impact. Also, the sharded cache hash function is changed to the new format, because it falls into this categoery. It doesn't show visible performance impact in db_bench results. CPU showed by perf is increased from about 0.2% to 0.4% in an extreme benchmark setting (4KB blocks, no-compression, everything cached in block cache). We've known that the current hash function used, our own Hash() has serious hash quality problem. It can generate a lots of conflicts with similar input. In this use case, it means extra lock contention for reads from the same file. This slight CPU regression is worthy to me to counter the potential bad performance with hot keys. And hopefully this will get further improved in the future with a better hash function. cache_test's condition is relaxed a little bit to. The new hash is slightly more skewed in this use case, but I manually checked the data and see the hash results are still in a reasonable range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155 Differential Revision: D14834821 Pulled By: siying fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5	2019-04-08 13:32:06 -07:00
Maysam Yabandeh	7441a0ecba	WriteUnPrepared: fix ubsan complaint (#5148 ) Summary: Ubsna complains that in initialization of WriteUnpreparedTxnReadCallback the method of the child class is used before the parent class is constructed. The patch fixes that by making the aforementioned method static. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5148 Differential Revision: D14760098 Pulled By: maysamyabandeh fbshipit-source-id: cf19b7c1fdb5de0a54e62c1deebe09a0fa048ded	2019-04-03 15:51:30 -07:00
Maysam Yabandeh	5234fc1b70	Mark logs with prepare in PreReleaseCallback (#5121 ) Summary: In prepare phase of 2PC, the db promises to remember the prepared data, for possible future commits. To fulfill the promise the prepared data must be persisted in the WAL so that they could be recovered after a crash. The log that contains a prepare batch that is not committed yet, is marked so that it is not garbage collected before the transaction commits/rollbacks. The bug was that the write to the log file and the mark of the file was not atomic, and WAL gc could have happened before the WAL log is actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads would see the WAL write and WAL mark atomically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121 Differential Revision: D14665210 Pulled By: maysamyabandeh fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534	2019-04-02 15:17:47 -07:00
Maysam Yabandeh	14b3f683a1	WriteUnPrepared: less virtual in iterator callback (#5049 ) Summary: WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads. The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call. Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots. The following benchmark shows +0.26% higher throughput in seekrandom benchmark. Benchmark: ./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench ./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100 seekrandom [AVG 10 runs] : 20355 ops/sec; 225.2 MB/sec seekrandom [MEDIAN 10 runs] : 20425 ops/sec; 225.9 MB/sec ./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100 seekrandom [AVG 10 runs] : 20409 ops/sec; 225.8 MB/sec seekrandom [MEDIAN 10 runs] : 20487 ops/sec; 226.6 MB/sec Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049 Differential Revision: D14366459 Pulled By: maysamyabandeh fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef	2019-04-02 14:47:16 -07:00
Maysam Yabandeh	a703f16da9	WriteUnPrepared: Enable auto-compaction after max_evicted_seq_ init (#5128 ) Summary: Compaction would depend on max_evicted_seq_ value. The ::Initialize method should do that after max_evicted_seq_ is properly initialized. The patch also back ports #4853 from WritePrepared txn to WriteUnPrepared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5128 Differential Revision: D14686562 Pulled By: maysamyabandeh fbshipit-source-id: b2355025712a72676ac3b20a95258adcf4774490	2019-03-29 13:18:57 -07:00
Maysam Yabandeh	04d3ac4e63	Fix tsan compliant on AddPreparedBeforeMax (#5052 ) Summary: Add a mutex to the test to synchronize before accessing the shared txn object. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5052 Differential Revision: D14386861 Pulled By: maysamyabandeh fbshipit-source-id: 5b32e209840b210c35af53848dc77f489a76c95a	2019-03-08 09:39:00 -08:00
Maysam Yabandeh	04a2631dbe	WritePrepared: handle adding prepare before max_evicted_seq_ (#5025 ) Summary: The patch fixes an improbable race condition between AddPrepared from one write queue and AdvanceMaxEvictedSeq from another queue. In this scenario AddPrepared finds prepare_seq lower than max and adding to PrepareHeap as usual while AdvanceMaxEvictedSeq has finished checking PrepareHeap against the future max. Thus when AdvanceMaxEvictedSeq finishes off by updating the max_evicted_seq_, PrepareHeap ends up with a prepared_seq lower than it which breaks the PrepareHeap contract. The fix is that in AddPrepared we check against the future_max_evicted_seq_ instead, which is update before AdvanceMaxEvictedSeq acquire prepare_mutex_ and looks into PrepareHeap. A unit test added to test for the failure scenario. The code is also refactored a bit to remove the duplicate code between AdvanceMaxEvictedSeq and AddPrepared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5025 Differential Revision: D14249028 Pulled By: maysamyabandeh fbshipit-source-id: 072ea56663f40359662c05fafa6ac524417b0622	2019-03-07 07:41:15 -08:00
Maysam Yabandeh	703f1375c2	WritePrepared: Add rollback batch to PreparedHeap (#5026 ) Summary: The patch adds the sequence number of the rollback patch to the PrepareHeap when two_write_queues is enabled. Although the current behavior is still correct, the change simplifies reasoning about the code, by having all uncommitted batches registered with the PreparedHeap. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5026 Differential Revision: D14249401 Pulled By: maysamyabandeh fbshipit-source-id: 1e3424edee5cd14e56ee35931ad3c93ed997cd5a	2019-03-07 07:33:31 -08:00
Maysam Yabandeh	68a2f94d5d	WritePrepared: commit only from the 2nd queue (#5014 ) Summary: When two_write_queues is enabled we call ::AddPrepared only from the main queue, which writes to both WAL and memtable, and call ::AddCommitted from the 2nd queue, which writes only to WAL. This simplifies the logic by avoiding concurrency between AddPrepared and also between AddCommitted. The patch fixes one case that did not conform with the rule above. This would allow future refactoring. For example AdvaneMaxEvictedSeq, which is invoked by AddCommitted, can be simplified by assuming lack of concurrent calls to it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5014 Differential Revision: D14210493 Pulled By: maysamyabandeh fbshipit-source-id: 6db5ba372a294a568a14caa010576460917a4eab	2019-02-28 15:23:34 -08:00
Maysam Yabandeh	a661c0d208	WritePrepared: optimize read path by avoiding virtual (#5018 ) Summary: The read path includes a callback function, ReadCallback, which would eventually calls IsInSnapshot to figure if a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function call to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of majority of the cases. The patch moves those to a non-virtual function in the the parent class, ReadCallback, to lower the virtual callback cost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018 Differential Revision: D14226562 Pulled By: maysamyabandeh fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44	2019-02-26 16:56:19 -08:00
Maysam Yabandeh	cf98df34c1	Change random seed for txn stress tests on each run (#5004 ) Summary: Currently the transaction stress tests use thread id as the seed. Since the thread ids are likely to be the same across multiple runs, the seed is thus going to be the same. The patch includes time in calculating the seed to help covering a very different part of state space in each run of the stress tests. To be able to reproduce the bug in case the stress tests failed, it also prints out the time that was used to calculate the seed value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5004 Differential Revision: D14144356 Pulled By: maysamyabandeh fbshipit-source-id: 728ed522f550fc8b4f5f9f373259c05fe9a54556	2019-02-19 19:58:55 -08:00
Maysam Yabandeh	0f4244fe00	WritePrepared: Improve stress tests with slow threads (#4974 ) Summary: The transaction stress tests, stress a high concurrency scenario. In WritePrepared/WriteUnPrepared we need to also stress the scenarios where an inserting/reading transaction is very slow. This would stress the corner cases that the caching is not sufficient and other slower data structures are engaged. To emulate such cases we make use of slow inserter/verifier threads and also reduce the size of cache data structures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4974 Differential Revision: D14143070 Pulled By: maysamyabandeh fbshipit-source-id: 81eb674678faf9fae0f654cd60ebcc74e26aeee7	2019-02-19 16:56:49 -08:00
Maysam Yabandeh	bcdc8c8b19	WritePrepared: max_evicted_seq_ update during commit cache lookup (#4955 ) Summary: max_evicted_seq_ could be updated in the middle of the read in ::IsInSnapshot. The code to be correct in presence of this update would be complicated. The patch simplifies it by checking the value of max_evicted_seq_ before and after looking into commit_cache_ and retries in the unlucky case that it was changed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4955 Differential Revision: D13999556 Pulled By: maysamyabandeh fbshipit-source-id: 7a1bdfa95ea8b5d8d73ddff3263ed31d7297b39c	2019-02-19 16:14:08 -08:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Maysam Yabandeh	d6b9b3b884	Enhance transaction_test_util with delays (#4970 ) Summary: Enhance ::Insert and ::Verify test functions to add artificial delay between prepare and commit, and take snapshot and reads respectively. A future PR will make use of these to improve stress tests to test against long-running transactions as well as long-running backup jobs. Also randomly sets set_snapshot to false for inserters to skip setting the snapshot in the initialization phase and let the snapshot be taken later explicitly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4970 Differential Revision: D14031342 Pulled By: maysamyabandeh fbshipit-source-id: b52b453751f0b25b81b23c48892bc1d152464cab	2019-02-11 16:02:37 -08:00
Maysam Yabandeh	9144d1f186	WritePrepared: add private options to TransactionDBOptions (#4966 ) Summary: WritePreparedTransactionDB operates with more options which should not be configurable to avoid complicating it for the users. For testing purposes however we need to change the default value of this parameters. This patch makes these parameters private fields in TransactionDBOptions so that the existing ::Open API could use them seamlessly without however exposing them to the users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4966 Differential Revision: D14015986 Pulled By: maysamyabandeh fbshipit-source-id: 13037efa7dfdd6f73ec7a19414b66571e044c633	2019-02-11 14:44:02 -08:00
Maysam Yabandeh	10d14693ac	WritePrepared: fix ValidateSnapshot with long-running txn (#4961 ) Summary: ValidateSnapshot checks if another txn has committed a value to about-to-be-locked key since a particular snapshot. It applies an optimization of looking into only the memtable if snapshot seq is larger than the earliest seq in the memtables. With a long-running txn in WritePrepared, the prepared value might be flushed out to the disk and yet it commits after the snapshot, which breaks this optimization. The patch fixes that by disabling this optimization when the min_uncomitted seq at the time the snapshot was taken is lower than earliest seq in the memtables. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4961 Differential Revision: D14009947 Pulled By: maysamyabandeh fbshipit-source-id: 1d11679950326f7c4094b433e6b821b729f08850	2019-02-08 18:01:25 -08:00
Maysam Yabandeh	199fabc197	WritePrepared: non-atomic commit of delayed prepared (#4947 ) Summary: Commit of delayed prepared has two non-atomic steps: add to commit cache, remove from delayed_prepared_. Similarly in ::IsInSnapshot we read from commit cache first and then look into delayed_prepared_. Due to non-atomicity thus the reader might not find the prep_seq that is just committed neither in commit cache nor in delayed_prepared_. To fix that i) we check if there was any delayed prepared BEFORE looking into commit cache, ii) if there was, we complete the search steps to be these: i) commit cache, ii) delayed prepared, commit cache again. In this way if the first query to commit cache missed the commit, the 2nd will catch it. The cost of the redundant read from commit cache is paid only if delayed_prepared_ is nonempty which should be a very rare scenario. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4947 Differential Revision: D13952754 Pulled By: maysamyabandeh fbshipit-source-id: 8f47826b13f8ce154398d842028342423f4ca2b2	2019-02-06 08:48:06 -08:00
Maysam Yabandeh	dcb73e7735	WritePrepared: release snapshot equal to max (#4944 ) Summary: WritePrepared maintains a list of snapshots that are <= max_evicted_seq_. Based on this list, old_commit_map_ is updated if an evicted commit entry overlaps with such snapshot. Such lists are garbage collected when the release of snapshot is reported to WritePreparedTxnDB, which is the next time max_evicted_seq_ is updated and yet the snapshot is not found is the list returned from DB. This logic was broken since ReleaseSnapshotInternal was using "< max_evicted_seq_" to cleanup old_commit_map_, which would leave a snapshot uncleaned if it "= max_evicted_seq_". The patch fixes that and adds a unit test to check for the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4944 Differential Revision: D13945000 Pulled By: maysamyabandeh fbshipit-source-id: 0c904294f735911f52348a148bf1f945282fc17c	2019-02-04 12:57:23 -08:00
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
Yi Wu	b1ad6ebba8	WritePrepared: fix two versions in compaction see different status for released snapshots (#4890 ) Summary: Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios: Scenario 1: key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1. The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key. Scenario 2: key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here. The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890 Differential Revision: D13705538 Pulled By: maysamyabandeh fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4	2019-01-18 17:24:06 -08:00
Maysam Yabandeh	7fd9813b9f	WritePrepared: commit of delayed prepared entries (#4894 ) Summary: Here is the order of ops in a commit: 1) update commit cache 2) publish seq, 3) RemovePrepared. In case of a delayed prepared, there will be a gap between when the commit is visible to snapshots until delayed_prepared_ is cleaned up. To tell apart this case from a delayed uncommitted txn from, the commit entry of a delayed prepared is also stored in delayed_prepared_commits_, which is updated before publishing the commit. Also logic in GetSnapshotInternal that ensures that each new snapshot is always larger than max_evicted_seq_ is updated to check against the upcoming value of max_evicted_seq_ rather than its current one. This is because AdvanceMaxEvictedSeq gets the list of snapshots lower than the new max, before updating max_evicted_seq_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4894 Differential Revision: D13726988 Pulled By: maysamyabandeh fbshipit-source-id: 1e70d78061b50c944c9816bf4b6dac405ab4ccd3	2019-01-18 11:36:36 -08:00
tom wang	73ff15c07b	WritePrepared: fix typo in comments Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4891 Differential Revision: D13718016 Pulled By: miasantreble fbshipit-source-id: 90bd372cff453a1c2d104c1cf49731d5dd770c14	2019-01-17 12:36:36 -08:00
Yanqin Jin	dd9eca1c58	Remove unused variable to fix clang compilation err (#4893 ) Summary: as title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4893 Differential Revision: D13716733 Pulled By: riversand963 fbshipit-source-id: 6811d6a99fe2094d5344f854e8939f01238b2adb	2019-01-17 11:57:31 -08:00
Yi Wu	128f532858	WritePrepared: fix issue with snapshot released during compaction (#4858 ) Summary: Compaction iterator keep a copy of list of live snapshots at the beginning of compaction, and then query snapshot checker to verify if values of a sequence number is visible to these snapshots. However when the snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info with the snapshot and may report incorrect result, which lead to values being compacted out when it shouldn't. This patch conservatively keep the values if snapshot checker determines that the snapshots is released. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858 Differential Revision: D13617146 Pulled By: maysamyabandeh fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3	2019-01-16 09:55:32 -08:00
Yi Wu	5d4fddfa52	WritePrepared: Fix visible key compacted out by compaction (#4883 ) Summary: With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grab the list of snapshots at its beginning, and only consider those snapshots to dedup keys. Consider the case: ``` seq = 1: put "foo" = "bar" seq = 2: transaction T: delete "foo", prepare seq = 3: compaction start seq = 4: take snapshot S seq = 5: transaction T: commit. ... seq = N: compaction iterator reached key "foo". ``` When compaction start, the list of snapshot is empty. Compaction doesn't take snapshot S into account. When it reached "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible by any snapshot (which is wrong), and compact the value out. The fix is to explicitly take a snapshot before compaction grabbing the list of snapshots. Compaction will then has to keep keys visible to this snapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883 Differential Revision: D13668775 Pulled By: maysamyabandeh fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4	2019-01-15 21:34:38 -08:00
Maysam Yabandeh	cad99a6031	WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886 ) Summary: The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes: i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare. ii) When obtaining the snapshot if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increased the latest published seq by one above the max. To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results into eviction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886 Differential Revision: D13685270 Pulled By: maysamyabandeh fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0	2019-01-15 18:11:52 -08:00
Yi Wu	d50c10ed37	WritePrepared: Fix SmallestUnCommittedSeq() doesn't check delayed_prepared (#4867 ) Summary: When prepared_txns_ heap is empty, SmallestUnCommittedSeq() should check delayed_prepared_ set as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4867 Differential Revision: D13632134 Pulled By: maysamyabandeh fbshipit-source-id: b0423bb0a58dc95f1e636d5ed3f6e619df801fb7	2019-01-15 09:17:53 -08:00
Maysam Yabandeh	856ac24484	WritePrepared: fix race condition on GetSnapshotListFromDB (#4872 ) Summary: Fixes a typo that made mutex_ to remain unlocked when GetSnapshotListFromDB called from WritePreparedTxnDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4872 Differential Revision: D13640381 Pulled By: maysamyabandeh fbshipit-source-id: 50f6600568f9092b4b43115f6ebd96e6c7388ad7	2019-01-11 13:46:23 -08:00
Zhongyi Xie	6a4ec41fed	add assert to silence clang warning (#4871 ) Summary: currently clang analyze fails with the following warning: > utilities/transactions/write_prepared_transaction_test.cc:1451:5: warning: Forming reference to null pointer ASSERT_GT(wp_db->max_evicted_seq_, 0); // max after recovery Pull Request resolved: https://github.com/facebook/rocksdb/pull/4871 Differential Revision: D13638053 Pulled By: miasantreble fbshipit-source-id: b192b0c13c411c58defc9e280b34cdfcab3fa8e3	2019-01-11 12:17:34 -08:00
Maysam Yabandeh	f3a99e8a4d	WritePrepared: Report released snapshots in IsInSnapshot (#4856 ) Summary: Previously IsInSnapshot assumed that the snapshot is valid at the time that the function is called. However there are cases where that might not be valid. Example is background compactions where the compaction algorithm operates with a list of snapshots some of which might be released by the time they are being passed to IsInSnapshot. The patch make two changes to enable the caller to tell difference: i) any live snapshot below max is added to max_committed_seq_, which allows IsInSnapshot to confidently tell whether the passed snapshot is invalid if it below max, ii) extends IsInSnapshot API with a "released" variable that is set true when IsInSnapshot find no such snapshot below max and also find no other way to give a certain return value. In such cases the return value is true but the caller should also check the "released" boolean after the call. In short here is the changes in the API: i) If the snapshot is valid, no change is required. ii) If the snapshot might be invalid, a reference to "released" boolean must be passed to IsInSnapshot. ii-a) If snapshot is above max, IsInSnapshot can figure the return valid using the commit cache. ii-b) otherwise if snapshot is in old_commit_map_, IsInSnapshot can use that to tell if value was visible to the snapshot. ii-c) otherwise it sets "released" to true and returns true as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4856 Differential Revision: D13599847 Pulled By: maysamyabandeh fbshipit-source-id: 1752be28667f886a1efec8cae5714b9b7a8f1e0f	2019-01-08 14:47:29 -08:00
Maysam Yabandeh	cd227d74ba	WritePrepared: improve IsInSnapshotEmptyMapTest (#4853 ) Summary: IsInSnapshotEmptyMapTest tests that IsInSnapshot returns correct value for existing data after a recovery, where max is not zero and yet commit cache is empty. The existing test was preliminary which is improved in this patch. It also increases the db sequence after recovery so that there the snapshot immediately taken after recovery would have a sequence number different than that of max_evicted_seq. This simplifies the logic in IsInSnapshot by not having to consider the special case that an old snapshot might be equal to max_evicted_seq and yet not present in old_commit_map. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4853 Differential Revision: D13595223 Pulled By: maysamyabandeh fbshipit-source-id: 77c12ca8a3f61a47479a93bef2038ff502dc3322	2019-01-08 11:27:11 -08:00
Maysam Yabandeh	0ed98bf89e	WritePrepared: fix snapshot sequence in rollback (#4851 ) Summary: The rollback algorithm in WritePrepared transactions requires reading the values before the transaction start. Currently it uses the prepare_seq -1 as the snapshot sequence number for the read. This is not correct since the passed sequence number must be for a valid snapshot. The patch fixes it by passing kMaxSequenceNumber instead. This is fine since all the writes done by the aborted transaction will be skipped during the read anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4851 Differential Revision: D13592773 Pulled By: maysamyabandeh fbshipit-source-id: ff1bf92ea9909d4cccb173bdff49febc0e9eb7a2	2019-01-07 14:57:03 -08:00
Faustin Lammler	7d65bd5ce4	Fix spelling errors (#4827 ) Summary: Hi, Lintian, the Debian package checker complains about spelling error (spelling-error-in-binary). See https://salsa.debian.org/mariadb-team/mariadb-10.3/-/jobs/98380 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4827 Differential Revision: D13566362 Pulled By: riversand963 fbshipit-source-id: cd4e9212133c73b0591030de6cdedaa47575968d	2019-01-02 11:17:57 -08:00
DorianZheng	2670fe8c73	Get `CompactionJobInfo` from CompactFiles Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716 Differential Revision: D13207677 Pulled By: ajkr fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b	2018-12-13 14:21:24 -08:00
Maysam Yabandeh	b878f93c70	Extend Transaction::GetForUpdate with do_validate (#4680 ) Summary: Transaction::GetForUpdate is extended with a do_validate parameter with default value of true. If false it skips validating the snapshot (if there is any) before doing the read. After the read it also returns the latest value (expects the ReadOptions::snapshot to be nullptr). This allows RocksDB applications to use GetForUpdate similarly to how InnoDB does. Similarly ::Merge, ::Put, ::Delete, and ::SingleDelete are extended with assume_exclusive_tracked with default value of false. It true it indicates that call is assumed to be after a ::GetForUpdate(do_validate=false). The Java APIs are accordingly updated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4680 Differential Revision: D13068508 Pulled By: maysamyabandeh fbshipit-source-id: f0b59db28f7f6a078b60844d902057140765e67d	2018-12-06 17:49:00 -08:00
Zhongyi Xie	2f1ca4e838	Revert "BaseDeltaIterator: always check valid() before accessing key(… (#4744 ) Summary: …) (#4702)" This reverts commit `3a18bb3e15`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4744 Differential Revision: D13311869 Pulled By: miasantreble fbshipit-source-id: 6300b12cc34828d8b9274e907a3aef1506d5d553	2018-12-03 23:38:27 -08:00
Zhongyi Xie	3a18bb3e15	BaseDeltaIterator: always check valid() before accessing key() (#4702 ) Summary: Current implementation of `current_over_upper_bound_` fails to take into consideration that keys might be invalid in either base iterator or delta iterator. Calling key() in such scenario will lead to assertion failure and runtime errors. This PR addresses the bug by adding check for valid keys before calling `IsOverUpperBound()`, also added test coverage for iterate_upper_bound usage in BaseDeltaIterator Also recommit https://github.com/facebook/rocksdb/pull/4656 (It was reverted earlier due to bugs) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4702 Differential Revision: D13146643 Pulled By: miasantreble fbshipit-source-id: 6d136929da12d0f2e2a5cea474a8038ec5cdf1d0	2018-11-30 15:35:13 -08:00
Maysam Yabandeh	f1b0841f06	WritePrepared: followup fix for snapshot double release issue (#4734 ) Summary: The fix in #4727 for double snapshot release was incomplete since it does not properly remove the duplicate entires in the snapshot list after finding that a snapshot is still valid. The patch does that and also improves the unit test to show the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4734 Differential Revision: D13266260 Pulled By: maysamyabandeh fbshipit-source-id: 351e2c40cca45a87b757774c11af74182314911e	2018-11-29 21:01:57 -08:00
Maysam Yabandeh	1a5a93ff74	WritePrepared: Fix double snapshot release issue (#4727 ) Summary: Currently the garbage collection of items in old_commit_map_ was done upon ::ReleaseSnapshot. The assumption behind this method was that the sequence number of snapshots are unique, which is incorrect. In the very rare cases that two consecutive snapshot have the same sequence number this could lead the release of the first snapshot affect the old_commit_map_ that is necessary to service the reads of the second snapshot. The bug would be triggered only if i) two snapshot have the same seq, ii) both of them are very old (older than the last ~4m transactions), and iii) there is commit entry overlapping with the snapshot seq number. It is fixed by doing the cleanup of old_commit_map_ in UpdateSnapshot: the new list of snapshots are compared with the old one and the missing sequence numbers are concluded released. If two snapshots have the same seq number, after the release of one of them, the seq number still appears in the snapshot least and thus not cleaned up prematurely. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4727 Differential Revision: D13246495 Pulled By: maysamyabandeh fbshipit-source-id: 93b87a5042afd8060889df245526d3f5d29de9fe	2018-11-28 19:03:31 -08:00
Zhongyi Xie	a21cb22ee3	Revert "apply ReadOptions.iterate_upper_bound to transaction iterator… (#4705 ) Summary: … (#4656)" This reverts commit `b76398a82b`. Will add test coverage for iterate_upper_bound before re-commit b76398 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4705 Differential Revision: D13148592 Pulled By: miasantreble fbshipit-source-id: 4d1ce0bfd9f7a5359a7688bd780eb06a66f45b1f	2018-11-24 10:46:28 -08:00
Zhongyi Xie	b76398a82b	apply ReadOptions.iterate_upper_bound to transaction iterator (#4656 ) Summary: Currently transaction iterator does not apply `ReadOptions.iterate_upper_bound` when iterating. This PR attempts to fix the problem by having `BaseDeltaIterator` enforcing the upper bound check when iterator state is changed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4656 Differential Revision: D13039257 Pulled By: miasantreble fbshipit-source-id: 909eb9f6b4597a4d80418fb139f32ec82c6ec1d1	2018-11-13 15:44:15 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Siying Dong	566fc8b994	Black list some valgrind tests (#4642 ) Summary: valgrind tests with 1 thread run too long. To make it shorter, black list some long tests. These are already blacklisted in parallel valgrind tests, but they are not in non-parallel mode Pull Request resolved: https://github.com/facebook/rocksdb/pull/4642 Differential Revision: D12945237 Pulled By: siying fbshipit-source-id: 04cf977d435996480fe87aa09f14b17975b74f7d	2018-11-06 14:22:36 -08:00
Maysam Yabandeh	2b5b7bc795	WritePrepared: Fix bug in searching in non-cached snapshots (#4639 ) Summary: When evicting an entry form the commit_cache, it is verified against the list of old snapshots to see if it overlaps with any. The list of old snapshots is split into two lists: an efficient concurrent cache and an slow vector protected by a lock. The patch fixes a bug that would stop the search in the cache if it finds any and yet would not include the larger snapshots in the slower list. An extra info log entry is also removed. The condition to trigger that although very rare is still feasible and should not spam the LOG when that happens. Fixes #4621 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4639 Differential Revision: D12934989 Pulled By: maysamyabandeh fbshipit-source-id: 4e0fe8147ba292b554ae78e94c21c2ef31e03e2d	2018-11-05 23:03:50 -08:00
Simon Grätzer	ad21b1af52	Set WriteCommitted txn id to commit sequence number (#4565 ) Summary: SetId and GetId are the experimental API that so far being used in WritePrepared and WriteUnPrepared transactions, where the id is assigned at the prepare time. The patch extends the API to WriteCommitted transactions, by setting the id at commit time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4565 Differential Revision: D10557862 Pulled By: maysamyabandeh fbshipit-source-id: 2b27a140682b6185a4988fa88f8152628e0d67af	2018-10-24 12:21:38 -07:00
jsteemann	d1c0d3f358	Small issues (#4564 ) Summary: Couple of very minor improvements (typos in comments, full qualification of class name, reordering members of a struct to make it smaller) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4564 Differential Revision: D10510183 Pulled By: maysamyabandeh fbshipit-source-id: c7ddf9bfbf2db08cd31896c3fd93789d3fa68c8b	2018-10-23 10:35:57 -07:00
Maysam Yabandeh	3f5282268f	Skip concurrency control during recovery of pessimistic txn (#4346 ) Summary: TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346 Differential Revision: D9759149 Pulled By: maysamyabandeh fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b	2018-09-10 16:57:53 -07:00
cngzhnp	64324e329e	Support pragma once in all header files and cleanup some warnings (#4339 ) Summary: As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited. Besides this, try to fix some warnings about loss of data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339 Differential Revision: D9654990 Pulled By: ajkr fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848	2018-09-05 18:13:31 -07:00
Mikhail Antonov	927f274939	Avoiding write stall caused by manual flushes (#4297 ) Summary: Basically at the moment it seems it's possible to cause write stall by calling flush (either manually vis DB::Flush(), or from Backup Engine directly calling FlushMemTable() while background flush may be already happening. One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to separate method and call it from FlushMemTable. This is draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297 Differential Revision: D9420705 Pulled By: mikhail-antonov fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc	2018-08-29 12:12:55 -07:00
Yi Wu	4f12d49daf	Suppress clang analyzer error (#4299 ) Summary: Suppress multiple clang-analyzer error. All of them are clang false-positive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4299 Differential Revision: D9430740 Pulled By: yiwu-arbug fbshipit-source-id: fbdd575bdc214d124826d61d35a117995c509279	2018-08-21 16:43:05 -07:00
jsteemann	90f744941d	adds missing PopSavePoint method to Transaction (#4256 ) Summary: Transaction has had methods to deal with SavePoints already, but was missing the PopSavePoint method provided by WriteBatch and WriteBatchWithIndex. This PR adds PopSavePoint to Transaction as well. Having the method on Transaction-level too is useful for applications that repeatedly execute a sequence of operations that normally succeed, but infrequently need to get rolled back. Using SavePoints here is sensible, but as operations normally succeed the application may pile up a lot of useless SavePoints inside a Transaction, leading to slightly increased memory usage for managing the unneeded SavePoints. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4256 Differential Revision: D9326932 Pulled By: yiwu-arbug fbshipit-source-id: 53a0af18a6c7e87feff8a56f1f3eab9df7f371d6	2018-08-17 11:57:30 -07:00
Siying Dong	9c0c8f5ff6	GetAllKeyVersions() to take an extra argument of `max_num_ikeys`. (#4271 ) Summary: Right now, `ldb idump` may have memory out of control if there is a big range of tombstones. Add an option to cut maxinum number of keys in GetAllKeyVersions(), and push down --max_num_ikeys from ldb. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4271 Differential Revision: D9369149 Pulled By: siying fbshipit-source-id: 7cbb797b7d2fa16573495a7e84937456d3ff25bf	2018-08-16 15:57:08 -07:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
Manuel Ung	ea212e5316	WriteUnPrepared: Implement unprepared batches for transactions (#4104 ) Summary: This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch. Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`. A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104 Differential Revision: D8785717 Pulled By: lth fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91	2018-07-24 00:13:18 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Fosco Marotto	8527012bb6	Converted db/merge_test.cc to use gtest (#4114 ) Summary: Picked up a task to convert this to use the gtest framework. It can't be this simple, can it? It works, but should all the std::cout be removed? ``` [$] ~/git/rocksdb [gft !]: ./merge_test [==========] Running 2 tests from 1 test case. [----------] Global test environment set-up. [----------] 2 tests from MergeTest [ RUN ] MergeTest.MergeDbTest Test read-modify-write counters... a: 3 1 2 a: 3 b: 1225 3 Compaction started ... Compaction ended a: 3 b: 1225 Test merge-based counters... a: 3 1 2 a: 3 b: 1225 3 Test merge in memtable... a: 3 1 2 a: 3 b: 1225 3 Test Partial-Merge Test merge-operator not set after reopen [ OK ] MergeTest.MergeDbTest (93 ms) [ RUN ] MergeTest.MergeDbTtlTest Opening database with TTL Test read-modify-write counters... a: 3 1 2 a: 3 b: 1225 3 Compaction started ... Compaction ended a: 3 b: 1225 Test merge-based counters... a: 3 1 2 a: 3 b: 1225 3 Test merge in memtable... Opening database with TTL a: 3 1 2 a: 3 b: 1225 3 Test Partial-Merge Opening database with TTL Opening database with TTL Opening database with TTL Opening database with TTL Test merge-operator not set after reopen [ OK ] MergeTest.MergeDbTtlTest (97 ms) [----------] 2 tests from MergeTest (190 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test case ran. (190 ms total) [ PASSED ] 2 tests. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4114 Differential Revision: D8822886 Pulled By: gfosco fbshipit-source-id: c299d008e883c3bb911d2b357a2e9e4423f8e91a	2018-07-13 14:13:07 -07:00
Maysam Yabandeh	537a233941	Exclude StackableDB from transaction stress tests (#4132 ) Summary: The transactions are currently tested with and without using StackableDB. This is mostly to check that the code path is consistent with stackable db as well. Slow, stress tests however do not benefit from being run again with StackableDB. The patch excludes StackableDB from such tests. On a single core it reduced the runtime of transaction_test from 199s to 135s. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4132 Differential Revision: D8841655 Pulled By: maysamyabandeh fbshipit-source-id: 7b9aaba2673b542b195439dfb306cef26bd63b19	2018-07-13 13:59:11 -07:00
Manuel Ung	b9846370e9	WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078 ) Summary: This adds support for recovering WriteUnprepared transactions through the following changes: - The information in `RecoveredTransaction` is extended so that it can reference multiple batches. - `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not. - `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway. Commit/Rollback of live transactions is still unimplemented and will come later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078 Differential Revision: D8703382 Pulled By: lth fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a	2018-07-06 17:59:13 -07:00
Daniel Black	36fa49ceb5	transaction_test: -Wunused-variable with clang-7 (#4074 ) Summary: clang version 7.0.0- (trunk) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /usr/bin clang++-7 -DROCKSDB_USE_RTTI -g -W -Wextra -Wall -Wsign-compare -Wshadow -Wno-unused-parameter -Werror -I. -I./include -std=c++11 -DROCKSDB_PLATFORM_POSIX -DROCKSDB_LIB_IO_POSIX -DOS_LINUX -fno-builtin-memcmp -DROCKSDB_FALLOCATE_PRESENT -DSNAPPY -DGFLAGS=google -DZLIB -DBZIP2 -DROCKSDB_MALLOC_USABLE_SIZE -DROCKSDB_PTHREAD_ADAPTIVE_MUTEX -DROCKSDB_BACKTRACE -DROCKSDB_RANGESYNC_PRESENT -DROCKSDB_SCHED_GETCPU_PRESENT -Wshorten-64-to-32 -march=native -DHAVE_SSE42 -DROCKSDB_SUPPORT_THREAD_LOCAL -isystem ./third-party/gtest-1.7.0/fused-src -DTRAVIS -O2 -fno-omit-frame-pointer -momit-leaf-frame-pointer -Woverloaded-virtual -Wnon-virtual-dtor -Wno-missing-field-initializers -c utilities/transactions/transaction_test.cc -o utilities/transactions/transaction_test.o utilities/transactions/transaction_test.cc:2282:22: error: unused variable 'txn_options' [-Werror,-Wunused-variable] TransactionOptions txn_options; ^ utilities/transactions/transaction_test.cc:2822:22: error: unused variable 'txn_options' [-Werror,-Wunused-variable] TransactionOptions txn_options; ^ utilities/transactions/transaction_test.cc:2928:22: error: unused variable 'txn_options' [-Werror,-Wunused-variable] TransactionOptions txn_options; ^ utilities/transactions/transaction_test.cc:3109:22: error: unused variable 'txn_options' [-Werror,-Wunused-variable] TransactionOptions txn_options; ^ utilities/transactions/transaction_test.cc:4364:22: error: unused variable 'txn_options' [-Werror,-Wunused-variable] TransactionOptions txn_options; ^ Closes https://github.com/facebook/rocksdb/pull/4074 Differential Revision: D8698051 Pulled By: ajkr fbshipit-source-id: 6255618eefdd189962fbea1b02cf1eb5ae501274	2018-06-29 11:43:36 -07:00
Manuel Ung	8ad63a4b86	WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069 ) Summary: This adds a new WAL marker of type kTypeBeginUnprepareXID. Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy. Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not. Closes https://github.com/facebook/rocksdb/pull/4069 Differential Revision: D8675099 Pulled By: lth fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749	2018-06-28 18:58:29 -07:00
chouxi	818c84e116	Store timestamp in deadlock detection (#4060 ) Summary: - Summary Add timestamp into the DeadlockInfo to store the timestamp when deadlock detected on the rocksdb side. - Testplan: `make check -j64` Closes https://github.com/facebook/rocksdb/pull/4060 Differential Revision: D8655380 Pulled By: chouxi fbshipit-source-id: f58e1aa5e09eb1d1eed0a181d4e2304aaf01efe8	2018-06-27 12:27:58 -07:00
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. There are the places where reads had to be modified. - Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	2018-06-27 12:23:07 -07:00
Manuel Ung	ab2254bedf	Fix clang analyze Summary: This fixes the errors as reported here: https://github.com/facebook/rocksdb/pull/3941#issuecomment-394424043 Closes https://github.com/facebook/rocksdb/pull/3950 Differential Revision: D8263086 Pulled By: lth fbshipit-source-id: 5e148d489cab2153e5846d16979a0a1f2d677d57	2018-06-04 14:44:23 -07:00
Manuel Ung	01e3c30def	Extend existing unit tests to run with WriteUnprepared as well Summary: As titled. I have not extended the Compatibility tests because the new WAL markers are still unimplemented. Closes https://github.com/facebook/rocksdb/pull/3941 Differential Revision: D8238394 Pulled By: lth fbshipit-source-id: 980e3d44837bbf2cfa64047f9738f559dfac4b1d	2018-06-01 14:58:41 -07:00
Manuel Ung	aaac6cd16f	Add write unprepared classes by inheriting from write prepared Summary: Closes https://github.com/facebook/rocksdb/pull/3907 Differential Revision: D8218325 Pulled By: lth fbshipit-source-id: ff32d8dab4a159cd2762876cba4b15e3dc51ff3b	2018-05-31 10:47:42 -07:00
Siying Dong	4dd80debd0	Remove tests from ROCKSDB_VALGRIND_RUN Summary: In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind. Closes https://github.com/facebook/rocksdb/pull/3924 Differential Revision: D8210184 Pulled By: siying fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a	2018-05-30 16:15:16 -07:00
Maysam Yabandeh	66c7aa32fb	Clarify the ownership of root db after TransactionDB::Open Summary: The patch clarifies the ownership of the root db after TransactionDB::Open. If it is a success the ownership if with the TransactionDB, and the root db will be deleted when the destructor of the base class, StackableDB, is called. If it is failure, the temporarily created root db will also be deleted properly. The patch also includes lots of useful formatting changes. Closes https://github.com/facebook/rocksdb/pull/3714 upon which this patch is built. Closes https://github.com/facebook/rocksdb/pull/3806 Differential Revision: D7878010 Pulled By: maysamyabandeh fbshipit-source-id: f54f3942e29434143ae5a2423ceec9c7072cd4c2	2018-05-11 15:14:03 -07:00
Siying Dong	d59549298f	Skip deleted WALs during recovery Summary: This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction) This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2. Closes https://github.com/facebook/rocksdb/pull/3765 Differential Revision: D7747618 Pulled By: siying fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729	2018-05-03 15:43:09 -07:00
Maysam Yabandeh	cfb86659bf	WritePrepared Txn: enable rollback in stress test Summary: Rollback was disabled in stress test since there was a concurrency issue in WritePrepared rollback algorithm. The issue is fixed by caching the column family handles in WritePrepared to skip getting them from the db when needed for rollback. Tested by running transaction stress test under tsan. Closes https://github.com/facebook/rocksdb/pull/3785 Differential Revision: D7793727 Pulled By: maysamyabandeh fbshipit-source-id: d81ab6fda0e53186ca69944cfe0712ce4869451e	2018-05-02 18:13:05 -07:00
Maysam Yabandeh	5bed8a0065	WritePrepared Txn: split SeqAdvanceConcurrentTest Summary: The tsan flavor of SeqAdvanceConcurrentTest times out in our test infra. The patch splits it into 10 tests. On my vm before: [ OK ] WritePreparedTransactionTest/WritePreparedTransactionTest.SeqAdvanceConcurrentTest/0 (5194 ms) after: [ OK ] OneWriteQueue/SeqAdvanceConcurrentTest.SeqAdvanceConcurrentTest/0 (1906 ms) Closes https://github.com/facebook/rocksdb/pull/3799 Differential Revision: D7854515 Pulled By: maysamyabandeh fbshipit-source-id: 4fbac42a1f974326cbc237f8cb9d6232d379c431	2018-05-02 18:13:05 -07:00
Maysam Yabandeh	bb2a2ec731	WritePrepared Txn: rollback via commit Summary: Currently WritePrepared rolls back a transaction with prepare sequence number prepare_seq by i) write a single rollback batch with rollback_seq, ii) add <rollback_seq, rollback_seq> to commit cache, iii) remove prepare_seq from PrepareHeap. This is correct assuming that there is no snapshot taken when a transaction is rolled back. This is the case the way MySQL does rollback which is after recovery. Otherwise if max_evicted_seq advances the prepare_seq, the live snapshot might assume data as committed since it does not find them in CommitCache. The change is to simply add <prepare_seq. rollback_seq> to commit cache before removing prepare_seq from PrepareHeap. In this way if max_evicted_seq advances prpeare_seq, the existing mechanism that we have to check evicted entries against live snapshots will make sure that the live snapshot will not see the data of rolled back transaction. Closes https://github.com/facebook/rocksdb/pull/3745 Differential Revision: D7696193 Pulled By: maysamyabandeh fbshipit-source-id: c9a2d46341ddc03554dded1303520a1cab74ef9c	2018-04-20 15:28:19 -07:00
Maysam Yabandeh	c3d1e36cce	WritePrepared Txn: enable TryAgain for duplicates at the end of the batch Summary: The WriteBatch::Iterate will try with a larger sequence number if the memtable reports a duplicate. This status is specified with TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when WAL is created via WritePrepared since it always appends a batch separator if a natural one does not exist. However when reading a WAL generated by WriteCommitted this batch separator might not exist. Although WritePrepared is not supposed to be able to read the WAL generated by WriteCommitted we should avoid confusing scenarios in which the behavior becomes unpredictable. The path fixes that by allowing TryAgain even for the last entry of the write batch. Closes https://github.com/facebook/rocksdb/pull/3747 Differential Revision: D7708391 Pulled By: maysamyabandeh fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d	2018-04-20 15:28:19 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Maysam Yabandeh	d15397ba10	WritePrepared Txn: rollback_merge_operands hack Summary: This is a hack as temporary fix of MyRocks with rollbacking the merge operands. The way MyRocks uses merge operands is without protection of locks, which violates the assumption behind the rollback algorithm. They are ok with not being rolled back as it would just create a gap in the autoincrement column. The hack add an option to disable the rollback of merge operands by default and only enables it to let the unit test pass. Closes https://github.com/facebook/rocksdb/pull/3711 Differential Revision: D7597177 Pulled By: maysamyabandeh fbshipit-source-id: 544be0f666c7e7abb7f651ec8b23124e05056728	2018-04-12 11:58:11 -07:00
Maysam Yabandeh	6f5e6445d9	WritePrepared Txn: fix smallest_prep atomicity issue Summary: We introduced smallest_prep optimization in this commit `b225de7e10`, which enables storing the smallest uncommitted sequence number along with the snapshot. This enables the readers that read from the snapshot to skip further checks and safely assumed the data is committed if its sequence number is less than smallest uncommitted when the snapshot was taken. The problem was that smallest uncommitted and the snapshot must be taken atomically, and the lack of atomicity had led to readers using a smallest uncommitted after the snapshot was taken and hence mistakenly skipping some data. This patch fixes the problem by i) separating the process of removing of prepare entries from the AddCommitted function, ii) removing the prepare entires AFTER the committed sequence number is published, iii) getting smallest uncommitted (from the prepare list) BEFORE taking a snapshot. This guarantees that the smallest uncommitted that is accompanied with a snapshot is less than or equal of such number if it was obtained atomically. Tested by running MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest that was failing sporadically. Closes https://github.com/facebook/rocksdb/pull/3703 Differential Revision: D7581934 Pulled By: maysamyabandeh fbshipit-source-id: dc9d6f4fb477eba75d4d5927326905b548a96a32	2018-04-11 20:11:51 -07:00
Maysam Yabandeh	bde1c1a72a	WritePrepared Txn: add stats Summary: Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex. Closes https://github.com/facebook/rocksdb/pull/3683 Differential Revision: D7529393 Pulled By: maysamyabandeh fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9	2018-04-07 21:56:42 -07:00
Dmitri Smirnov	2a62ca1750	Make Optimistic Tx database stackable Summary: This change models Optimistic Tx db after Pessimistic TX db. The motivation for this change is to make the ptr polymorphic so it can be held by the same raw or smart ptr. Currently, due to the inheritance of the Opt Tx db not being rooted in the manner of Pess Tx from a single DB root it is more difficult to write clean code and have clear ownership of the database in cases when options dictate instantiate of plan DB, Pess Tx DB or Opt tx db. Closes https://github.com/facebook/rocksdb/pull/3566 Differential Revision: D7184502 Pulled By: yiwu-arbug fbshipit-source-id: 31d06efafd79497bb0c230e971857dba3bd962c3	2018-04-03 15:28:40 -07:00
Maysam Yabandeh	b225de7e10	WritePrepared Txn: smallest_prepare optimization Summary: The is an optimization to reduce lookup in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time that the snapshot was taken and if the sequence number of the read data is lower than that number it assumes the data as committed. To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step. Closes https://github.com/facebook/rocksdb/pull/3649 Differential Revision: D7388630 Pulled By: maysamyabandeh fbshipit-source-id: b79506238c17467d590763582960d4d90181c600	2018-04-02 20:27:41 -07:00
Maysam Yabandeh	0377ff9dea	WritePrepared Txn: make recoverable state visible after flush Summary: Currently if the CommitTimeWriteBatch is set to be used only as a state that is required only for recovery , the user cannot see that in DB until it is restarted. This while the state is already inserted into the DB after the memtable flush. It would be useful for debugging if make this state visible to the user after the flush by committing it. The patch does it by a invoking a callback that does the commit on the recoverable state. Closes https://github.com/facebook/rocksdb/pull/3661 Differential Revision: D7424577 Pulled By: maysamyabandeh fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c	2018-03-28 12:12:08 -07:00
Maysam Yabandeh	0999e9b79a	WritePrepared Txn: Increase commit cache size to 2^23 Summary: Current commit cache size is 2^21. This was due to a type. With 2^23 commit entries we can have transactions as long as 64s without incurring the cost of having them evicted from the commit cache before their commit. Here is the math: 2^23 / 2 (one out of two seq numbers are for commit) / 2^16 TPS = 2^6 = 64s Closes https://github.com/facebook/rocksdb/pull/3657 Differential Revision: D7411211 Pulled By: maysamyabandeh fbshipit-source-id: e7cacf40579f3acf940643d8a1cfe5dd201caa35	2018-03-26 19:45:17 -07:00
Maysam Yabandeh	3e417a6607	WritePrepared Txn: AddPrepared for all sub-batches Summary: Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches. Closes https://github.com/facebook/rocksdb/pull/3651 Differential Revision: D7388635 Pulled By: maysamyabandeh fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4	2018-03-23 17:30:04 -07:00
Maysam Yabandeh	7429b20e39	WritePrepared Txn: fix race condition on publishing seq Summary: This commit fixes a race condition on calling SetLastPublishedSequence. The function must be called only from the 2nd write queue when two_write_queues is enabled. However there was a bug that would also call it from the main write queue if CommitTimeWriteBatch is provided to the commit request and yet use_only_the_last_commit_time_batch_for_recovery optimization is not enabled. To fix that we penalize the commit request in such cases by doing an additional write solely to publish the seq number from the 2nd queue. Closes https://github.com/facebook/rocksdb/pull/3641 Differential Revision: D7361508 Pulled By: maysamyabandeh fbshipit-source-id: bf8f7a27e5cccf5425dccbce25eb0032e8e5a4d7	2018-03-22 14:43:36 -07:00
Bruce Mitchener	a3a3f5497c	Fix some typos in comments and docs. Summary: Closes https://github.com/facebook/rocksdb/pull/3568 Differential Revision: D7170953 Pulled By: siying fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c	2018-03-08 10:27:25 -08:00
Dmitri Smirnov	c364eb42b5	Windows cumulative patch Summary: This patch addressed several issues. Portability including db_test std::thread -> port::Thread Cc: @ and %z to ROCKSDB portable macro. Cc: maysamyabandeh Implement Env::AreFilesSame Make the implementation of file unique number more robust Get rid of C-runtime and go directly to Windows API when dealing with file primitives. Implement GetSectorSize() and aling unbuffered read on the value if available. Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976 Fix test running script issue where $status var was of incorrect scope so the failures were swallowed and not reported. DestroyDB() creates a logger and opens a LOG file in the directory being cleaned up. This holds a lock on the folder and the cleanup is prevented. This fails one of the checkpoin tests. We observe the same in production. We close the log file in this change. Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test attempts to open a directory with NewRandomAccessFile which does not work on Windows. Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug Closes https://github.com/facebook/rocksdb/pull/3552 Differential Revision: D7156304 Pulled By: siying fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b	2018-03-06 11:57:43 -08:00
Maysam Yabandeh	62277e15c3	WritePrepared Txn: Move DuplicateDetector to util Summary: Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test. Closes https://github.com/facebook/rocksdb/pull/3567 Differential Revision: D7163268 Pulled By: maysamyabandeh fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778	2018-03-05 23:57:12 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Maysam Yabandeh	680864ae54	WritePrepared Txn: Fix bug with duplicate keys during recovery Summary: Fix the following bugs: - During recovery a duplicate key was inserted twice into the write batch of the recovery transaction, once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing. - If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number. - SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected. Closes https://github.com/facebook/rocksdb/pull/3562 Differential Revision: D7149440 Pulled By: maysamyabandeh fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77	2018-03-05 10:57:59 -08:00
Maysam Yabandeh	d060421c77	Fix a leak in prepared_section_completed_ Summary: The zeroed entries were not removed from prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes that by refactoring the code. The new code is more efficient since i) it uses two separate mutex to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining uniq log entires with prepare which avoids a very large heap with many duplicate entries. Closes https://github.com/facebook/rocksdb/pull/3545 Differential Revision: D7106071 Pulled By: maysamyabandeh fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48	2018-03-01 20:41:56 -08:00
Maysam Yabandeh	90eca1e616	WritePrepared Txn: optimize SubBatchCnt Summary: Make use of the index in WriteBatchWithIndex to also count the number of sub-batches. This eliminates the need to separately scan the batch to count the number of sub-batches once a duplicate key is detected. Closes https://github.com/facebook/rocksdb/pull/3529 Differential Revision: D7049947 Pulled By: maysamyabandeh fbshipit-source-id: 81cbf12c4e662541c772c7265a8f91631e25c7cd	2018-02-22 18:12:26 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Maysam Yabandeh	828211e901	WritePrepared Txn: fix non-emptied PreparedHeap bug Summary: Under a certain sequence of accessing PreparedHeap, there was a bug that would not successfully empty the heap. This would result in performance issues when the heap content is moved to old_prepared_ after max_evicted_seq_ advances the orphan prepared sequence numbers. The patch fixed the bug and add more unit tests. It also does more logging when the unlikely scenarios are faced Closes https://github.com/facebook/rocksdb/pull/3526 Differential Revision: D7038486 Pulled By: maysamyabandeh fbshipit-source-id: f1e40bea558f67b03d2a29131fcb8734c65fce97	2018-02-21 13:42:23 -08:00
jsteemann	e9c31ab159	save redundant key lookup in map of locked keys Summary: In case it is found that a key is already marked as locked in a stripe's map of locked keys, it is not necessary to look it up again using `std::unordered_map<std::string, ...>::at(size_t)`. Instead, we can use the already found position using the iterator produced by the previous `find` operation. Reusing the iterator will avoid having to hash the key again and do additional "random" memory lookups in the map of keys (though the data will very likely sit available in caches here already due to the previous find operation) Closes https://github.com/facebook/rocksdb/pull/3505 Differential Revision: D7036446 Pulled By: sagar0 fbshipit-source-id: cced51547b2bd2d49394f6bc8c5896f09fa80f68	2018-02-20 17:44:44 -08:00
Maysam Yabandeh	c178da053b	WritePrepared Txn: optimizations for sysbench update_noindex Summary: These are optimization that we applied to improve sysbech's update_noindex performance. 1. Make use of LIKELY compiler hint 2. Move std::atomic so the subclass 3. Make use of skip_prepared in non-2pc transactions. Closes https://github.com/facebook/rocksdb/pull/3512 Differential Revision: D7000075 Pulled By: maysamyabandeh fbshipit-source-id: 1ab8292584df1f6305a4992973fb1b7933632181	2018-02-16 08:42:31 -08:00
jsteemann	4e7a182d09	Several small "fixes" Summary: - removed a few unneeded variables - fused some variable declarations and their assignments - fixed right-trimming code in string_util.cc to not underflow - simplifed an assertion - move non-nullptr check assertion before dereferencing of that pointer - pass an std::string function parameter by const reference instead of by value (avoiding potential copy) Closes https://github.com/facebook/rocksdb/pull/3507 Differential Revision: D7004679 Pulled By: sagar0 fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4	2018-02-15 16:57:37 -08:00
Maysam Yabandeh	8a04ee4fd1	WritePrepared Txn: use TransactionDBWriteOptimizations (2nd attempt) Summary: TransactionDB::Write can receive some optimization hints from the user. One is to skip the concurrency control mechanism. WritePreparedTxnDB is currently ignoring such hints. This patch optimizes WritePreparedTxnDB::Write for skip_concurrency_control and skip_duplicate_key_check hints. Closes https://github.com/facebook/rocksdb/pull/3496 Differential Revision: D6971784 Pulled By: maysamyabandeh fbshipit-source-id: cbab10ad538fa2b8bcb47e37c77724afe6e30f03	2018-02-12 16:43:40 -08:00
Siying Dong	a0931b3185	Fix UBSAN Error in WritePreparedTransactionTest Summary: WritePreparedTransactionTest has the UBSAN error because the wrong order of its parent class construction. Fix it. Closes https://github.com/facebook/rocksdb/pull/3478 Differential Revision: D6928975 Pulled By: siying fbshipit-source-id: 13edfd5cb9cf73f1ac5ae3b6f53061d32783733d	2018-02-07 14:57:35 -08:00
Maysam Yabandeh	8feee28020	Add skip_cc option to TransactionDB::Write Summary: Compared to DB::Write, TransactionDB::Write has the additional overhead of creating and initializing an internal transaction object, as well as the overhead of locking/unlocking the keys. This patch extends the TransactionDB::Write with an skip_cc option to allow the users to indicate that the write batch do not conflict with others and the concurrency control and its overhead can be skipped. TransactionDB::Write by default calls DB::Write when skip_cc is set, which works for WriteCommitted WritePolicy. Any other flavor of TransactionDB that is not compatible with this default behavior (such as WritePreparedTxnDB) can extend ::Write and implement their own approach for taking into account the skip_cc optimization. Closes https://github.com/facebook/rocksdb/pull/3457 Differential Revision: D6877318 Pulled By: maysamyabandeh fbshipit-source-id: 56f4e21db87ff71492db4e376fb7c2b03dfeab6b	2018-02-06 15:28:24 -08:00
Maysam Yabandeh	8f8eb4f1c0	Fix leak report by asan on DuplicateKeys test Summary: Deletes the transaction object at the end of the test. Verified by: - COMPILE_WITH_ASAN=1 make -j32 transaction_test - ./transaction_test --gtest_filter="DBA*Duplicate" Closes https://github.com/facebook/rocksdb/pull/3470 Differential Revision: D6916473 Pulled By: maysamyabandeh fbshipit-source-id: 8303df25408635d5d3ac2b25f309a3d15957c937	2018-02-06 14:26:35 -08:00
Maysam Yabandeh	88d8b2a2f5	WritePrepared Txn: Duplicate Keys, Txn Part Summary: This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again. Closes https://github.com/facebook/rocksdb/pull/3455 Differential Revision: D6873699 Pulled By: maysamyabandeh fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d	2018-02-05 18:43:24 -08:00

1 2 3 4 5 ...

393 Commits