rocksdb

Author	SHA1	Message	Date
Manuel Ung	6f0f82de87	WriteUnPrepared: increase test coverage in transaction_test (#5658 ) Summary: The changes transaction_test to set `txn_db_options.default_write_batch_flush_threshold = 1` in order to give better test coverage for WriteUnprepared. As part of the change, some tests had to be updated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5658 Differential Revision: D16740468 Pulled By: lth fbshipit-source-id: 3821eec20baf13917c8c1fab444332f75a509de9	2019-08-12 12:16:04 -07:00
Maysam Yabandeh	12eaacb71d	WritePrepared: Fix SmallestUnCommittedSeq bug (#5683 ) Summary: SmallestUnCommittedSeq reads two data structures, prepared_txns_ and delayed_prepared_. These two are updated in CheckPreparedAgainstMax when max_evicted_seq_ advances some prepared entires. To avoid the cost of acquiring a mutex, the read from them in SmallestUnCommittedSeq is not atomic. This creates a potential race condition. The fix is to read the two data structures in the reverse order of their update. CheckPreparedAgainstMax copies the prepared entry to delayed_prepared_ before removing it from prepared_txns_ and SmallestUnCommittedSeq looks into prepared_txns_ before reading delayed_prepared_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5683 Differential Revision: D16744699 Pulled By: maysamyabandeh fbshipit-source-id: b1bdb134018beb0b9de58827f512662bea35cad0	2019-08-09 16:40:00 -07:00
Maysam Yabandeh	208556ee13	WritePrepared: fix Get without snapshot (#5664 ) Summary: if read_options.snapshot is not set, ::Get will take the last sequence number after taking a super-version and uses that as the sequence number. Theoretically max_eviceted_seq_ could advance this sequence number. This could lead ::IsInSnapshot that will be invoked by the ReadCallback to notice the absence of the snapshot. In this case, the ReadCallback should have passed a non-value to snap_released so that it could be set by the ::IsInSnapshot. The patch does that, and adds a unit test to verify it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5664 Differential Revision: D16614033 Pulled By: maysamyabandeh fbshipit-source-id: 06fb3fd4aacd75806ed1a1acec7961f5d02486f2	2019-08-05 13:41:21 -07:00
Maysam Yabandeh	60f3ec2ca5	Fix appveyor compliant about passing const to thread (#5447 ) Summary: CLANG would complain if we pass const to lambda function and appveyor complains if we don't (https://github.com/facebook/rocksdb/pull/5443). The patch fixes that by using the default capture mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5447 Differential Revision: D15788722 Pulled By: maysamyabandeh fbshipit-source-id: 47e7f49264afe31fdafe42cb8bf93da126abfca9	2019-06-12 15:06:22 -07:00
Maysam Yabandeh	4a285d0dd3	Remove passing const variable to thread (#5443 ) Summary: CLANG complains that passing const to thread is not necessary. The patch removes it form PreparedHeap::Concurrent test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5443 Differential Revision: D15781598 Pulled By: maysamyabandeh fbshipit-source-id: 3aceb05d96182fa4726d6d37eed45fd3aac4c016	2019-06-12 09:45:57 -07:00
Maysam Yabandeh	773f914a40	WritePrepared: switch PreparedHeap from priority_queue to deque (#5436 ) Summary: Internally PreparedHeap is currently using a priority_queue. The rationale was the in the initial design PreparedHeap::AddPrepared could be called in arbitrary order. With the recent optimizations, we call ::AddPrepared only from the main write queue, which results into in-order insertion into PreparedHeap. The patch thus replaces the underlying priority_queue with a more efficient deque implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5436 Differential Revision: D15752147 Pulled By: maysamyabandeh fbshipit-source-id: e6960f2b2097e13137dded1ceeff3b10b03b0aeb	2019-06-11 19:55:14 -07:00
sdong	58c4aee42e	TransactionUtil::CheckKey() to skip unnecessary history (#4941 ) Summary: If a memtable definitely covers a key, there isn't a need to check older memtables. We can skip them by checking the earliest sequence number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941 Differential Revision: D13932666 fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b	2019-06-11 11:46:42 -07:00
Maysam Yabandeh	c292dc8540	WritePrepared: reduce prepared_mutex_ overhead (#5420 ) Summary: The patch reduces the contention over prepared_mutex_ using these techniques: 1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues. 2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop() 3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420 Differential Revision: D15741985 Pulled By: maysamyabandeh fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735	2019-06-10 11:53:31 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	2019-05-31 11:57:01 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo ve them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Maysam Yabandeh	f5576c3317	WritePrepared: disableWAL in commit without prepare (#5327 ) Summary: When committing a transaction without prepare, WritePrepared simply writes the batch to db and add the commit entry to CommitCache. When two_write_queues=true, following the rule of committing only from 2nd write queue, the first write, writes the batch and the only thing the 2nd write does is to write the commit entry to CommitCache. Currently the write batch in 2nd write is set to an empty LogData entry, while the write to the WAL could simply be entirely disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5327 Differential Revision: D15424546 Pulled By: maysamyabandeh fbshipit-source-id: 3d9ea3922d5196984c584d62a3ed57e1f7ca7b9f	2019-05-28 14:21:52 -07:00
Maysam Yabandeh	5c0e304170	WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313 ) Summary: WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes. The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313 Differential Revision: D15379977 Pulled By: maysamyabandeh fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00	2019-05-20 07:49:20 -07:00
Thomas Fersch	a42757607d	Use pre-increment instead of post-increment for iterators (#5296 ) Summary: Google C++ style guide indicates pre-increment should be used for iterators: https://google.github.io/styleguide/cppguide.html#Preincrement_and_Predecrement. Replaced all instances of ' it++' by ' ++it' (where type is iterator). So this covers the cases where iterators are named 'it'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5296 Differential Revision: D15301256 Pulled By: tfersch fbshipit-source-id: 2803483c1392504ad3b281d21db615429c71114b	2019-05-15 13:19:15 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
Maysam Yabandeh	fe642cbee6	WritePrepared: fix race condition in reading batch with duplicate keys (#5147 ) Summary: When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete. The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache. A unit test is added to reproduce the race condition with duplicate keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147 Differential Revision: D14758815 Pulled By: maysamyabandeh fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719	2019-04-12 14:40:41 -07:00
Maysam Yabandeh	14b3f683a1	WriteUnPrepared: less virtual in iterator callback (#5049 ) Summary: WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads. The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call. Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots. The following benchmark shows +0.26% higher throughput in seekrandom benchmark. Benchmark: ./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench ./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100 seekrandom [AVG 10 runs] : 20355 ops/sec; 225.2 MB/sec seekrandom [MEDIAN 10 runs] : 20425 ops/sec; 225.9 MB/sec ./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100 seekrandom [AVG 10 runs] : 20409 ops/sec; 225.8 MB/sec seekrandom [MEDIAN 10 runs] : 20487 ops/sec; 226.6 MB/sec Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049 Differential Revision: D14366459 Pulled By: maysamyabandeh fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef	2019-04-02 14:47:16 -07:00
Maysam Yabandeh	04d3ac4e63	Fix tsan compliant on AddPreparedBeforeMax (#5052 ) Summary: Add a mutex to the test to synchronize before accessing the shared txn object. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5052 Differential Revision: D14386861 Pulled By: maysamyabandeh fbshipit-source-id: 5b32e209840b210c35af53848dc77f489a76c95a	2019-03-08 09:39:00 -08:00
Maysam Yabandeh	04a2631dbe	WritePrepared: handle adding prepare before max_evicted_seq_ (#5025 ) Summary: The patch fixes an improbable race condition between AddPrepared from one write queue and AdvanceMaxEvictedSeq from another queue. In this scenario AddPrepared finds prepare_seq lower than max and adding to PrepareHeap as usual while AdvanceMaxEvictedSeq has finished checking PrepareHeap against the future max. Thus when AdvanceMaxEvictedSeq finishes off by updating the max_evicted_seq_, PrepareHeap ends up with a prepared_seq lower than it which breaks the PrepareHeap contract. The fix is that in AddPrepared we check against the future_max_evicted_seq_ instead, which is update before AdvanceMaxEvictedSeq acquire prepare_mutex_ and looks into PrepareHeap. A unit test added to test for the failure scenario. The code is also refactored a bit to remove the duplicate code between AdvanceMaxEvictedSeq and AddPrepared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5025 Differential Revision: D14249028 Pulled By: maysamyabandeh fbshipit-source-id: 072ea56663f40359662c05fafa6ac524417b0622	2019-03-07 07:41:15 -08:00
Maysam Yabandeh	703f1375c2	WritePrepared: Add rollback batch to PreparedHeap (#5026 ) Summary: The patch adds the sequence number of the rollback patch to the PrepareHeap when two_write_queues is enabled. Although the current behavior is still correct, the change simplifies reasoning about the code, by having all uncommitted batches registered with the PreparedHeap. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5026 Differential Revision: D14249401 Pulled By: maysamyabandeh fbshipit-source-id: 1e3424edee5cd14e56ee35931ad3c93ed997cd5a	2019-03-07 07:33:31 -08:00
Maysam Yabandeh	bcdc8c8b19	WritePrepared: max_evicted_seq_ update during commit cache lookup (#4955 ) Summary: max_evicted_seq_ could be updated in the middle of the read in ::IsInSnapshot. The code to be correct in presence of this update would be complicated. The patch simplifies it by checking the value of max_evicted_seq_ before and after looking into commit_cache_ and retries in the unlucky case that it was changed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4955 Differential Revision: D13999556 Pulled By: maysamyabandeh fbshipit-source-id: 7a1bdfa95ea8b5d8d73ddff3263ed31d7297b39c	2019-02-19 16:14:08 -08:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Maysam Yabandeh	9144d1f186	WritePrepared: add private options to TransactionDBOptions (#4966 ) Summary: WritePreparedTransactionDB operates with more options which should not be configurable to avoid complicating it for the users. For testing purposes however we need to change the default value of this parameters. This patch makes these parameters private fields in TransactionDBOptions so that the existing ::Open API could use them seamlessly without however exposing them to the users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4966 Differential Revision: D14015986 Pulled By: maysamyabandeh fbshipit-source-id: 13037efa7dfdd6f73ec7a19414b66571e044c633	2019-02-11 14:44:02 -08:00
Maysam Yabandeh	10d14693ac	WritePrepared: fix ValidateSnapshot with long-running txn (#4961 ) Summary: ValidateSnapshot checks if another txn has committed a value to about-to-be-locked key since a particular snapshot. It applies an optimization of looking into only the memtable if snapshot seq is larger than the earliest seq in the memtables. With a long-running txn in WritePrepared, the prepared value might be flushed out to the disk and yet it commits after the snapshot, which breaks this optimization. The patch fixes that by disabling this optimization when the min_uncomitted seq at the time the snapshot was taken is lower than earliest seq in the memtables. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4961 Differential Revision: D14009947 Pulled By: maysamyabandeh fbshipit-source-id: 1d11679950326f7c4094b433e6b821b729f08850	2019-02-08 18:01:25 -08:00
Maysam Yabandeh	199fabc197	WritePrepared: non-atomic commit of delayed prepared (#4947 ) Summary: Commit of delayed prepared has two non-atomic steps: add to commit cache, remove from delayed_prepared_. Similarly in ::IsInSnapshot we read from commit cache first and then look into delayed_prepared_. Due to non-atomicity thus the reader might not find the prep_seq that is just committed neither in commit cache nor in delayed_prepared_. To fix that i) we check if there was any delayed prepared BEFORE looking into commit cache, ii) if there was, we complete the search steps to be these: i) commit cache, ii) delayed prepared, commit cache again. In this way if the first query to commit cache missed the commit, the 2nd will catch it. The cost of the redundant read from commit cache is paid only if delayed_prepared_ is nonempty which should be a very rare scenario. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4947 Differential Revision: D13952754 Pulled By: maysamyabandeh fbshipit-source-id: 8f47826b13f8ce154398d842028342423f4ca2b2	2019-02-06 08:48:06 -08:00
Maysam Yabandeh	dcb73e7735	WritePrepared: release snapshot equal to max (#4944 ) Summary: WritePrepared maintains a list of snapshots that are <= max_evicted_seq_. Based on this list, old_commit_map_ is updated if an evicted commit entry overlaps with such snapshot. Such lists are garbage collected when the release of snapshot is reported to WritePreparedTxnDB, which is the next time max_evicted_seq_ is updated and yet the snapshot is not found is the list returned from DB. This logic was broken since ReleaseSnapshotInternal was using "< max_evicted_seq_" to cleanup old_commit_map_, which would leave a snapshot uncleaned if it "= max_evicted_seq_". The patch fixes that and adds a unit test to check for the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4944 Differential Revision: D13945000 Pulled By: maysamyabandeh fbshipit-source-id: 0c904294f735911f52348a148bf1f945282fc17c	2019-02-04 12:57:23 -08:00
Yi Wu	b1ad6ebba8	WritePrepared: fix two versions in compaction see different status for released snapshots (#4890 ) Summary: Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios: Scenario 1: key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1. The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key. Scenario 2: key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here. The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890 Differential Revision: D13705538 Pulled By: maysamyabandeh fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4	2019-01-18 17:24:06 -08:00
Maysam Yabandeh	7fd9813b9f	WritePrepared: commit of delayed prepared entries (#4894 ) Summary: Here is the order of ops in a commit: 1) update commit cache 2) publish seq, 3) RemovePrepared. In case of a delayed prepared, there will be a gap between when the commit is visible to snapshots until delayed_prepared_ is cleaned up. To tell apart this case from a delayed uncommitted txn from, the commit entry of a delayed prepared is also stored in delayed_prepared_commits_, which is updated before publishing the commit. Also logic in GetSnapshotInternal that ensures that each new snapshot is always larger than max_evicted_seq_ is updated to check against the upcoming value of max_evicted_seq_ rather than its current one. This is because AdvanceMaxEvictedSeq gets the list of snapshots lower than the new max, before updating max_evicted_seq_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4894 Differential Revision: D13726988 Pulled By: maysamyabandeh fbshipit-source-id: 1e70d78061b50c944c9816bf4b6dac405ab4ccd3	2019-01-18 11:36:36 -08:00
Yanqin Jin	dd9eca1c58	Remove unused variable to fix clang compilation err (#4893 ) Summary: as title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4893 Differential Revision: D13716733 Pulled By: riversand963 fbshipit-source-id: 6811d6a99fe2094d5344f854e8939f01238b2adb	2019-01-17 11:57:31 -08:00
Yi Wu	128f532858	WritePrepared: fix issue with snapshot released during compaction (#4858 ) Summary: Compaction iterator keep a copy of list of live snapshots at the beginning of compaction, and then query snapshot checker to verify if values of a sequence number is visible to these snapshots. However when the snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info with the snapshot and may report incorrect result, which lead to values being compacted out when it shouldn't. This patch conservatively keep the values if snapshot checker determines that the snapshots is released. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858 Differential Revision: D13617146 Pulled By: maysamyabandeh fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3	2019-01-16 09:55:32 -08:00
Yi Wu	5d4fddfa52	WritePrepared: Fix visible key compacted out by compaction (#4883 ) Summary: With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grab the list of snapshots at its beginning, and only consider those snapshots to dedup keys. Consider the case: ``` seq = 1: put "foo" = "bar" seq = 2: transaction T: delete "foo", prepare seq = 3: compaction start seq = 4: take snapshot S seq = 5: transaction T: commit. ... seq = N: compaction iterator reached key "foo". ``` When compaction start, the list of snapshot is empty. Compaction doesn't take snapshot S into account. When it reached "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible by any snapshot (which is wrong), and compact the value out. The fix is to explicitly take a snapshot before compaction grabbing the list of snapshots. Compaction will then has to keep keys visible to this snapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883 Differential Revision: D13668775 Pulled By: maysamyabandeh fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4	2019-01-15 21:34:38 -08:00
Maysam Yabandeh	cad99a6031	WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886 ) Summary: The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes: i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare. ii) When obtaining the snapshot if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increased the latest published seq by one above the max. To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results into eviction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886 Differential Revision: D13685270 Pulled By: maysamyabandeh fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0	2019-01-15 18:11:52 -08:00
Yi Wu	d50c10ed37	WritePrepared: Fix SmallestUnCommittedSeq() doesn't check delayed_prepared (#4867 ) Summary: When prepared_txns_ heap is empty, SmallestUnCommittedSeq() should check delayed_prepared_ set as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4867 Differential Revision: D13632134 Pulled By: maysamyabandeh fbshipit-source-id: b0423bb0a58dc95f1e636d5ed3f6e619df801fb7	2019-01-15 09:17:53 -08:00
Zhongyi Xie	6a4ec41fed	add assert to silence clang warning (#4871 ) Summary: currently clang analyze fails with the following warning: > utilities/transactions/write_prepared_transaction_test.cc:1451:5: warning: Forming reference to null pointer ASSERT_GT(wp_db->max_evicted_seq_, 0); // max after recovery Pull Request resolved: https://github.com/facebook/rocksdb/pull/4871 Differential Revision: D13638053 Pulled By: miasantreble fbshipit-source-id: b192b0c13c411c58defc9e280b34cdfcab3fa8e3	2019-01-11 12:17:34 -08:00
Maysam Yabandeh	f3a99e8a4d	WritePrepared: Report released snapshots in IsInSnapshot (#4856 ) Summary: Previously IsInSnapshot assumed that the snapshot is valid at the time that the function is called. However there are cases where that might not be valid. Example is background compactions where the compaction algorithm operates with a list of snapshots some of which might be released by the time they are being passed to IsInSnapshot. The patch make two changes to enable the caller to tell difference: i) any live snapshot below max is added to max_committed_seq_, which allows IsInSnapshot to confidently tell whether the passed snapshot is invalid if it below max, ii) extends IsInSnapshot API with a "released" variable that is set true when IsInSnapshot find no such snapshot below max and also find no other way to give a certain return value. In such cases the return value is true but the caller should also check the "released" boolean after the call. In short here is the changes in the API: i) If the snapshot is valid, no change is required. ii) If the snapshot might be invalid, a reference to "released" boolean must be passed to IsInSnapshot. ii-a) If snapshot is above max, IsInSnapshot can figure the return valid using the commit cache. ii-b) otherwise if snapshot is in old_commit_map_, IsInSnapshot can use that to tell if value was visible to the snapshot. ii-c) otherwise it sets "released" to true and returns true as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4856 Differential Revision: D13599847 Pulled By: maysamyabandeh fbshipit-source-id: 1752be28667f886a1efec8cae5714b9b7a8f1e0f	2019-01-08 14:47:29 -08:00
Maysam Yabandeh	cd227d74ba	WritePrepared: improve IsInSnapshotEmptyMapTest (#4853 ) Summary: IsInSnapshotEmptyMapTest tests that IsInSnapshot returns correct value for existing data after a recovery, where max is not zero and yet commit cache is empty. The existing test was preliminary which is improved in this patch. It also increases the db sequence after recovery so that there the snapshot immediately taken after recovery would have a sequence number different than that of max_evicted_seq. This simplifies the logic in IsInSnapshot by not having to consider the special case that an old snapshot might be equal to max_evicted_seq and yet not present in old_commit_map. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4853 Differential Revision: D13595223 Pulled By: maysamyabandeh fbshipit-source-id: 77c12ca8a3f61a47479a93bef2038ff502dc3322	2019-01-08 11:27:11 -08:00
Maysam Yabandeh	f1b0841f06	WritePrepared: followup fix for snapshot double release issue (#4734 ) Summary: The fix in #4727 for double snapshot release was incomplete since it does not properly remove the duplicate entires in the snapshot list after finding that a snapshot is still valid. The patch does that and also improves the unit test to show the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4734 Differential Revision: D13266260 Pulled By: maysamyabandeh fbshipit-source-id: 351e2c40cca45a87b757774c11af74182314911e	2018-11-29 21:01:57 -08:00
Maysam Yabandeh	1a5a93ff74	WritePrepared: Fix double snapshot release issue (#4727 ) Summary: Currently the garbage collection of items in old_commit_map_ was done upon ::ReleaseSnapshot. The assumption behind this method was that the sequence number of snapshots are unique, which is incorrect. In the very rare cases that two consecutive snapshot have the same sequence number this could lead the release of the first snapshot affect the old_commit_map_ that is necessary to service the reads of the second snapshot. The bug would be triggered only if i) two snapshot have the same seq, ii) both of them are very old (older than the last ~4m transactions), and iii) there is commit entry overlapping with the snapshot seq number. It is fixed by doing the cleanup of old_commit_map_ in UpdateSnapshot: the new list of snapshots are compared with the old one and the missing sequence numbers are concluded released. If two snapshots have the same seq number, after the release of one of them, the seq number still appears in the snapshot least and thus not cleaned up prematurely. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4727 Differential Revision: D13246495 Pulled By: maysamyabandeh fbshipit-source-id: 93b87a5042afd8060889df245526d3f5d29de9fe	2018-11-28 19:03:31 -08:00
Maysam Yabandeh	2b5b7bc795	WritePrepared: Fix bug in searching in non-cached snapshots (#4639 ) Summary: When evicting an entry form the commit_cache, it is verified against the list of old snapshots to see if it overlaps with any. The list of old snapshots is split into two lists: an efficient concurrent cache and an slow vector protected by a lock. The patch fixes a bug that would stop the search in the cache if it finds any and yet would not include the larger snapshots in the slower list. An extra info log entry is also removed. The condition to trigger that although very rare is still feasible and should not spam the LOG when that happens. Fixes #4621 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4639 Differential Revision: D12934989 Pulled By: maysamyabandeh fbshipit-source-id: 4e0fe8147ba292b554ae78e94c21c2ef31e03e2d	2018-11-05 23:03:50 -08:00
Maysam Yabandeh	3f5282268f	Skip concurrency control during recovery of pessimistic txn (#4346 ) Summary: TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346 Differential Revision: D9759149 Pulled By: maysamyabandeh fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b	2018-09-10 16:57:53 -07:00
Yi Wu	4f12d49daf	Suppress clang analyzer error (#4299 ) Summary: Suppress multiple clang-analyzer error. All of them are clang false-positive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4299 Differential Revision: D9430740 Pulled By: yiwu-arbug fbshipit-source-id: fbdd575bdc214d124826d61d35a117995c509279	2018-08-21 16:43:05 -07:00
Siying Dong	9c0c8f5ff6	GetAllKeyVersions() to take an extra argument of `max_num_ikeys`. (#4271 ) Summary: Right now, `ldb idump` may have memory out of control if there is a big range of tombstones. Add an option to cut maxinum number of keys in GetAllKeyVersions(), and push down --max_num_ikeys from ldb. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4271 Differential Revision: D9369149 Pulled By: siying fbshipit-source-id: 7cbb797b7d2fa16573495a7e84937456d3ff25bf	2018-08-16 15:57:08 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Siying Dong	4dd80debd0	Remove tests from ROCKSDB_VALGRIND_RUN Summary: In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind. Closes https://github.com/facebook/rocksdb/pull/3924 Differential Revision: D8210184 Pulled By: siying fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a	2018-05-30 16:15:16 -07:00
Maysam Yabandeh	5bed8a0065	WritePrepared Txn: split SeqAdvanceConcurrentTest Summary: The tsan flavor of SeqAdvanceConcurrentTest times out in our test infra. The patch splits it into 10 tests. On my vm before: [ OK ] WritePreparedTransactionTest/WritePreparedTransactionTest.SeqAdvanceConcurrentTest/0 (5194 ms) after: [ OK ] OneWriteQueue/SeqAdvanceConcurrentTest.SeqAdvanceConcurrentTest/0 (1906 ms) Closes https://github.com/facebook/rocksdb/pull/3799 Differential Revision: D7854515 Pulled By: maysamyabandeh fbshipit-source-id: 4fbac42a1f974326cbc237f8cb9d6232d379c431	2018-05-02 18:13:05 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Maysam Yabandeh	6f5e6445d9	WritePrepared Txn: fix smallest_prep atomicity issue Summary: We introduced smallest_prep optimization in this commit `b225de7e10`, which enables storing the smallest uncommitted sequence number along with the snapshot. This enables the readers that read from the snapshot to skip further checks and safely assumed the data is committed if its sequence number is less than smallest uncommitted when the snapshot was taken. The problem was that smallest uncommitted and the snapshot must be taken atomically, and the lack of atomicity had led to readers using a smallest uncommitted after the snapshot was taken and hence mistakenly skipping some data. This patch fixes the problem by i) separating the process of removing of prepare entries from the AddCommitted function, ii) removing the prepare entires AFTER the committed sequence number is published, iii) getting smallest uncommitted (from the prepare list) BEFORE taking a snapshot. This guarantees that the smallest uncommitted that is accompanied with a snapshot is less than or equal of such number if it was obtained atomically. Tested by running MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest that was failing sporadically. Closes https://github.com/facebook/rocksdb/pull/3703 Differential Revision: D7581934 Pulled By: maysamyabandeh fbshipit-source-id: dc9d6f4fb477eba75d4d5927326905b548a96a32	2018-04-11 20:11:51 -07:00
Maysam Yabandeh	3e417a6607	WritePrepared Txn: AddPrepared for all sub-batches Summary: Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches. Closes https://github.com/facebook/rocksdb/pull/3651 Differential Revision: D7388635 Pulled By: maysamyabandeh fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4	2018-03-23 17:30:04 -07:00

1 2

86 Commits