Commit Graph

56 Commits

Author SHA1 Message Date
Maysam Yabandeh
12eaacb71d WritePrepared: Fix SmallestUnCommittedSeq bug (#5683)
Summary:
SmallestUnCommittedSeq reads two data structures, prepared_txns_ and delayed_prepared_. These two are updated in CheckPreparedAgainstMax when max_evicted_seq_ advances some prepared entires. To avoid the cost of acquiring a mutex, the read from them in SmallestUnCommittedSeq is not atomic. This creates a potential race condition.
The fix is to read the two data structures in the reverse order of their update. CheckPreparedAgainstMax copies the prepared entry to delayed_prepared_ before removing it from prepared_txns_ and SmallestUnCommittedSeq looks into prepared_txns_ before reading delayed_prepared_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5683

Differential Revision: D16744699

Pulled By: maysamyabandeh

fbshipit-source-id: b1bdb134018beb0b9de58827f512662bea35cad0
2019-08-09 16:40:00 -07:00
Maysam Yabandeh
208556ee13 WritePrepared: fix Get without snapshot (#5664)
Summary:
if read_options.snapshot is not set, ::Get will take the last sequence number after taking a super-version and uses that as the sequence number. Theoretically max_eviceted_seq_ could advance this sequence number. This could lead ::IsInSnapshot that will be invoked by the ReadCallback to notice the absence of the snapshot. In this case, the ReadCallback should have passed a non-value to snap_released so that it could be set by the ::IsInSnapshot. The patch does that, and adds a unit test to verify it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5664

Differential Revision: D16614033

Pulled By: maysamyabandeh

fbshipit-source-id: 06fb3fd4aacd75806ed1a1acec7961f5d02486f2
2019-08-05 13:41:21 -07:00
Maysam Yabandeh
773f914a40 WritePrepared: switch PreparedHeap from priority_queue to deque (#5436)
Summary:
Internally PreparedHeap is currently using a priority_queue. The rationale was the in the initial design PreparedHeap::AddPrepared could be called in arbitrary order. With the recent optimizations, we call ::AddPrepared only from the main write queue, which results into in-order insertion into PreparedHeap. The patch thus replaces the underlying priority_queue with a more efficient deque implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5436

Differential Revision: D15752147

Pulled By: maysamyabandeh

fbshipit-source-id: e6960f2b2097e13137dded1ceeff3b10b03b0aeb
2019-06-11 19:55:14 -07:00
Maysam Yabandeh
c292dc8540 WritePrepared: reduce prepared_mutex_ overhead (#5420)
Summary:
The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop()
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420

Differential Revision: D15741985

Pulled By: maysamyabandeh

fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
2019-06-10 11:53:31 -07:00
Zhongyi Xie
d68f9f4580 simplify include directive involving inttypes (#5402)
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402

Differential Revision: D15701195

Pulled By: miasantreble

fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
2019-06-06 13:56:07 -07:00
Maysam Yabandeh
eab4f49a2c WritePrepared: skip_concurrency_control option (#5330)
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330

Differential Revision: D15525932

Pulled By: maysamyabandeh

fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
2019-05-28 16:29:45 -07:00
Maysam Yabandeh
fe642cbee6 WritePrepared: fix race condition in reading batch with duplicate keys (#5147)
Summary:
When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete.
The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache.
A unit  test is added to reproduce the race condition with duplicate keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147

Differential Revision: D14758815

Pulled By: maysamyabandeh

fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
2019-04-12 14:40:41 -07:00
Maysam Yabandeh
5234fc1b70 Mark logs with prepare in PreReleaseCallback (#5121)
Summary:
In prepare phase of 2PC, the db promises to remember the prepared data, for possible future commits. To fulfill the promise the prepared data must be persisted in the WAL so that they could be recovered after a crash. The log that contains a prepare batch that is not committed yet, is marked so that it is not garbage collected before the transaction commits/rollbacks. The bug was that the write to the log file and the mark of the file was not atomic, and WAL gc could have happened before the WAL log is actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads would see the WAL write and WAL mark atomically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121

Differential Revision: D14665210

Pulled By: maysamyabandeh

fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
2019-04-02 15:17:47 -07:00
Maysam Yabandeh
14b3f683a1 WriteUnPrepared: less virtual in iterator callback (#5049)
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.

The following benchmark shows +0.26% higher throughput in seekrandom benchmark.

Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench

./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20355 ops/sec;  225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec;  225.9 MB/sec

./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20409 ops/sec;  225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec;  226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049

Differential Revision: D14366459

Pulled By: maysamyabandeh

fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
2019-04-02 14:47:16 -07:00
Maysam Yabandeh
04a2631dbe WritePrepared: handle adding prepare before max_evicted_seq_ (#5025)
Summary:
The patch fixes an improbable race condition between AddPrepared from one write queue and AdvanceMaxEvictedSeq from another queue. In this scenario AddPrepared finds prepare_seq lower than max and adding to PrepareHeap as usual while AdvanceMaxEvictedSeq has finished checking PrepareHeap against the future max. Thus when AdvanceMaxEvictedSeq finishes off by updating the max_evicted_seq_, PrepareHeap ends up with a prepared_seq lower than it which breaks the PrepareHeap contract. The fix is that in AddPrepared we check against the future_max_evicted_seq_ instead, which is update before AdvanceMaxEvictedSeq acquire prepare_mutex_ and looks into PrepareHeap.
A unit test added to test for the failure scenario. The code is also refactored a bit to remove the duplicate code between AdvanceMaxEvictedSeq and AddPrepared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5025

Differential Revision: D14249028

Pulled By: maysamyabandeh

fbshipit-source-id: 072ea56663f40359662c05fafa6ac524417b0622
2019-03-07 07:41:15 -08:00
Maysam Yabandeh
68a2f94d5d WritePrepared: commit only from the 2nd queue (#5014)
Summary:
When two_write_queues is enabled we call ::AddPrepared only from the main queue, which writes to both WAL and memtable, and call ::AddCommitted from the 2nd queue, which writes only to WAL. This simplifies the logic by avoiding concurrency between AddPrepared and also between AddCommitted. The patch fixes one case that did not conform with the rule above. This would allow future refactoring. For example AdvaneMaxEvictedSeq, which is invoked by AddCommitted, can be simplified by assuming lack of concurrent calls to it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5014

Differential Revision: D14210493

Pulled By: maysamyabandeh

fbshipit-source-id: 6db5ba372a294a568a14caa010576460917a4eab
2019-02-28 15:23:34 -08:00
Maysam Yabandeh
a661c0d208 WritePrepared: optimize read path by avoiding virtual (#5018)
Summary:
The read path includes a callback function, ReadCallback, which would eventually calls IsInSnapshot to figure if a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function call to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of majority of the cases. The patch moves those to a non-virtual function in the the parent class, ReadCallback, to lower the virtual callback cost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018

Differential Revision: D14226562

Pulled By: maysamyabandeh

fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44
2019-02-26 16:56:19 -08:00
Maysam Yabandeh
0f4244fe00 WritePrepared: Improve stress tests with slow threads (#4974)
Summary:
The transaction stress tests, stress a high concurrency scenario. In WritePrepared/WriteUnPrepared we need to also stress the scenarios where an inserting/reading transaction is very slow. This would stress the corner cases that the caching is not sufficient and other slower data structures are engaged. To emulate such cases we make use of slow inserter/verifier threads and also reduce the size of cache data structures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4974

Differential Revision: D14143070

Pulled By: maysamyabandeh

fbshipit-source-id: 81eb674678faf9fae0f654cd60ebcc74e26aeee7
2019-02-19 16:56:49 -08:00
Maysam Yabandeh
bcdc8c8b19 WritePrepared: max_evicted_seq_ update during commit cache lookup (#4955)
Summary:
max_evicted_seq_ could be updated in the middle of the read in ::IsInSnapshot. The code to be correct in presence of this update would be complicated. The patch simplifies it by checking the value of max_evicted_seq_ before and after looking into commit_cache_ and retries in the unlucky case that it was changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4955

Differential Revision: D13999556

Pulled By: maysamyabandeh

fbshipit-source-id: 7a1bdfa95ea8b5d8d73ddff3263ed31d7297b39c
2019-02-19 16:14:08 -08:00
Maysam Yabandeh
9144d1f186 WritePrepared: add private options to TransactionDBOptions (#4966)
Summary:
WritePreparedTransactionDB operates with more options which should not be configurable to avoid complicating it for the users. For testing purposes however we need to change the default value of this parameters. This patch makes these parameters private fields in TransactionDBOptions so that the existing ::Open API could use them seamlessly without however exposing them to the users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4966

Differential Revision: D14015986

Pulled By: maysamyabandeh

fbshipit-source-id: 13037efa7dfdd6f73ec7a19414b66571e044c633
2019-02-11 14:44:02 -08:00
Maysam Yabandeh
199fabc197 WritePrepared: non-atomic commit of delayed prepared (#4947)
Summary:
Commit of delayed prepared has two non-atomic steps: add to commit cache, remove from delayed_prepared_. Similarly in ::IsInSnapshot we read from commit cache first and then look into delayed_prepared_. Due to non-atomicity thus the reader might not find the
prep_seq that is just committed neither in commit cache nor in delayed_prepared_. To fix that i)
we check if there was any delayed prepared BEFORE looking into commit
cache, ii) if there was, we complete the search steps to be these: i)
commit cache, ii) delayed prepared, commit cache again. In this way if
the first query to commit cache missed the commit, the 2nd will catch it. The cost of the redundant read from commit cache is paid only if delayed_prepared_ is nonempty which should be a very rare scenario.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4947

Differential Revision: D13952754

Pulled By: maysamyabandeh

fbshipit-source-id: 8f47826b13f8ce154398d842028342423f4ca2b2
2019-02-06 08:48:06 -08:00
Maysam Yabandeh
dcb73e7735 WritePrepared: release snapshot equal to max (#4944)
Summary:
WritePrepared maintains a list of snapshots that are <= max_evicted_seq_. Based on this list, old_commit_map_ is updated if an evicted commit entry overlaps with such snapshot. Such lists are garbage collected when the release of snapshot is reported to WritePreparedTxnDB, which is the next time max_evicted_seq_ is updated and yet the snapshot is not found is the list returned from DB. This logic was broken since ReleaseSnapshotInternal was using "< max_evicted_seq_" to cleanup old_commit_map_, which would leave a snapshot uncleaned if it "= max_evicted_seq_". The patch fixes that and adds a unit test to check for the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4944

Differential Revision: D13945000

Pulled By: maysamyabandeh

fbshipit-source-id: 0c904294f735911f52348a148bf1f945282fc17c
2019-02-04 12:57:23 -08:00
Alexander Zinoviev
32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
Yi Wu
b1ad6ebba8 WritePrepared: fix two versions in compaction see different status for released snapshots (#4890)
Summary:
Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios:

Scenario 1:
key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1.
The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key.

Scenario 2:
key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here.
The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890

Differential Revision: D13705538

Pulled By: maysamyabandeh

fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4
2019-01-18 17:24:06 -08:00
Maysam Yabandeh
7fd9813b9f WritePrepared: commit of delayed prepared entries (#4894)
Summary:
Here is the order of ops in a commit: 1) update commit cache 2) publish seq, 3) RemovePrepared. In case of a delayed prepared, there will be a gap between when the commit is visible to snapshots until delayed_prepared_ is cleaned up. To tell apart this case from a delayed uncommitted txn from, the commit entry of a delayed prepared is also stored in delayed_prepared_commits_, which is updated before publishing the commit.
Also logic in GetSnapshotInternal that ensures that each new snapshot is always larger than max_evicted_seq_ is updated to check against the upcoming value of max_evicted_seq_ rather than its current one. This is because AdvanceMaxEvictedSeq gets the list of snapshots lower than the new max, before updating max_evicted_seq_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4894

Differential Revision: D13726988

Pulled By: maysamyabandeh

fbshipit-source-id: 1e70d78061b50c944c9816bf4b6dac405ab4ccd3
2019-01-18 11:36:36 -08:00
Yi Wu
128f532858 WritePrepared: fix issue with snapshot released during compaction (#4858)
Summary:
Compaction iterator keep a copy of list of live snapshots at the beginning of compaction, and then query snapshot checker to verify if values of a sequence number is visible to these snapshots. However when the snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info with the snapshot and may report incorrect result, which lead to values being compacted out when it shouldn't. This patch conservatively keep the values if snapshot checker determines that the snapshots is released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858

Differential Revision: D13617146

Pulled By: maysamyabandeh

fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3
2019-01-16 09:55:32 -08:00
Maysam Yabandeh
cad99a6031 WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886)
Summary:
The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes:
i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare.
ii) When obtaining the snapshot if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increased the latest published seq by one above the max.
To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results into eviction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886

Differential Revision: D13685270

Pulled By: maysamyabandeh

fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0
2019-01-15 18:11:52 -08:00
Yi Wu
d50c10ed37 WritePrepared: Fix SmallestUnCommittedSeq() doesn't check delayed_prepared (#4867)
Summary:
When prepared_txns_ heap is empty, SmallestUnCommittedSeq() should check delayed_prepared_ set as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4867

Differential Revision: D13632134

Pulled By: maysamyabandeh

fbshipit-source-id: b0423bb0a58dc95f1e636d5ed3f6e619df801fb7
2019-01-15 09:17:53 -08:00
Maysam Yabandeh
f3a99e8a4d WritePrepared: Report released snapshots in IsInSnapshot (#4856)
Summary:
Previously IsInSnapshot assumed that the snapshot is valid at the time that the function is called. However there are cases where that might not be valid. Example is background compactions where the compaction algorithm operates with a list of snapshots some of which might be released by the time they are being passed to IsInSnapshot. The patch make two changes to enable the caller to tell difference: i) any live snapshot below max is added to max_committed_seq_, which allows IsInSnapshot to confidently tell whether the passed snapshot is invalid if it below max, ii) extends IsInSnapshot API with a "released" variable that is set true when IsInSnapshot find no such snapshot below max and also find no other way to give a certain return value. In such cases the return value is true but the caller should also check the "released" boolean after the call.
In short here is the changes in the API:
i) If the snapshot is valid, no change is required.
ii) If the snapshot might be invalid, a reference to "released" boolean must be passed to IsInSnapshot.
ii-a) If snapshot is above max, IsInSnapshot can figure the return valid using the commit cache.
ii-b) otherwise if snapshot is in old_commit_map_, IsInSnapshot can use that to tell if value was visible to the snapshot.
ii-c) otherwise it sets "released" to true and returns true as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4856

Differential Revision: D13599847

Pulled By: maysamyabandeh

fbshipit-source-id: 1752be28667f886a1efec8cae5714b9b7a8f1e0f
2019-01-08 14:47:29 -08:00
Maysam Yabandeh
1a5a93ff74 WritePrepared: Fix double snapshot release issue (#4727)
Summary:
Currently the garbage collection of items in old_commit_map_ was done upon ::ReleaseSnapshot. The assumption behind this method was that the sequence number of snapshots are unique, which is incorrect. In the very rare cases that two consecutive snapshot have the same sequence number this could lead the release of the first snapshot affect the old_commit_map_ that is necessary to service the reads of the second snapshot. The bug would be triggered only if i) two snapshot have the same seq, ii) both of them are very old (older than the last ~4m transactions), and iii) there is commit entry overlapping with the snapshot seq number.
It is fixed by doing the cleanup of old_commit_map_ in UpdateSnapshot: the new list of snapshots are compared with the old one and the missing sequence numbers are concluded released. If two snapshots have the same seq number, after the release of one of them, the seq number still appears in the snapshot least and thus not cleaned up prematurely.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4727

Differential Revision: D13246495

Pulled By: maysamyabandeh

fbshipit-source-id: 93b87a5042afd8060889df245526d3f5d29de9fe
2018-11-28 19:03:31 -08:00
Maysam Yabandeh
2b5b7bc795 WritePrepared: Fix bug in searching in non-cached snapshots (#4639)
Summary:
When evicting an entry form the commit_cache, it is verified against the list of old snapshots to see if it overlaps with any. The list of old snapshots is split into two lists: an efficient concurrent cache and an slow vector protected by a lock. The patch fixes a bug that would stop the search in the cache if it finds any and yet would not include the larger snapshots in the slower list.
An extra info log entry is also removed. The condition to trigger that although very rare is still feasible and should not spam the LOG when that happens.
Fixes #4621
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4639

Differential Revision: D12934989

Pulled By: maysamyabandeh

fbshipit-source-id: 4e0fe8147ba292b554ae78e94c21c2ef31e03e2d
2018-11-05 23:03:50 -08:00
jsteemann
d1c0d3f358 Small issues (#4564)
Summary:
Couple of very minor improvements (typos in comments, full qualification of class name, reordering members of a struct to make it smaller)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4564

Differential Revision: D10510183

Pulled By: maysamyabandeh

fbshipit-source-id: c7ddf9bfbf2db08cd31896c3fd93789d3fa68c8b
2018-10-23 10:35:57 -07:00
Manuel Ung
ea212e5316 WriteUnPrepared: Implement unprepared batches for transactions (#4104)
Summary:
This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.

Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`.

A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104

Differential Revision: D8785717

Pulled By: lth

fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
2018-07-24 00:13:18 -07:00
Manuel Ung
b9846370e9 WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078)
Summary:
This adds support for recovering WriteUnprepared transactions through the following changes:
- The information in `RecoveredTransaction` is extended so that it can reference multiple batches.
- `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not.
- `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway.

Commit/Rollback of live transactions is still unimplemented and will come later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078

Differential Revision: D8703382

Pulled By: lth

fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a
2018-07-06 17:59:13 -07:00
Manuel Ung
a16e00b7b9 WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955)
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.

There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955

Differential Revision: D8286688

Pulled By: lth

fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
2018-06-27 12:23:07 -07:00
Maysam Yabandeh
cfb86659bf WritePrepared Txn: enable rollback in stress test
Summary:
Rollback was disabled in stress test since there was a concurrency issue in WritePrepared rollback algorithm. The issue is fixed by caching the column family handles in WritePrepared to skip getting them from the db when needed for rollback.

Tested by running transaction stress test under tsan.
Closes https://github.com/facebook/rocksdb/pull/3785

Differential Revision: D7793727

Pulled By: maysamyabandeh

fbshipit-source-id: d81ab6fda0e53186ca69944cfe0712ce4869451e
2018-05-02 18:13:05 -07:00
Maysam Yabandeh
bb2a2ec731 WritePrepared Txn: rollback via commit
Summary:
Currently WritePrepared rolls back a transaction with prepare sequence number prepare_seq by i) write a single rollback batch with rollback_seq, ii) add <rollback_seq, rollback_seq> to commit cache, iii) remove prepare_seq from PrepareHeap.
This is correct assuming that there is no snapshot taken when a transaction is rolled back. This is the case the way MySQL does rollback which is after recovery. Otherwise if max_evicted_seq advances the prepare_seq, the live snapshot might assume data as committed since it does not find them in CommitCache.
The change is to simply add <prepare_seq. rollback_seq> to commit cache before removing prepare_seq from PrepareHeap. In this way if max_evicted_seq advances prpeare_seq, the existing mechanism that we have to check evicted entries against live snapshots will make sure that the live snapshot will not see the data of rolled back transaction.
Closes https://github.com/facebook/rocksdb/pull/3745

Differential Revision: D7696193

Pulled By: maysamyabandeh

fbshipit-source-id: c9a2d46341ddc03554dded1303520a1cab74ef9c
2018-04-20 15:28:19 -07:00
David Lai
3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Maysam Yabandeh
6f5e6445d9 WritePrepared Txn: fix smallest_prep atomicity issue
Summary:
We introduced smallest_prep optimization in this commit b225de7e10, which enables storing the smallest uncommitted sequence number along with the snapshot. This enables the readers that read from the snapshot to skip further checks and safely assumed the data is committed if its sequence number is less than smallest uncommitted when the snapshot was taken. The problem was that smallest uncommitted and the snapshot must be taken atomically, and the lack of atomicity had led to readers using a smallest uncommitted after the snapshot was taken and hence mistakenly skipping some data.
This patch fixes the problem by i) separating the process of removing of prepare entries from the AddCommitted function, ii) removing the prepare entires AFTER the committed sequence number is published, iii) getting smallest uncommitted (from the prepare list) BEFORE taking a snapshot. This guarantees that the smallest uncommitted that is accompanied with a snapshot is less than or equal of such number if it was obtained atomically.

Tested by running MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest that was failing sporadically.
Closes https://github.com/facebook/rocksdb/pull/3703

Differential Revision: D7581934

Pulled By: maysamyabandeh

fbshipit-source-id: dc9d6f4fb477eba75d4d5927326905b548a96a32
2018-04-11 20:11:51 -07:00
Maysam Yabandeh
bde1c1a72a WritePrepared Txn: add stats
Summary:
Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex.
Closes https://github.com/facebook/rocksdb/pull/3683

Differential Revision: D7529393

Pulled By: maysamyabandeh

fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
2018-04-07 21:56:42 -07:00
Maysam Yabandeh
b225de7e10 WritePrepared Txn: smallest_prepare optimization
Summary:
The is an optimization to reduce lookup in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time that the snapshot was taken and if the sequence number of the read data is lower than that number it assumes the data as committed.
To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
Closes https://github.com/facebook/rocksdb/pull/3649

Differential Revision: D7388630

Pulled By: maysamyabandeh

fbshipit-source-id: b79506238c17467d590763582960d4d90181c600
2018-04-02 20:27:41 -07:00
Maysam Yabandeh
0999e9b79a WritePrepared Txn: Increase commit cache size to 2^23
Summary:
Current commit cache size is 2^21. This was due to a type. With 2^23 commit entries we can have transactions as long as 64s without incurring the cost of having them evicted from the commit cache before their commit. Here is the math:
2^23 / 2 (one out of two seq numbers are for commit) / 2^16 TPS = 2^6 = 64s
Closes https://github.com/facebook/rocksdb/pull/3657

Differential Revision: D7411211

Pulled By: maysamyabandeh

fbshipit-source-id: e7cacf40579f3acf940643d8a1cfe5dd201caa35
2018-03-26 19:45:17 -07:00
Maysam Yabandeh
3e417a6607 WritePrepared Txn: AddPrepared for all sub-batches
Summary:
Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches.
Closes https://github.com/facebook/rocksdb/pull/3651

Differential Revision: D7388635

Pulled By: maysamyabandeh

fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4
2018-03-23 17:30:04 -07:00
Maysam Yabandeh
7429b20e39 WritePrepared Txn: fix race condition on publishing seq
Summary:
This commit fixes a race condition on calling SetLastPublishedSequence. The function must be called only from the 2nd write queue when two_write_queues is enabled. However there was a bug that would also call it from the main write queue if CommitTimeWriteBatch is provided to the commit request and yet use_only_the_last_commit_time_batch_for_recovery optimization is not enabled. To fix that we penalize the commit request in such cases by doing an additional write solely to publish the seq number from the 2nd queue.
Closes https://github.com/facebook/rocksdb/pull/3641

Differential Revision: D7361508

Pulled By: maysamyabandeh

fbshipit-source-id: bf8f7a27e5cccf5425dccbce25eb0032e8e5a4d7
2018-03-22 14:43:36 -07:00
Bruce Mitchener
a3a3f5497c Fix some typos in comments and docs.
Summary: Closes https://github.com/facebook/rocksdb/pull/3568

Differential Revision: D7170953

Pulled By: siying

fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c
2018-03-08 10:27:25 -08:00
Dmitri Smirnov
c364eb42b5 Windows cumulative patch
Summary:
This patch addressed several issues.
  Portability including db_test std::thread -> port::Thread Cc: @
  and %z to ROCKSDB portable macro. Cc: maysamyabandeh

  Implement Env::AreFilesSame

  Make the implementation of file unique number more robust

  Get rid of C-runtime and go directly to Windows API when dealing
  with file primitives.

  Implement GetSectorSize() and aling unbuffered read on the value if
  available.

  Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976

  Fix test running script issue where $status var was of incorrect scope
  so the failures were swallowed and not reported.

  DestroyDB() creates a logger and opens a LOG file in the directory
  being cleaned up. This holds a lock on the folder and the cleanup is
  prevented. This fails one of the checkpoin tests. We observe the same in production.
  We close the log file in this change.

 Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
 attempts to open a directory with NewRandomAccessFile which does not
 work on Windows.
  Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552

Differential Revision: D7156304

Pulled By: siying

fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
2018-03-06 11:57:43 -08:00
Maysam Yabandeh
62277e15c3 WritePrepared Txn: Move DuplicateDetector to util
Summary:
Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test.
Closes https://github.com/facebook/rocksdb/pull/3567

Differential Revision: D7163268

Pulled By: maysamyabandeh

fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778
2018-03-05 23:57:12 -08:00
Maysam Yabandeh
680864ae54 WritePrepared Txn: Fix bug with duplicate keys during recovery
Summary:
Fix the following bugs:
- During recovery a duplicate key was inserted twice into the write batch of the recovery transaction,
once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing.
- If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number.
- SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected.
Closes https://github.com/facebook/rocksdb/pull/3562

Differential Revision: D7149440

Pulled By: maysamyabandeh

fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77
2018-03-05 10:57:59 -08:00
Maysam Yabandeh
828211e901 WritePrepared Txn: fix non-emptied PreparedHeap bug
Summary:
Under a certain sequence of accessing PreparedHeap, there was a bug that would not successfully empty the heap. This would result in performance issues when the heap content is moved to old_prepared_ after max_evicted_seq_ advances the orphan prepared sequence numbers. The patch fixed the bug and add more unit tests. It also does more logging when the unlikely scenarios are faced
Closes https://github.com/facebook/rocksdb/pull/3526

Differential Revision: D7038486

Pulled By: maysamyabandeh

fbshipit-source-id: f1e40bea558f67b03d2a29131fcb8734c65fce97
2018-02-21 13:42:23 -08:00
Maysam Yabandeh
c178da053b WritePrepared Txn: optimizations for sysbench update_noindex
Summary:
These are optimization that we applied to improve sysbech's update_noindex performance.
1. Make use of LIKELY compiler hint
2. Move std::atomic so the subclass
3. Make use of skip_prepared in non-2pc transactions.
Closes https://github.com/facebook/rocksdb/pull/3512

Differential Revision: D7000075

Pulled By: maysamyabandeh

fbshipit-source-id: 1ab8292584df1f6305a4992973fb1b7933632181
2018-02-16 08:42:31 -08:00
Maysam Yabandeh
8a04ee4fd1 WritePrepared Txn: use TransactionDBWriteOptimizations (2nd attempt)
Summary:
TransactionDB::Write can receive some optimization hints from the user. One is to skip the concurrency control mechanism. WritePreparedTxnDB is currently ignoring such hints. This patch optimizes WritePreparedTxnDB::Write for skip_concurrency_control and skip_duplicate_key_check hints.
Closes https://github.com/facebook/rocksdb/pull/3496

Differential Revision: D6971784

Pulled By: maysamyabandeh

fbshipit-source-id: cbab10ad538fa2b8bcb47e37c77724afe6e30f03
2018-02-12 16:43:40 -08:00
Maysam Yabandeh
8feee28020 Add skip_cc option to TransactionDB::Write
Summary:
Compared to DB::Write, TransactionDB::Write has the additional overhead of creating and initializing an internal transaction object, as well as the overhead of locking/unlocking the keys. This patch extends the TransactionDB::Write with an skip_cc option to allow the users to indicate that the write batch do not conflict with others and the concurrency control and its overhead can be skipped. TransactionDB::Write by default calls DB::Write when skip_cc is set, which works for WriteCommitted WritePolicy. Any other flavor of TransactionDB that is not compatible with this default behavior (such as WritePreparedTxnDB) can extend ::Write and implement their own approach for taking into account the skip_cc optimization.
Closes https://github.com/facebook/rocksdb/pull/3457

Differential Revision: D6877318

Pulled By: maysamyabandeh

fbshipit-source-id: 56f4e21db87ff71492db4e376fb7c2b03dfeab6b
2018-02-06 15:28:24 -08:00
Maysam Yabandeh
88d8b2a2f5 WritePrepared Txn: Duplicate Keys, Txn Part
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455

Differential Revision: D6873699

Pulled By: maysamyabandeh

fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
2018-02-05 18:43:24 -08:00
Maysam Yabandeh
3073b1c573 Split SnapshotConcurrentAccessTest into 20 sub tests
Summary:
SnapshotConcurrentAccessTest sometimes times out when running on the test infra. This patch splits the test into smaller sub-tests to avoid the timeout. It also benefits from lower run-time of each sub-test and increases the coverage of the test. The overall run-time of each final sub-test is at most half of the original test so we should no longer see a timeout.
Closes https://github.com/facebook/rocksdb/pull/3435

Differential Revision: D6839427

Pulled By: maysamyabandeh

fbshipit-source-id: d53fdb157109e2438ca7fe447d0cf4b71f304bd8
2018-01-29 17:12:55 -08:00
Maysam Yabandeh
00b33c2474 WritePrepared Txn: address some pending TODOs
Summary:
This patch addresses a couple of minor TODOs for WritePrepared Txn such as double checking some assert statements at runtime as well, skip extra AddPrepared in non-2pc transactions, and safety check for infinite loops.
Closes https://github.com/facebook/rocksdb/pull/3302

Differential Revision: D6617002

Pulled By: maysamyabandeh

fbshipit-source-id: ef6673c139cb49f64c0879508d2f573b78609aca
2018-01-09 08:57:20 -08:00