// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
Replace tracked_keys with a new LockTracker interface in TransactionDB (#7013)
Summary:
We're going to support more locking protocols such as range lock in transaction.
However, in current design, `TransactionBase` has a member `tracked_keys` which assumes that point lock (lock a single key) is used, and is used in snapshot checking (isolation protocol). When using range lock, we may use read committed instead of snapshot checking as the isolation protocol.
The most significant usage scenarios of `tracked_keys` are:
1. pessimistic transaction uses it to track the locked keys, and unlock these keys when commit or rollback.
2. optimistic transaction does not lock keys upfront, it only tracks the lock intentions in tracked_keys, and do write conflict checking when commit.
3. each `SavePoint` tracks the keys that are locked since the `SavePoint`, `RollbackToSavePoint` or `PopSavePoint` relies on both the tracked keys in `SavePoint`s and `tracked_keys`.
Based on these scenarios, if we can abstract out a `LockTracker` interface to hold a set of tracked locks (can be keys or key ranges), and have methods that can be composed together to implement the scenarios, then `tracked_keys` can be an internal data structure of one implementation of `LockTracker`. See `utilities/transactions/lock/lock_tracker.h` for the detailed interface design, and `utilities/transactions/lock/point_lock_tracker.cc` for the implementation.
In the future, a `RangeLockTracker` can be implemented to track range locks without affecting other components.
After this PR, a clean interface for lock manager should be possible, and then ideally, we can have pluggable locking protocols.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7013
Test Plan: Run `transaction_test` and `optimistic_transaction_test`.
Reviewed By: ajkr
Differential Revision: D22163706
Pulled By: cheng-chang
fbshipit-source-id: f2860577b5334e31dd2994f5bc6d7c40d502b1b4
2020-08-06 21:36:48 +02:00
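The savepoint scenario in the commit message above can be sketched as a small tracker abstraction. This is a minimal illustration, not the real `LockTracker` API: `SketchLockTracker`, `SketchPointLockTracker` and `TrackSince` are hypothetical names, and the point-lock implementation keeps only a set of keys per column family, whereas the real tracker may keep more per-key state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified tracker interface (not the RocksDB one).
class SketchLockTracker {
 public:
  virtual ~SketchLockTracker() = default;
  virtual void Track(uint32_t cf_id, const std::string& key) = 0;
  virtual bool IsTracked(uint32_t cf_id, const std::string& key) const = 0;
  // Removes every lock that `other` tracks from this tracker.
  virtual void Subtract(const SketchLockTracker& other) = 0;
  virtual size_t Count() const = 0;
};

// Point-lock implementation: the key set is an internal detail.
class SketchPointLockTracker : public SketchLockTracker {
 public:
  void Track(uint32_t cf_id, const std::string& key) override {
    keys_[cf_id].insert(key);
  }
  bool IsTracked(uint32_t cf_id, const std::string& key) const override {
    auto it = keys_.find(cf_id);
    return it != keys_.end() && it->second.count(key) > 0;
  }
  void Subtract(const SketchLockTracker& other) override {
    // Assumes both trackers are point-lock trackers.
    const auto& rhs = dynamic_cast<const SketchPointLockTracker&>(other);
    for (const auto& entry : rhs.keys_) {
      auto it = keys_.find(entry.first);
      if (it == keys_.end()) continue;
      for (const auto& key : entry.second) it->second.erase(key);
      if (it->second.empty()) keys_.erase(it);
    }
  }
  size_t Count() const override {
    size_t n = 0;
    for (const auto& entry : keys_) n += entry.second.size();
    return n;
  }

 private:
  std::map<uint32_t, std::set<std::string>> keys_;
};

// Tracks a lock transaction-wide and, only if it is new, in the tracker of
// the current savepoint, so a rollback never drops an older lock.
inline void TrackSince(SketchLockTracker& all, SketchLockTracker& since,
                       uint32_t cf_id, const std::string& key) {
  if (!all.IsTracked(cf_id, key)) since.Track(cf_id, key);
  all.Track(cf_id, key);
}
```

Rolling back to a savepoint then amounts to `all.Subtract(since)`: locks acquired before the savepoint stay tracked, and a range-lock tracker could implement the same interface without touching the caller.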
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#pragma once

#ifndef ROCKSDB_LITE

#include <stack>
#include <string>
#include <vector>

#include "db/write_batch_internal.h"
#include "rocksdb/db.h"
#include "rocksdb/slice.h"
#include "rocksdb/snapshot.h"
#include "rocksdb/status.h"
#include "rocksdb/types.h"
#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"
#include "rocksdb/utilities/write_batch_with_index.h"
refactor SavePoints (#5192)
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points` using an std::stack.
Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
allocated `SavePoints` struct. The destructor of `WriteBatch` simply
deletes this pointer. However, the copy constructor of WriteBatch
just copied that pointer, meaning that copying a WriteBatch with
active savepoints will very likely have crashed before. Now a proper
copy of the savepoints is made in the copy constructor, and not just
a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
the underlying container. A deque is a bit over the top here, as we
only need access to the most recent savepoint (i.e. stack.top()) but
never any elements at the front. std::deque is rather expensive to
initialize in common environments. For example, the STL implementation
shipped with GNU g++ will perform a heap allocation of more than 500
bytes to create an empty deque object. Although the `save_points_`
container is created lazily by RocksDB, moving from a deque to a plain
`std::vector` is much more memory-efficient. So `save_points_` is now
a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
making ownership more explicit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192
Differential Revision: D15024074
Pulled By: maysamyabandeh
fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
2019-04-20 05:30:03 +02:00
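The ownership fix described in the commit message above can be illustrated with a toy batch class. This is only a sketch under stated assumptions: `SketchBatch` and `SketchSavePoint` are hypothetical stand-ins, not `WriteBatch` itself, and a savepoint here records nothing but the payload length. It shows the three points from the message: a lazily created container, a `std::vector` used as a stack, and a copy constructor that copies the savepoints rather than the pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the state a savepoint records.
struct SketchSavePoint {
  size_t size;  // payload length when the savepoint was set
};

class SketchBatch {
 public:
  SketchBatch() = default;

  // Deep copy: the new batch gets its own savepoint vector.
  SketchBatch(const SketchBatch& other) : rep_(other.rep_) {
    if (other.save_points_) {
      save_points_.reset(
          new std::vector<SketchSavePoint>(*other.save_points_));
    }
  }

  SketchBatch& operator=(const SketchBatch& other) {
    if (this != &other) {
      SketchBatch tmp(other);
      rep_.swap(tmp.rep_);
      save_points_.swap(tmp.save_points_);
    }
    return *this;
  }

  void Append(const std::string& data) { rep_ += data; }

  void SetSavePoint() {
    // Created lazily, so batches without savepoints pay nothing.
    if (!save_points_) {
      save_points_.reset(new std::vector<SketchSavePoint>());
    }
    save_points_->push_back(SketchSavePoint{rep_.size()});
  }

  // Returns false if there is no savepoint to roll back to.
  bool RollbackToSavePoint() {
    if (!save_points_ || save_points_->empty()) return false;
    rep_.resize(save_points_->back().size);  // only the top is ever used
    save_points_->pop_back();
    return true;
  }

  const std::string& Data() const { return rep_; }

 private:
  std::string rep_;
  std::unique_ptr<std::vector<SketchSavePoint>> save_points_;
};
```

With a raw pointer and a member-wise copy, both batches would share one container and both destructors would delete it; with the deep copy, each batch can roll back independently.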
#include "util/autovector.h"
#include "utilities/transactions/lock/lock_tracker.h"
#include "utilities/transactions/transaction_util.h"

namespace ROCKSDB_NAMESPACE {

class TransactionBaseImpl : public Transaction {
 public:
  TransactionBaseImpl(DB* db, const WriteOptions& write_options,
                      const LockTrackerFactory& lock_tracker_factory);

  ~TransactionBaseImpl() override;

  // Remove pending operations queued in this transaction.
  virtual void Clear();

  void Reinitialize(DB* db, const WriteOptions& write_options);

  // Called before executing Put, Merge, Delete, and GetForUpdate. If TryLock
  // returns non-OK, the Put/Merge/Delete/GetForUpdate will fail.
  // do_validate will be false if called from PutUntracked, DeleteUntracked,
  // MergeUntracked, or GetForUpdate(do_validate=false)
  virtual Status TryLock(ColumnFamilyHandle* column_family, const Slice& key,
                         bool read_only, bool exclusive,
                         const bool do_validate = true,
                         const bool assume_tracked = false) = 0;

  void SetSavePoint() override;

  Status RollbackToSavePoint() override;

  Status PopSavePoint() override;

  using Transaction::Get;
  Status Get(const ReadOptions& options, ColumnFamilyHandle* column_family,
             const Slice& key, std::string* value) override;

  Status Get(const ReadOptions& options, ColumnFamilyHandle* column_family,
             const Slice& key, PinnableSlice* value) override;

  Status Get(const ReadOptions& options, const Slice& key,
             std::string* value) override {
    return Get(options, db_->DefaultColumnFamily(), key, value);
  }

  using Transaction::GetForUpdate;
  Status GetForUpdate(const ReadOptions& options,
                      ColumnFamilyHandle* column_family, const Slice& key,
                      std::string* value, bool exclusive,
                      const bool do_validate) override;

  Status GetForUpdate(const ReadOptions& options,
                      ColumnFamilyHandle* column_family, const Slice& key,
                      PinnableSlice* pinnable_val, bool exclusive,
                      const bool do_validate) override;

  Status GetForUpdate(const ReadOptions& options, const Slice& key,
                      std::string* value, bool exclusive,
                      const bool do_validate) override {
    return GetForUpdate(options, db_->DefaultColumnFamily(), key, value,
                        exclusive, do_validate);
  }

  using Transaction::MultiGet;
  std::vector<Status> MultiGet(
      const ReadOptions& options,
      const std::vector<ColumnFamilyHandle*>& column_family,
      const std::vector<Slice>& keys,
      std::vector<std::string>* values) override;

  std::vector<Status> MultiGet(const ReadOptions& options,
                               const std::vector<Slice>& keys,
                               std::vector<std::string>* values) override {
    return MultiGet(options, std::vector<ColumnFamilyHandle*>(
                                 keys.size(), db_->DefaultColumnFamily()),
                    keys, values);
  }

  void MultiGet(const ReadOptions& options, ColumnFamilyHandle* column_family,
                const size_t num_keys, const Slice* keys, PinnableSlice* values,
                Status* statuses, const bool sorted_input = false) override;

  using Transaction::MultiGetForUpdate;
  std::vector<Status> MultiGetForUpdate(
      const ReadOptions& options,
      const std::vector<ColumnFamilyHandle*>& column_family,
      const std::vector<Slice>& keys,
      std::vector<std::string>* values) override;

  std::vector<Status> MultiGetForUpdate(
      const ReadOptions& options, const std::vector<Slice>& keys,
      std::vector<std::string>* values) override {
    return MultiGetForUpdate(options,
                             std::vector<ColumnFamilyHandle*>(
                                 keys.size(), db_->DefaultColumnFamily()),
                             keys, values);
  }

  Iterator* GetIterator(const ReadOptions& read_options) override;
  Iterator* GetIterator(const ReadOptions& read_options,
                        ColumnFamilyHandle* column_family) override;

  Status Put(ColumnFamilyHandle* column_family, const Slice& key,
             const Slice& value, const bool assume_tracked = false) override;
  Status Put(const Slice& key, const Slice& value) override {
    return Put(nullptr, key, value);
  }

  Status Put(ColumnFamilyHandle* column_family, const SliceParts& key,
             const SliceParts& value,
             const bool assume_tracked = false) override;
  Status Put(const SliceParts& key, const SliceParts& value) override {
    return Put(nullptr, key, value);
  }

  Status Merge(ColumnFamilyHandle* column_family, const Slice& key,
               const Slice& value, const bool assume_tracked = false) override;
  Status Merge(const Slice& key, const Slice& value) override {
    return Merge(nullptr, key, value);
  }

  Status Delete(ColumnFamilyHandle* column_family, const Slice& key,
                const bool assume_tracked = false) override;
  Status Delete(const Slice& key) override { return Delete(nullptr, key); }
  Status Delete(ColumnFamilyHandle* column_family, const SliceParts& key,
                const bool assume_tracked = false) override;
  Status Delete(const SliceParts& key) override { return Delete(nullptr, key); }

  Status SingleDelete(ColumnFamilyHandle* column_family, const Slice& key,
                      const bool assume_tracked = false) override;
  Status SingleDelete(const Slice& key) override {
    return SingleDelete(nullptr, key);
  }
  Status SingleDelete(ColumnFamilyHandle* column_family, const SliceParts& key,
                      const bool assume_tracked = false) override;
  Status SingleDelete(const SliceParts& key) override {
    return SingleDelete(nullptr, key);
  }

  Status PutUntracked(ColumnFamilyHandle* column_family, const Slice& key,
                      const Slice& value) override;
  Status PutUntracked(const Slice& key, const Slice& value) override {
    return PutUntracked(nullptr, key, value);
  }

  Status PutUntracked(ColumnFamilyHandle* column_family, const SliceParts& key,
                      const SliceParts& value) override;
  Status PutUntracked(const SliceParts& key, const SliceParts& value) override {
    return PutUntracked(nullptr, key, value);
  }

  Status MergeUntracked(ColumnFamilyHandle* column_family, const Slice& key,
                        const Slice& value) override;
  Status MergeUntracked(const Slice& key, const Slice& value) override {
    return MergeUntracked(nullptr, key, value);
  }

  Status DeleteUntracked(ColumnFamilyHandle* column_family,
                         const Slice& key) override;
  Status DeleteUntracked(const Slice& key) override {
    return DeleteUntracked(nullptr, key);
  }
  Status DeleteUntracked(ColumnFamilyHandle* column_family,
                         const SliceParts& key) override;
  Status DeleteUntracked(const SliceParts& key) override {
    return DeleteUntracked(nullptr, key);
  }

  Status SingleDeleteUntracked(ColumnFamilyHandle* column_family,
                               const Slice& key) override;
  Status SingleDeleteUntracked(const Slice& key) override {
    return SingleDeleteUntracked(nullptr, key);
  }

  void PutLogData(const Slice& blob) override;

  WriteBatchWithIndex* GetWriteBatch() override;

  virtual void SetLockTimeout(int64_t /*timeout*/) override {
    // Do nothing
  }

  const Snapshot* GetSnapshot() const override {
    // will return nullptr when there is no snapshot
    return snapshot_.get();
  }

  virtual void SetSnapshot() override;
  void SetSnapshotOnNextOperation(
      std::shared_ptr<TransactionNotifier> notifier = nullptr) override;

  void ClearSnapshot() override {
    snapshot_.reset();
    snapshot_needed_ = false;
    snapshot_notifier_ = nullptr;
  }

  void DisableIndexing() override { indexing_enabled_ = false; }

  void EnableIndexing() override { indexing_enabled_ = true; }

Support user-defined timestamps in write-committed txns (#9629)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629
Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are
locked upon first operation that writes the key or has the intention of writing. For example,
`PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`,
`PessimisticTransaction::SingleDelete()` will write to or delete a key, while
`PessimisticTransaction::GetForUpdate()` is used by application to indicate
to RocksDB that the transaction has the intention of performing write operation later
in the same transaction.
Pessimistic transactions support two-phase commit (2PC). A transaction can be
`Prepared()`'ed and then `Commit()`. The prepare phase is similar to a promise: once
`Prepare()` succeeds, the transaction has acquired the necessary resources to commit.
The resources include locks, persistence of WAL, etc.
Write-committed transaction is the default pessimistic transaction implementation. In
RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare
section. `Commit()` will write a commit marker to the WAL and then write data to the
memtables. While writing to the memtables, different keys in the transaction's write batch
will be assigned different sequence numbers in ascending order.
Until commit/rollback, the transaction holds locks on the keys so that no other transaction
can write to the same keys. Furthermore, the keys' sequence numbers represent the order
in which they are committed and should be made visible. This is convenient for us to
implement support for user-defined timestamps.
Since column families with and without timestamps can co-exist in the same database,
a transaction may or may not involve timestamps. Based on this observation, we add two
optional members to each `PessimisticTransaction`, `read_timestamp_` and
`commit_timestamp_`. If no key in the transaction's write batch has timestamp, then
setting these two variables do not have any effect. For the rest of this commit, we discuss
only the cases when these two variables are meaningful.
read_timestamp_ is used mainly for validation, and should be set before first call to
`GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls
`TryLock()` that can verify if another transaction has written the same key since
`read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed
written the same key, then validation fails, and RocksDB allows this transaction to
refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()`
with a different timestamp to read, but the result of the read should not be used to
determine data that will be written later.
commit_timestamp_ must be set after finishing writing and before transaction commit.
This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after
prepare phase succeeds.
We currently require that the commit timestamp be chosen after all keys are locked. This
means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used
by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`,
`PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`,
etc. will return non-ok status because they specify timestamps before locking the keys.
Users are also prompted to use the `Transaction` APIs when they receive the non-ok status.
Reviewed By: ltamasi
Differential Revision: D31822445
fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43
2022-03-09 01:20:59 +01:00
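The ordering rules for `read_timestamp_` and `commit_timestamp_` described in the commit message above can be sketched with a single-threaded toy model. Everything here is an assumption made for illustration: `SketchTxn`, `SketchStore` and the `SketchStatus` values are hypothetical, the commit message only says "non-ok status" without naming a code, and real locking is not modeled at all.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical status codes; the source only distinguishes ok from non-ok.
enum class SketchStatus { kOk, kInvalidArgument, kBusy };

// Toy "database": commit timestamp of the latest write to each key.
using SketchStore = std::map<std::string, uint64_t>;

class SketchTxn {
 public:
  explicit SketchTxn(SketchStore* store) : store_(store) {}

  void SetReadTimestamp(uint64_t ts) {
    read_ts_ = ts;
    has_read_ts_ = true;
  }
  void SetCommitTimestamp(uint64_t ts) {
    commit_ts_ = ts;
    has_commit_ts_ = true;
  }

  // Fails if no read timestamp is set, or if the key was written after it.
  SketchStatus GetForUpdate(const std::string& key) {
    if (!has_read_ts_) return SketchStatus::kInvalidArgument;
    auto it = store_->find(key);
    if (it != store_->end() && it->second > read_ts_) {
      return SketchStatus::kBusy;  // validation failed; caller may raise ts
    }
    pending_.insert(key);
    return SketchStatus::kOk;
  }

  // The commit timestamp must be chosen before committing.
  SketchStatus Commit() {
    if (!has_commit_ts_) return SketchStatus::kInvalidArgument;
    for (const auto& key : pending_) (*store_)[key] = commit_ts_;
    pending_.clear();
    return SketchStatus::kOk;
  }

 private:
  SketchStore* store_;
  std::set<std::string> pending_;
  uint64_t read_ts_ = 0;
  uint64_t commit_ts_ = 0;
  bool has_read_ts_ = false;
  bool has_commit_ts_ = false;
};
```

In this model a transaction that hits a validation failure can call `SetReadTimestamp` again with a larger value and retry, which mirrors the "refine `read_timestamp_` by increasing it" step in the message.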
  bool IndexingEnabled() const { return indexing_enabled_; }

  uint64_t GetElapsedTime() const override;

  uint64_t GetNumPuts() const override;

  uint64_t GetNumDeletes() const override;

  uint64_t GetNumMerges() const override;

  uint64_t GetNumKeys() const override;

  void UndoGetForUpdate(ColumnFamilyHandle* column_family,
                        const Slice& key) override;
  void UndoGetForUpdate(const Slice& key) override {
    return UndoGetForUpdate(nullptr, key);
  }

  WriteOptions* GetWriteOptions() override { return &write_options_; }

  void SetWriteOptions(const WriteOptions& write_options) override {
    write_options_ = write_options;
  }

  // Used for memory management for snapshot_
  void ReleaseSnapshot(const Snapshot* snapshot, DB* db);

  // Iterates over the given batch and makes the appropriate inserts.
  // Used for rebuilding prepared transactions after recovery.
  virtual Status RebuildFromWriteBatch(WriteBatch* src_batch) override;

  WriteBatch* GetCommitTimeWriteBatch() override;

  LockTracker& GetTrackedLocks() { return *tracked_locks_; }

 protected:
  // Add a key to the list of tracked keys.
  //
  // seqno is the earliest seqno at which this key was involved in this
  // transaction.
  // readonly should be set to true if no data was written for this key.
  void TrackKey(uint32_t cfh_id, const std::string& key, SequenceNumber seqno,
                bool readonly, bool exclusive);

  // Called when UndoGetForUpdate determines that this key can be unlocked.
  virtual void UnlockGetForUpdate(ColumnFamilyHandle* column_family,
                                  const Slice& key) = 0;

  // Sets a snapshot if SetSnapshotOnNextOperation() has been called.
  void SetSnapshotIfNeeded();

  // Initialize write_batch_ for 2PC by inserting Noop.
  inline void InitWriteBatch(bool clear = false) {
    if (clear) {
      write_batch_.Clear();
    }
    assert(write_batch_.GetDataSize() == WriteBatchInternal::kHeader);
    auto s = WriteBatchInternal::InsertNoop(write_batch_.GetWriteBatch());
    assert(s.ok());
  }

  WriteBatchBase* GetBatchForWrite();

  DB* db_;
  DBImpl* dbimpl_;

  WriteOptions write_options_;

  const Comparator* cmp_;

  const LockTrackerFactory& lock_tracker_factory_;

  // Stores the time the txn was constructed, in microseconds.
  uint64_t start_time_;

  // Stores the current snapshot that was set by SetSnapshot or null if
  // no snapshot is currently set.
  std::shared_ptr<const Snapshot> snapshot_;

  // Count of various operations pending in this transaction
  uint64_t num_puts_ = 0;
  uint64_t num_deletes_ = 0;
  uint64_t num_merges_ = 0;

  struct SavePoint {
    std::shared_ptr<const Snapshot> snapshot_;
    bool snapshot_needed_ = false;
    std::shared_ptr<TransactionNotifier> snapshot_notifier_;
    uint64_t num_puts_ = 0;
    uint64_t num_deletes_ = 0;
    uint64_t num_merges_ = 0;

    // Record all locks tracked since the last savepoint
    std::shared_ptr<LockTracker> new_locks_;

    SavePoint(std::shared_ptr<const Snapshot> snapshot, bool snapshot_needed,
              std::shared_ptr<TransactionNotifier> snapshot_notifier,
              uint64_t num_puts, uint64_t num_deletes, uint64_t num_merges,
              const LockTrackerFactory& lock_tracker_factory)
        : snapshot_(snapshot),
          snapshot_needed_(snapshot_needed),
          snapshot_notifier_(snapshot_notifier),
          num_puts_(num_puts),
          num_deletes_(num_deletes),
num_merges_(num_merges),
|
2020-10-19 19:12:53 +02:00
|
|
|
new_locks_(lock_tracker_factory.Create()) {}
|
2019-07-26 20:31:46 +02:00
|
|
|
|
2020-10-19 19:12:53 +02:00
|
|
|
explicit SavePoint(const LockTrackerFactory& lock_tracker_factory)
|
|
|
|
: new_locks_(lock_tracker_factory.Create()) {}
|
2015-08-25 04:13:18 +02:00
|
|
|
};

  // Records writes pending in this transaction
  WriteBatchWithIndex write_batch_;

  // For Pessimistic Transactions this is the set of acquired locks.
  // Optimistic Transactions only keep note of the requested locks (not
  // actually locked) and defer conflict checking until commit time, based on
  // the tracked lock requests.
  std::unique_ptr<LockTracker> tracked_locks_;

  // Stack of the Snapshot saved at each save point. Saved snapshots may be
  // nullptr if there was no snapshot at the time SetSavePoint() was called.
  std::unique_ptr<std::stack<TransactionBaseImpl::SavePoint,
                             autovector<TransactionBaseImpl::SavePoint>>>
      save_points_;

 private:
  friend class WritePreparedTxn;
  // Extra data to be persisted with the commit. Note this is only used when
  // prepare phase is not skipped.
  WriteBatch commit_time_batch_;

  // If true, future Put/Merge/Deletes will be indexed in the
  // WriteBatchWithIndex.
  // If false, future Put/Merge/Deletes will be inserted directly into the
  // underlying WriteBatch and not indexed in the WriteBatchWithIndex.
  bool indexing_enabled_;

  // SetSnapshotOnNextOperation() has been called and the snapshot has not yet
  // been reset.
  bool snapshot_needed_ = false;

  // SetSnapshotOnNextOperation() has been called and the caller would like
  // a notification through the TransactionNotifier interface
  std::shared_ptr<TransactionNotifier> snapshot_notifier_ = nullptr;

  Status TryLock(ColumnFamilyHandle* column_family, const SliceParts& key,
                 bool read_only, bool exclusive, const bool do_validate = true,
                 const bool assume_tracked = false);

  void SetSnapshotInternal(const Snapshot* snapshot);
};

}  // namespace ROCKSDB_NAMESPACE

#endif  // ROCKSDB_LITE
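
// Illustrative sketch only, not part of RocksDB. It shows one way the
// per-savepoint tracker (new_locks_) and the transaction-wide tracker
// (tracked_locks_) described above might cooperate: each lock is recorded in
// both, so rolling back to a savepoint can subtract exactly the locks taken
// since that savepoint. ToyLockTracker, ToySavePoint and ToyTransaction are
// hypothetical stand-ins, and Track/Subtract/Merge are assumed operations for
// this sketch; the real interface lives in
// utilities/transactions/lock/lock_tracker.h. std::vector replaces autovector
// as the stack's underlying container.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stack>
#include <string>
#include <vector>

// Hypothetical point-lock tracker: counts how often each key was locked.
class ToyLockTracker {
 public:
  void Track(const std::string& key) { ++counts_[key]; }

  // Remove the lock counts recorded in `other`; drop keys that reach zero.
  void Subtract(const ToyLockTracker& other) {
    for (const auto& kv : other.counts_) {
      auto it = counts_.find(kv.first);
      if (it == counts_.end()) continue;
      it->second -= kv.second;
      if (it->second <= 0) counts_.erase(it);
    }
  }

  // Add the lock counts recorded in `other` to this tracker.
  void Merge(const ToyLockTracker& other) {
    for (const auto& kv : other.counts_) counts_[kv.first] += kv.second;
  }

  bool IsTracked(const std::string& key) const {
    return counts_.count(key) > 0;
  }

 private:
  std::map<std::string, int> counts_;
};

// Hypothetical savepoint: only the "locks since this savepoint" part.
struct ToySavePoint {
  std::shared_ptr<ToyLockTracker> new_locks =
      std::make_shared<ToyLockTracker>();
};

class ToyTransaction {
 public:
  void SetSavePoint() { save_points_.emplace(); }

  // Record the lock transaction-wide and in the innermost savepoint, if any.
  void Lock(const std::string& key) {
    tracked_locks_.Track(key);
    if (!save_points_.empty()) save_points_.top().new_locks->Track(key);
  }

  // Forget the locks taken since the innermost savepoint, then drop it.
  bool RollbackToSavePoint() {
    if (save_points_.empty()) return false;
    tracked_locks_.Subtract(*save_points_.top().new_locks);
    save_points_.pop();
    return true;
  }

  // Drop the innermost savepoint but keep its locks; they now belong to the
  // enclosing savepoint (assumed behaviour for this sketch).
  bool PopSavePoint() {
    if (save_points_.empty()) return false;
    std::shared_ptr<ToyLockTracker> popped = save_points_.top().new_locks;
    save_points_.pop();
    if (!save_points_.empty()) save_points_.top().new_locks->Merge(*popped);
    return true;
  }

  bool IsLocked(const std::string& key) const {
    return tracked_locks_.IsTracked(key);
  }

 private:
  ToyLockTracker tracked_locks_;
  // std::stack is a container adapter; the second template argument selects
  // the underlying container, as the real header does with autovector.
  std::stack<ToySavePoint, std::vector<ToySavePoint>> save_points_;
};

// Usage: a key locked before the savepoint survives the rollback, while a key
// first locked after the savepoint does not.
inline bool DemoRollbackReleasesOnlyNewLocks() {
  ToyTransaction txn;
  txn.Lock("a");
  txn.SetSavePoint();
  txn.Lock("a");
  txn.Lock("b");
  txn.RollbackToSavePoint();
  return txn.IsLocked("a") && !txn.IsLocked("b");
}
```

// Counting locks per key, rather than keeping a plain set, is what lets the
// rollback leave "a" tracked: the savepoint subtracts only its own
// acquisition, not the earlier one.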