Compare commits

...

20 Commits

Author SHA1 Message Date
Peter Dillinger
b705f808e4 Update version and HISTORY for 6.16.5 2021-09-08 12:34:59 -07:00
Levi Tamasi
cc4c9e80e4 Fix the sorting of KeyContexts for batched MultiGet (#8633)
Summary:
`CompareKeyContext::operator()` on the trunk has a bug: when comparing
column family IDs, `lhs` is used for both sides of the comparison. This
results in the `KeyContext`s getting sorted solely based on key, which
in turn means that keys with the same column family do not necessarily
form a single range in the sorted list. This violates an assumption of the
batched `MultiGet` logic, leading to the same column family
showing up multiple times in the list of `MultiGetColumnFamilyData`.
The end result is the code attempting to check out the thread-local
`SuperVersion` for the same CF multiple times, causing an
assertion violation in debug builds and memory corruption/crash in
release builds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D30169182

Pulled By: ltamasi

fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023
2021-09-08 12:25:27 -07:00
Jay Zhuang
9d95050ca7 Bump version and update HISTORY.md for 6.16.4 2021-03-30 15:54:11 -07:00
Adam Retter
9857d45f37 Fix compilation against musl lib C (#7875)
Summary:
See https://github.com/percona/PerconaFT/pull/450

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7875

Reviewed By: ajkr

Differential Revision: D25938020

Pulled By: jay-zhuang

fbshipit-source-id: 9014dbc7b23bf92c5e63bfbdda4565bb0d2f2b58
2021-03-30 13:48:39 -07:00
Adam Retter
5884cf5d5b range_tree requires GNU libc on ppc64 (#8070)
Summary:
If the platform is ppc64 and the libc is not GNU libc, then we exclude the range_tree from compilation.

See https://jira.percona.com/browse/PS-7559

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8070

Reviewed By: jay-zhuang

Differential Revision: D27246004

Pulled By: mrambacher

fbshipit-source-id: 59d8433242ce7ce608988341becb4f83312445f5
2021-03-29 20:28:50 -07:00
Andrew Kryczka
e32a64aa54 bump version and update HISTORY.md for 6.16.3 2021-02-05 17:09:36 -08:00
Andrew Kryczka
031368f67a Allow range deletions in *TransactionDB only when safe (#7929)
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.

Fixes https://github.com/facebook/rocksdb/issues/7913.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929

Test Plan: unit tests covering the cases it's permitted/rejected

Reviewed By: ltamasi

Differential Revision: D26240254

Pulled By: ajkr

fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
2021-02-05 17:08:28 -08:00
Andrew Kryczka
5105db8d7b update HISTORY.md and bump version for 6.16.2 2021-01-21 12:36:23 -08:00
Andrew Kryczka
ba7c46a3d8 workaround race conditions during PeriodicWorkScheduler registration (#7888)
Summary:
This provides a workaround for two race conditions that will be fixed in
a more sophisticated way later. This PR:

(1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe.
(2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888

Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881

Reviewed By: jay-zhuang

Differential Revision: D25990891

Pulled By: ajkr

fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38
2021-01-21 12:35:30 -08:00
Cheng Chang
be40c99dda bump to 6.16.1 2021-01-20 08:40:10 -08:00
Cheng Chang
89dd231f6a Update HISTORY.md 2021-01-20 08:38:18 -08:00
Cheng Chang
e931bbfec0 Make it able to ignore WAL related VersionEdits in older versions (#7873)
Summary:
Although the tags for `WalAddition`, `WalDeletion` are after `kTagSafeIgnoreMask`, to actually be able to skip these entries in older versions of RocksDB, we require that they are encoded with their encoded size as the prefix. This requirement is not met in the current codebase, so a downgraded DB may fail to open if these entries exist in the MANIFEST.

If a DB wants to downgrade, and its MANIFEST contains `WalAddition` or `WalDeletion`, it can set `track_and_verify_wals_in_manifest` to `false`, then restart twice, then downgrade. On the first restart, a new MANIFEST will be created with a `WalDeletion` indicating that all previously tracked WALs are removed from MANIFEST. On the second restart, since there is  no tracked WALs in MANIFEST now, a new MANIFEST will be created with neither `WalAddition` nor `WalDeletion`. Then the DB can downgrade.

Tags for `BlobFileAddition`, `BlobFileGarbage` also have the same problem, but this PR focuses on solving the problem for WAL edits.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7873

Test Plan: Added a `VersionEditTest::IgnorableTags` unit test to verify all entries with tags larger than `kTagSafeIgnoreMask` can actually be skipped and won't affect parsing of other entries.

Reviewed By: ajkr

Differential Revision: D25935930

Pulled By: cheng-chang

fbshipit-source-id: 7a02fdba4311d6084328c14aed110a26d08c3efb
2021-01-20 08:34:52 -08:00
Cheng Chang
67f8189e68 Update HISTORY.md 2021-01-20 08:34:06 -08:00
Cheng Chang
0a91c691a9 Add note for PR 7789 in history (#7855)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7855

Reviewed By: ajkr

Differential Revision: D25872797

Pulled By: cheng-chang

fbshipit-source-id: 82159a13f897aaaad5f3c70c7dfa822e073bc623
2021-01-11 13:55:25 -08:00
Jay Zhuang
dff9219df6 Fix checkpoint_test hang (#7849)
Summary:
`CheckpointTest.CurrentFileModifiedWhileCheckpointing` could hang
because now create checkpoint triggers flush twice. The test should wait
both flush done.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7849

Test Plan: `gtest-parallel ./checkpoint_test --gtest_filter=CheckpointTest.CurrentFileModifiedWhileCheckpointing -r 100`

Reviewed By: ajkr

Differential Revision: D25860713

Pulled By: jay-zhuang

fbshipit-source-id: e1c2f23037dedc33e205519f4289a25e77816b41
2021-01-09 13:29:00 -08:00
Cheng Chang
eeea27a048 Get manifest size again after getting min_log_num during checkpoint (#7836)
Summary:
Currently, manifest size is determined before getting min_log_num.

But between getting manifest size and getting min_log_num, concurrently, a flush might succeed, which will write new records to manifest to make some WALs become outdated, then min_log_num will be correspondingly increased, but the new records in manifest will not be copied into the checkpoint because the manifest's size is determined before them, then the newly outdated WALs will still exist in the checkpoint's manifest, but they are not linked/copied to the checkpoint because their log number is < min_log_num, so a corruption of missing WAL will be reported when restoring from the checkpoint.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7836

Test Plan: make crash_test

Reviewed By: ajkr

Differential Revision: D25788204

Pulled By: cheng-chang

fbshipit-source-id: a4e5acf30f08270b3c0a95304ff559a9e655252f
2021-01-08 10:16:03 -08:00
Zhichao Cao
2b64cddf99 Treat File Scope Write IO Error the same as Retryable IO Error (#7840)
Summary:
In RocksDB, when IO error happens, the flags of IOStatus can be set. If the IOStatus is set as "File Scope IO Error", it indicate that the error is constrained in the file level. Since RocksDB does not continues write data to a file when any IO Error happens, File Scope IO Error can be treated the same as Retryable IO Error. Adding the logic to ErrorHandler::SetBGError to include the file scope IO Error in its error handling logic, which is the same as retryable IO Error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7840

Test Plan: added new unit tests in error_handler_fs_test. make check

Reviewed By: anand1976

Differential Revision: D25820481

Pulled By: zhichao-cao

fbshipit-source-id: 69cabd3d010073e064d6142ce1cabf341b8a6806
2021-01-08 10:15:54 -08:00
Adam Retter
79a21d67cb Attempt to fix build errors around missing compression library includes (#7803)
Summary:
This fixes an issue introduced in https://github.com/facebook/rocksdb/pull/7769 that caused many errors about missing compression libraries to be displayed during compilation, although compilation actually succeeded. This PR fixes the compilation so the compression libraries are only introduced where strictly needed.

It likely needs to be merged into the same branches as https://github.com/facebook/rocksdb/pull/7769 which I think are:
1. master
2. 6.15.fb
3. 6.16.fb

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7803

Reviewed By: ramvadiv

Differential Revision: D25733743

Pulled By: pdillinger

fbshipit-source-id: 6c04f6864b2ff4a345841d791a89b19e0e3f5bf7
2021-01-04 11:49:56 -08:00
cheng-chang
2eaad9e1b1 Skip WALs according to MinLogNumberToKeep when creating checkpoint (#7789)
Summary:
In a stress test failure, we observe that a WAL is skipped when creating checkpoint, although its log number >= MinLogNumberToKeep(). This might happen in the following case:

1. when creating the checkpoint, there are 2 column families: CF0 and CF1, and there are 2 WALs: 1, 2;
2. CF0's log number is 1, CF0's active memtable is empty, CF1's log number is 2, CF1's active memtable is not empty, WAL 2 is not empty, the sequence number points to WAL 2;
2. the checkpoint process flushes CF0, since CF0' active memtable is empty, there is no need to SwitchMemtable, thus no new WAL will be created, so CF0's log number is now 2, concurrently, some data is written to CF0 and WAL 2;
3. the checkpoint process flushes CF1, WAL 3 is created and CF1's log number is now 3, CF0's log number is still 2 because CF0 is not empty and WAL 2 contains its unflushed data concurrently written in step 2;
4.  the checkpoint process determines that WAL 1 and 2 are no longer needed according to [live_wal_files[i]->StartSequence() >= *sequence_number](https://github.com/facebook/rocksdb/blob/master/utilities/checkpoint/checkpoint_impl.cc#L388), so it skips linking them to the checkpoint directory;
5. but according to `MinLogNumberToKeep()`, WAL 2 still needs to be kept because CF0's log number is 2.

If the checkpoint is reopened in read-only mode, and only read from the snapshot with the initial sequence number, then there will be no data loss or data inconsistency.

But if the checkpoint is reopened and read from the most recent sequence number, suppose in step 3, there are also data concurrently written to CF1 and WAL 3, then the most recent sequence number refers to the latest entry in WAL 3, so the data written in step 2 should also be visible, but since WAL 2 is discarded, those data are lost.

When tracking WAL in MANIFEST is enabled, when reopening the checkpoint, since WAL 2 is still tracked in MANIFEST as alive, but it's missing from the checkpoint directory, a corruption will be reported.

This PR makes the checkpoint process to only skip a WAL if its log number < `MinLogNumberToKeep`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7789

Test Plan: watch existing tests to pass.

Reviewed By: ajkr

Differential Revision: D25662346

Pulled By: cheng-chang

fbshipit-source-id: 136471095baa01886cf44809455cf855f24857a0
2020-12-23 14:22:45 -08:00
Sergei Petrunia
2a2d4d0f37 Apply the changes from: PS-5501 : Re-license PerconaFT 'locktree' to Apache V2 (#7801)
Summary:
commit d5178f513c0b4144a5ac9358ec0f6a3b54a28e76
Author: George O. Lorch III <george.lorch@percona.com>
Date:   Tue Mar 19 12:18:40 2019 -0700

    PS-5501 : Re-license PerconaFT 'locktree' to Apache V2

    - Fixed some incomplete relicensed files from previous round.

    - Added missing license text to some.

    - Relicensed more files to Apache V2 that locktree depends on.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7801

Reviewed By: jay-zhuang

Differential Revision: D25682430

Pulled By: cheng-chang

fbshipit-source-id: deb8a0de3e76f3638672997bfbd300e2fffbe5f5
2020-12-23 14:20:51 -08:00
57 changed files with 849 additions and 115 deletions

View File

@ -1,7 +1,29 @@
# Rocksdb Change Log
## 6.16.5 (09/08/2021)
### Bug Fixes
* Fixed a bug affecting the batched `MultiGet` API when used with keys spanning multiple column families and `sorted_input == false`.
## 6.16.4 (03/30/2021)
### Bug Fixes
* Fix build on ppc64 and musl build.
## 6.16.3 (02/05/2021)
### Bug Fixes
* Since 6.15.0, `TransactionDB` returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees. There are certain cases where range deletion can still be used on such DBs; see the API doc on `TransactionDB::DeleteRange()` for details.
* `OptimisticTransactionDB` now returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees.
## 6.16.2 (1/21/2021)
### Bug Fixes
* Fix a race condition between DB startups and shutdowns in managing the periodic background worker threads. One effect of this race condition could be the process being terminated.
## 6.16.1 (1/20/2021)
### Bug Fixes
* Version older than 6.15 cannot decode VersionEdits `WalAddition` and `WalDeletion`, fixed this by changing the encoded format of them to be ignorable by older versions.
## 6.16.0 (12/18/2020)
### Behavior Changes
* Attempting to write a merge operand without explicitly configuring `merge_operator` now fails immediately, causing the DB to enter read-only mode. Previously, failure was deferred until the `merge_operator` was needed by a user read or a background operation.
* Since RocksDB does not continue write the same file if a file write fails for any reason, the file scope write IO error is treated the same as retryable IO error. More information about error handling of file scope IO error is included in `ErrorHandler::SetBGError`.
### Bug Fixes
* Truncated WALs ending in incomplete records can no longer produce gaps in the recovered data when `WALRecoveryMode::kPointInTimeRecovery` is used. Gaps are still possible when WALs are truncated exactly on record boundaries; for complete protection, users should enable `track_and_verify_wals_in_manifest`.
@ -10,6 +32,7 @@
* Fixed the logic of populating native data structure for `read_amp_bytes_per_bit` during OPTIONS file parsing on big-endian architecture. Without this fix, original code introduced in PR7659, when running on big-endian machine, can mistakenly store read_amp_bytes_per_bit (an uint32) in little endian format. Future access to `read_amp_bytes_per_bit` will give wrong values. Little endian architecture is not affected.
* Fixed prefix extractor with timestamp issues.
* Fixed a bug in atomic flush: in two-phase commit mode, the minimum WAL log number to keep is incorrect.
* Fixed a bug related to checkpoint in PR7789: if there are multiple column families, and the checkpoint is not opened as read only, then in rare cases, data loss may happen in the checkpoint. Since backup engine relies on checkpoint, it may also be affected.
### New Features
* User defined timestamp feature supports `CompactRange` and `GetApproximateSizes`.
@ -18,6 +41,7 @@
### Public API Change
* Deprecated public but rarely-used FilterBitsBuilder::CalculateNumEntry, which is replaced with ApproximateNumEntries taking a size_t parameter and returning size_t.
* Added a new option `track_and_verify_wals_in_manifest`. If `true`, the log numbers and sizes of the synced WALs are tracked in MANIFEST, then during DB recovery, if a synced WAL is missing from disk, or the WAL's size does not match the recorded size in MANIFEST, an error will be reported and the recovery will be aborted. Note that this option does not work with secondary instance.
## 6.15.0 (11/13/2020)
### Bug Fixes
@ -39,7 +63,6 @@
### Public API Change
* Deprecate `BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`. These options still take effect until users migrate to the replacement APIs in `BlockBasedTableOptions::metadata_cache_options`. Migration guidance can be found in the API comments on the deprecated options.
* Add new API `DB::VerifyFileChecksums` to verify SST file checksum with corresponding entries in the MANIFEST if present. Current implementation requires scanning and recomputing file checksums.
* Added a new option `track_and_verify_wals_in_manifest`. If `true`, the log numbers and sizes of the synced WALs are tracked in MANIFEST, then during DB recovery, if a synced WAL is missing from disk, or the WAL's size does not match the recorded size in MANIFEST, an error will be reported and the recovery will be aborted. Note that this option does not work with secondary instance.
### Behavior Changes
* The dictionary compression settings specified in `ColumnFamilyOptions::compression_opts` now additionally affect files generated by flush and compaction to non-bottommost level. Previously those settings at most affected files generated by compaction to bottommost level, depending on whether `ColumnFamilyOptions::bottommost_compression_opts` overrode them. Users who relied on dictionary compression settings in `ColumnFamilyOptions::compression_opts` affecting only the bottommost level can keep the behavior by moving their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts` and setting its `enabled` flag.

View File

@ -510,6 +510,12 @@ ifeq ($(USE_FOLLY_DISTRIBUTED_MUTEX),1)
LIB_OBJECTS += $(patsubst %.cpp, $(OBJ_DIR)/%.o, $(FOLLY_SOURCES))
endif
# range_tree is not compatible with non GNU libc on ppc64
# see https://jira.percona.com/browse/PS-7559
ifneq ($(PPC_LIBC_IS_GNU),0)
LIB_OBJECTS += $(patsubst %.cc, $(OBJ_DIR)/%.o, $(RANGE_TREE_SOURCES))
endif
GTEST = $(OBJ_DIR)/$(GTEST_DIR)/gtest/gtest-all.o
TESTUTIL = $(OBJ_DIR)/test_util/testutil.o
TESTHARNESS = $(OBJ_DIR)/test_util/testharness.o $(TESTUTIL) $(GTEST)
@ -2206,7 +2212,8 @@ endif
JAVA_STATIC_FLAGS = -DZLIB -DBZIP2 -DSNAPPY -DLZ4 -DZSTD
JAVA_STATIC_INCLUDES = -I./zlib-$(ZLIB_VER) -I./bzip2-$(BZIP2_VER) -I./snappy-$(SNAPPY_VER) -I./snappy-$(SNAPPY_VER)/build -I./lz4-$(LZ4_VER)/lib -I./zstd-$(ZSTD_VER)/lib -I./zstd-$(ZSTD_VER)/lib/dictBuilder
ifneq ($(findstring rocksdbjavastatic, $(MAKECMDGOALS)),)
ifneq ($(findstring rocksdbjavastatic, $(filter-out rocksdbjavastatic_deps, $(MAKECMDGOALS))),)
CXXFLAGS += $(JAVA_STATIC_FLAGS) $(JAVA_STATIC_INCLUDES)
CFLAGS += $(JAVA_STATIC_FLAGS) $(JAVA_STATIC_INCLUDES)
endif
@ -2216,8 +2223,11 @@ ifeq ($(JAVA_HOME),)
endif
$(MAKE) rocksdbjavastatic_deps
$(MAKE) rocksdbjavastatic_libobjects
cd java;$(MAKE) javalib;
rm -f ./java/target/$(ROCKSDBJNILIB)
$(MAKE) rocksdbjavastatic_javalib
rocksdbjavastatic_javalib:
cd java;$(MAKE) javalib
rm -f java/target/$(ROCKSDBJNILIB)
$(CXX) $(CXXFLAGS) -I./java/. $(JAVA_INCLUDE) -shared -fPIC \
-o ./java/target/$(ROCKSDBJNILIB) $(JNI_NATIVE_SOURCES) \
$(LIB_OBJECTS) $(COVERAGEFLAGS) \
@ -2440,6 +2450,8 @@ ifneq ($(MAKECMDGOALS),clean)
ifneq ($(MAKECMDGOALS),format)
ifneq ($(MAKECMDGOALS),jclean)
ifneq ($(MAKECMDGOALS),jtest)
ifneq ($(MAKECMDGOALS),rocksdbjavastatic)
ifneq ($(MAKECMDGOALS),rocksdbjavastatic_deps)
ifneq ($(MAKECMDGOALS),package)
ifneq ($(MAKECMDGOALS),analyze)
-include $(DEPFILES)
@ -2449,3 +2461,5 @@ endif
endif
endif
endif
endif
endif

View File

@ -135,11 +135,15 @@ def generate_targets(repo_path, deps_map):
TARGETS.add_library(
"rocksdb_lib",
src_mk["LIB_SOURCES"] +
# always add range_tree, it's only excluded on ppc64, which we don't use internally
src_mk["RANGE_TREE_SOURCES"] +
src_mk["TOOL_LIB_SOURCES"])
# rocksdb_whole_archive_lib
TARGETS.add_library(
"rocksdb_whole_archive_lib",
src_mk["LIB_SOURCES"] +
# always add range_tree, it's only excluded on ppc64, which we don't use internally
src_mk["RANGE_TREE_SOURCES"] +
src_mk["TOOL_LIB_SOURCES"],
deps=None,
headers=None,

View File

@ -663,6 +663,23 @@ else
fi
fi
if test -n "`echo $TARGET_ARCHITECTURE | grep ^ppc64`"; then
# check for GNU libc on ppc64
$CXX -x c++ - -o /dev/null 2>/dev/null <<EOF
#include <stdio.h>
#include <stdlib.h>
#include <gnu/libc-version.h>
int main(int argc, char *argv[]) {
printf("GNU libc version: %s\n", gnu_get_libc_version());
return 0;
}
EOF
if [ "$?" != 0 ]; then
PPC_LIBC_IS_GNU=0
fi
fi
if test "$TRY_SSE_ETC"; then
# The USE_SSE flag now means "attempt to compile with widely-available
# Intel architecture extensions utilized by specific optimizations in the
@ -856,3 +873,6 @@ echo "LUA_PATH=$LUA_PATH" >> "$OUTPUT"
if test -n "$USE_FOLLY_DISTRIBUTED_MUTEX"; then
echo "USE_FOLLY_DISTRIBUTED_MUTEX=$USE_FOLLY_DISTRIBUTED_MUTEX" >> "$OUTPUT"
fi
if test -n "$PPC_LIBC_IS_GNU"; then
echo "PPC_LIBC_IS_GNU=$PPC_LIBC_IS_GNU" >> "$OUTPUT"
fi

View File

@ -1361,6 +1361,28 @@ TEST_P(DBMultiGetTestWithParam, MultiGetMultiCFSnapshot) {
}
}
TEST_P(DBMultiGetTestWithParam, MultiGetMultiCFUnsorted) {
Options options = CurrentOptions();
CreateAndReopenWithCF({"one", "two"}, options);
ASSERT_OK(Put(1, "foo", "bar"));
ASSERT_OK(Put(2, "baz", "xyz"));
ASSERT_OK(Put(1, "abc", "def"));
// Note: keys for the same CF do not form a consecutive range
std::vector<int> cfs{1, 2, 1};
std::vector<std::string> keys{"foo", "baz", "abc"};
std::vector<std::string> values;
values =
MultiGet(cfs, keys, /* snapshot */ nullptr, /* batched */ GetParam());
ASSERT_EQ(values.size(), 3);
ASSERT_EQ(values[0], "bar");
ASSERT_EQ(values[1], "xyz");
ASSERT_EQ(values[2], "def");
}
INSTANTIATE_TEST_CASE_P(DBMultiGetTestWithParam, DBMultiGetTestWithParam,
testing::Bool());

View File

@ -2164,20 +2164,18 @@ void DBImpl::MultiGet(const ReadOptions& read_options, const size_t num_keys,
multiget_cf_data;
size_t cf_start = 0;
ColumnFamilyHandle* cf = sorted_keys[0]->column_family;
for (size_t i = 0; i < num_keys; ++i) {
KeyContext* key_ctx = sorted_keys[i];
if (key_ctx->column_family != cf) {
multiget_cf_data.emplace_back(
MultiGetColumnFamilyData(cf, cf_start, i - cf_start, nullptr));
multiget_cf_data.emplace_back(cf, cf_start, i - cf_start, nullptr);
cf_start = i;
cf = key_ctx->column_family;
}
}
{
// multiget_cf_data.emplace_back(
// MultiGetColumnFamilyData(cf, cf_start, num_keys - cf_start, nullptr));
multiget_cf_data.emplace_back(cf, cf_start, num_keys - cf_start, nullptr);
}
std::function<MultiGetColumnFamilyData*(
autovector<MultiGetColumnFamilyData,
MultiGetContext::MAX_BATCH_SIZE>::iterator&)>
@ -2237,7 +2235,7 @@ struct CompareKeyContext {
static_cast<ColumnFamilyHandleImpl*>(lhs->column_family);
uint32_t cfd_id1 = cfh->cfd()->GetID();
const Comparator* comparator = cfh->cfd()->user_comparator();
cfh = static_cast<ColumnFamilyHandleImpl*>(lhs->column_family);
cfh = static_cast<ColumnFamilyHandleImpl*>(rhs->column_family);
uint32_t cfd_id2 = cfh->cfd()->GetID();
if (cfd_id1 < cfd_id2) {
@ -2261,39 +2259,24 @@ struct CompareKeyContext {
void DBImpl::PrepareMultiGetKeys(
size_t num_keys, bool sorted_input,
autovector<KeyContext*, MultiGetContext::MAX_BATCH_SIZE>* sorted_keys) {
#ifndef NDEBUG
if (sorted_input) {
for (size_t index = 0; index < sorted_keys->size(); ++index) {
if (index > 0) {
KeyContext* lhs = (*sorted_keys)[index - 1];
KeyContext* rhs = (*sorted_keys)[index];
ColumnFamilyHandleImpl* cfh =
static_cast_with_check<ColumnFamilyHandleImpl>(lhs->column_family);
uint32_t cfd_id1 = cfh->cfd()->GetID();
const Comparator* comparator = cfh->cfd()->user_comparator();
cfh =
static_cast_with_check<ColumnFamilyHandleImpl>(lhs->column_family);
uint32_t cfd_id2 = cfh->cfd()->GetID();
#ifndef NDEBUG
CompareKeyContext key_context_less;
assert(cfd_id1 <= cfd_id2);
if (cfd_id1 < cfd_id2) {
continue;
}
for (size_t index = 1; index < sorted_keys->size(); ++index) {
const KeyContext* const lhs = (*sorted_keys)[index - 1];
const KeyContext* const rhs = (*sorted_keys)[index];
// Both keys are from the same column family
int cmp = comparator->CompareWithoutTimestamp(
*(lhs->key), /*a_has_ts=*/false, *(rhs->key), /*b_has_ts=*/false);
assert(cmp <= 0);
}
index++;
}
// lhs should be <= rhs, or in other words, rhs should NOT be < lhs
assert(!key_context_less(rhs, lhs));
}
#endif
if (!sorted_input) {
CompareKeyContext sort_comparator;
std::sort(sorted_keys->begin(), sorted_keys->begin() + num_keys,
sort_comparator);
return;
}
std::sort(sorted_keys->begin(), sorted_keys->begin() + num_keys,
CompareKeyContext());
}
void DBImpl::MultiGet(const ReadOptions& read_options,

View File

@ -350,12 +350,17 @@ const Status& ErrorHandler::SetBGError(const Status& bg_err,
// This is the main function for looking at IO related error during the
// background operations. The main logic is:
// 1) File scope IO error is treated as retryable IO error in the write
// path. In RocksDB, If a file has write IO error and it is at file scope,
// RocksDB never write to the same file again. RocksDB will create a new
// file and rewrite the whole content. Thus, it is retryable.
// 1) if the error is caused by data loss, the error is mapped to
// unrecoverable error. Application/user must take action to handle
// this situation.
// 2) if the error is a Retryable IO error, auto resume will be called and the
// auto resume can be controlled by resume count and resume interval
// options. There are three sub-cases:
// this situation (File scope case is excluded).
// 2) if the error is a Retryable IO error (i.e., it is a file scope IO error,
// or its retryable flag is set and not a data loss error), auto resume
// will be called and the auto resume can be controlled by resume count
// and resume interval options. There are three sub-cases:
// a) if the error happens during compaction, it is mapped to a soft error.
// the compaction thread will reschedule a new compaction.
// b) if the error happens during flush and also WAL is empty, it is mapped
@ -384,9 +389,10 @@ const Status& ErrorHandler::SetBGError(const IOStatus& bg_io_err,
Status new_bg_io_err = bg_io_err;
DBRecoverContext context;
if (bg_io_err.GetDataLoss()) {
// First, data loss is treated as unrecoverable error. So it can directly
// overwrite any existing bg_error_.
if (bg_io_err.GetScope() != IOStatus::IOErrorScope::kIOErrorScopeFile &&
bg_io_err.GetDataLoss()) {
// First, data loss (non file scope) is treated as unrecoverable error. So
// it can directly overwrite any existing bg_error_.
bool auto_recovery = false;
Status bg_err(new_bg_io_err, Status::Severity::kUnrecoverableError);
bg_error_ = bg_err;
@ -397,13 +403,15 @@ const Status& ErrorHandler::SetBGError(const IOStatus& bg_io_err,
&bg_err, db_mutex_, &auto_recovery);
recover_context_ = context;
return bg_error_;
} else if (bg_io_err.GetRetryable()) {
// Second, check if the error is a retryable IO error or not. if it is
// retryable error and its severity is higher than bg_error_, overwrite
// the bg_error_ with new error.
// In current stage, for retryable IO error of compaction, treat it as
// soft error. In other cases, treat the retryable IO error as hard
// error.
} else if (bg_io_err.GetScope() ==
IOStatus::IOErrorScope::kIOErrorScopeFile ||
bg_io_err.GetRetryable()) {
// Second, check if the error is a retryable IO error (file scope IO error
// is also treated as retryable IO error in RocksDB write path). if it is
// retryable error and its severity is higher than bg_error_, overwrite the
// bg_error_ with new error. In current stage, for retryable IO error of
// compaction, treat it as soft error. In other cases, treat the retryable
// IO error as hard error.
bool auto_recovery = false;
EventHelpers::NotifyOnBackgroundError(db_options_.listeners, reason,
&new_bg_io_err, db_mutex_,

View File

@ -241,6 +241,90 @@ TEST_F(DBErrorHandlingFSTest, FLushWritRetryableError) {
Destroy(options);
}
TEST_F(DBErrorHandlingFSTest, FLushWritFileScopeError) {
std::shared_ptr<ErrorHandlerFSListener> listener(
new ErrorHandlerFSListener());
Options options = GetDefaultOptions();
options.env = fault_env_.get();
options.create_if_missing = true;
options.listeners.emplace_back(listener);
options.max_bgerror_resume_count = 0;
Status s;
listener->EnableAutoRecovery(false);
DestroyAndReopen(options);
IOStatus error_msg = IOStatus::IOError("File Scope Data Loss Error");
error_msg.SetDataLoss(true);
error_msg.SetScope(
ROCKSDB_NAMESPACE::IOStatus::IOErrorScope::kIOErrorScopeFile);
error_msg.SetRetryable(false);
ASSERT_OK(Put(Key(1), "val1"));
SyncPoint::GetInstance()->SetCallBack(
"BuildTable:BeforeFinishBuildTable",
[&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
SyncPoint::GetInstance()->EnableProcessing();
s = Flush();
ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
SyncPoint::GetInstance()->DisableProcessing();
fault_fs_->SetFilesystemActive(true);
s = dbfull()->Resume();
ASSERT_OK(s);
Reopen(options);
ASSERT_EQ("val1", Get(Key(1)));
ASSERT_OK(Put(Key(2), "val2"));
SyncPoint::GetInstance()->SetCallBack(
"BuildTable:BeforeSyncTable",
[&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
SyncPoint::GetInstance()->EnableProcessing();
s = Flush();
ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
SyncPoint::GetInstance()->DisableProcessing();
fault_fs_->SetFilesystemActive(true);
s = dbfull()->Resume();
ASSERT_OK(s);
Reopen(options);
ASSERT_EQ("val2", Get(Key(2)));
ASSERT_OK(Put(Key(3), "val3"));
SyncPoint::GetInstance()->SetCallBack(
"BuildTable:BeforeCloseTableFile",
[&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
SyncPoint::GetInstance()->EnableProcessing();
s = Flush();
ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
SyncPoint::GetInstance()->DisableProcessing();
fault_fs_->SetFilesystemActive(true);
s = dbfull()->Resume();
ASSERT_OK(s);
Reopen(options);
ASSERT_EQ("val3", Get(Key(3)));
// not file scope, but retyrable set
error_msg.SetDataLoss(false);
error_msg.SetScope(
ROCKSDB_NAMESPACE::IOStatus::IOErrorScope::kIOErrorScopeFileSystem);
error_msg.SetRetryable(true);
ASSERT_OK(Put(Key(3), "val3"));
SyncPoint::GetInstance()->SetCallBack(
"BuildTable:BeforeCloseTableFile",
[&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
SyncPoint::GetInstance()->EnableProcessing();
s = Flush();
ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
SyncPoint::GetInstance()->DisableProcessing();
fault_fs_->SetFilesystemActive(true);
s = dbfull()->Resume();
ASSERT_OK(s);
Reopen(options);
ASSERT_EQ("val3", Get(Key(3)));
Destroy(options);
}
TEST_F(DBErrorHandlingFSTest, FLushWritNoWALRetryableError1) {
std::shared_ptr<ErrorHandlerFSListener> listener(
new ErrorHandlerFSListener());
@ -453,6 +537,52 @@ TEST_F(DBErrorHandlingFSTest, ManifestWriteRetryableError) {
Close();
}
TEST_F(DBErrorHandlingFSTest, ManifestWriteFileScopeError) {
std::shared_ptr<ErrorHandlerFSListener> listener(
new ErrorHandlerFSListener());
Options options = GetDefaultOptions();
options.env = fault_env_.get();
options.create_if_missing = true;
options.listeners.emplace_back(listener);
options.max_bgerror_resume_count = 0;
Status s;
std::string old_manifest;
std::string new_manifest;
listener->EnableAutoRecovery(false);
DestroyAndReopen(options);
old_manifest = GetManifestNameFromLiveFiles();
IOStatus error_msg = IOStatus::IOError("File Scope Data Loss Error");
error_msg.SetDataLoss(true);
error_msg.SetScope(
ROCKSDB_NAMESPACE::IOStatus::IOErrorScope::kIOErrorScopeFile);
error_msg.SetRetryable(false);
ASSERT_OK(Put(Key(0), "val"));
ASSERT_OK(Flush());
ASSERT_OK(Put(Key(1), "val"));
SyncPoint::GetInstance()->SetCallBack(
"VersionSet::LogAndApply:WriteManifest",
[&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
SyncPoint::GetInstance()->EnableProcessing();
s = Flush();
ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
SyncPoint::GetInstance()->ClearAllCallBacks();
SyncPoint::GetInstance()->DisableProcessing();
fault_fs_->SetFilesystemActive(true);
s = dbfull()->Resume();
ASSERT_OK(s);
new_manifest = GetManifestNameFromLiveFiles();
ASSERT_NE(new_manifest, old_manifest);
Reopen(options);
ASSERT_EQ("val", Get(Key(0)));
ASSERT_EQ("val", Get(Key(1)));
Close();
}
TEST_F(DBErrorHandlingFSTest, ManifestWriteNoWALRetryableError) {
std::shared_ptr<ErrorHandlerFSListener> listener(
new ErrorHandlerFSListener());
@ -779,6 +909,54 @@ TEST_F(DBErrorHandlingFSTest, CompactionWriteRetryableError) {
Destroy(options);
}
TEST_F(DBErrorHandlingFSTest, CompactionWriteFileScopeError) {
std::shared_ptr<ErrorHandlerFSListener> listener(
new ErrorHandlerFSListener());
Options options = GetDefaultOptions();
options.env = fault_env_.get();
options.create_if_missing = true;
options.level0_file_num_compaction_trigger = 2;
options.listeners.emplace_back(listener);
options.max_bgerror_resume_count = 0;
Status s;
DestroyAndReopen(options);
IOStatus error_msg = IOStatus::IOError("File Scope Data Loss Error");
error_msg.SetDataLoss(true);
error_msg.SetScope(
ROCKSDB_NAMESPACE::IOStatus::IOErrorScope::kIOErrorScopeFile);
error_msg.SetRetryable(false);
ASSERT_OK(Put(Key(0), "va;"));
ASSERT_OK(Put(Key(2), "va;"));
s = Flush();
ASSERT_OK(s);
listener->OverrideBGError(Status(error_msg, Status::Severity::kHardError));
listener->EnableAutoRecovery(false);
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
{{"DBImpl::FlushMemTable:FlushMemTableFinished",
"BackgroundCallCompaction:0"}});
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"CompactionJob::OpenCompactionOutputFile",
[&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
ASSERT_OK(Put(Key(1), "val"));
s = Flush();
ASSERT_OK(s);
s = dbfull()->TEST_WaitForCompact();
ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
fault_fs_->SetFilesystemActive(true);
SyncPoint::GetInstance()->ClearAllCallBacks();
SyncPoint::GetInstance()->DisableProcessing();
s = dbfull()->Resume();
ASSERT_OK(s);
Destroy(options);
}
TEST_F(DBErrorHandlingFSTest, CorruptionError) {
Options options = GetDefaultOptions();
options.env = fault_env_.get();

View File

@ -10,13 +10,14 @@
#ifndef ROCKSDB_LITE
namespace ROCKSDB_NAMESPACE {
PeriodicWorkScheduler::PeriodicWorkScheduler(Env* env) {
PeriodicWorkScheduler::PeriodicWorkScheduler(Env* env) : timer_mu_(env) {
timer = std::unique_ptr<Timer>(new Timer(env));
}
void PeriodicWorkScheduler::Register(DBImpl* dbi,
unsigned int stats_dump_period_sec,
unsigned int stats_persist_period_sec) {
MutexLock l(&timer_mu_);
static std::atomic<uint64_t> initial_delay(0);
timer->Start();
if (stats_dump_period_sec > 0) {
@ -41,6 +42,7 @@ void PeriodicWorkScheduler::Register(DBImpl* dbi,
}
void PeriodicWorkScheduler::Unregister(DBImpl* dbi) {
MutexLock l(&timer_mu_);
timer->Cancel(GetTaskName(dbi, "dump_st"));
timer->Cancel(GetTaskName(dbi, "pst_st"));
timer->Cancel(GetTaskName(dbi, "flush_info_log"));
@ -78,7 +80,10 @@ PeriodicWorkTestScheduler* PeriodicWorkTestScheduler::Default(Env* env) {
MutexLock l(&mutex);
if (scheduler.timer.get() != nullptr &&
scheduler.timer->TEST_GetPendingTaskNum() == 0) {
{
MutexLock timer_mu_guard(&scheduler.timer_mu_);
scheduler.timer->Shutdown();
}
scheduler.timer.reset(new Timer(env));
}
}

View File

@ -42,6 +42,12 @@ class PeriodicWorkScheduler {
protected:
std::unique_ptr<Timer> timer;
// `timer_mu_` serves two purposes currently:
// (1) to ensure calls to `Start()` and `Shutdown()` are serialized, as
// they are currently not implemented in a thread-safe way; and
// (2) to ensure the `Timer::Add()`s and `Timer::Start()` run atomically, and
// the `Timer::Cancel()`s and `Timer::Shutdown()` run atomically.
port::Mutex timer_mu_;
explicit PeriodicWorkScheduler(Env* env);

View File

@ -226,13 +226,17 @@ bool VersionEdit::EncodeTo(std::string* dst) const {
}
for (const auto& wal_addition : wal_additions_) {
PutVarint32(dst, kWalAddition);
wal_addition.EncodeTo(dst);
PutVarint32(dst, kWalAddition2);
std::string encoded;
wal_addition.EncodeTo(&encoded);
PutLengthPrefixedSlice(dst, encoded);
}
if (!wal_deletion_.IsEmpty()) {
PutVarint32(dst, kWalDeletion);
wal_deletion_.EncodeTo(dst);
PutVarint32(dst, kWalDeletion2);
std::string encoded;
wal_deletion_.EncodeTo(&encoded);
PutLengthPrefixedSlice(dst, encoded);
}
// 0 is default and does not need to be explicitly written
@ -375,6 +379,11 @@ const char* VersionEdit::DecodeNewFile4From(Slice* input) {
Status VersionEdit::DecodeFrom(const Slice& src) {
Clear();
#ifndef NDEBUG
bool ignore_ignorable_tags = false;
TEST_SYNC_POINT_CALLBACK("VersionEdit::EncodeTo:IgnoreIgnorableTags",
&ignore_ignorable_tags);
#endif
Slice input = src;
const char* msg = nullptr;
uint32_t tag = 0;
@ -385,6 +394,11 @@ Status VersionEdit::DecodeFrom(const Slice& src) {
Slice str;
InternalKey key;
while (msg == nullptr && GetVarint32(&input, &tag)) {
#ifndef NDEBUG
if (ignore_ignorable_tags && tag > kTagSafeIgnoreMask) {
tag = kTagSafeIgnoreMask;
}
#endif
switch (tag) {
case kDbId:
if (GetLengthPrefixedSlice(&input, &str)) {
@ -575,6 +589,23 @@ Status VersionEdit::DecodeFrom(const Slice& src) {
break;
}
case kWalAddition2: {
Slice encoded;
if (!GetLengthPrefixedSlice(&input, &encoded)) {
msg = "WalAddition not prefixed by length";
break;
}
WalAddition wal_addition;
const Status s = wal_addition.DecodeFrom(&encoded);
if (!s.ok()) {
return s;
}
wal_additions_.emplace_back(std::move(wal_addition));
break;
}
case kWalDeletion: {
WalDeletion wal_deletion;
const Status s = wal_deletion.DecodeFrom(&input);
@ -586,6 +617,23 @@ Status VersionEdit::DecodeFrom(const Slice& src) {
break;
}
case kWalDeletion2: {
Slice encoded;
if (!GetLengthPrefixedSlice(&input, &encoded)) {
msg = "WalDeletion not prefixed by length";
break;
}
WalDeletion wal_deletion;
const Status s = wal_deletion.DecodeFrom(&encoded);
if (!s.ok()) {
return s;
}
wal_deletion_ = std::move(wal_deletion);
break;
}
case kColumnFamily:
if (!GetVarint32(&input, &column_family_)) {
if (!msg) {

View File

@ -62,6 +62,8 @@ enum Tag : uint32_t {
kWalAddition,
kWalDeletion,
kFullHistoryTsLow,
kWalAddition2,
kWalDeletion2,
};
enum NewFileCustomTag : uint32_t {

View File

@ -324,14 +324,22 @@ TEST_F(VersionEditTest, AddWalEncodeDecode) {
TestEncodeDecode(edit);
}
static std::string PrefixEncodedWalAdditionWithLength(
const std::string& encoded) {
std::string ret;
PutVarint32(&ret, Tag::kWalAddition2);
PutLengthPrefixedSlice(&ret, encoded);
return ret;
}
TEST_F(VersionEditTest, AddWalDecodeBadLogNumber) {
std::string encoded;
PutVarint32(&encoded, Tag::kWalAddition);
{
// No log number.
std::string encoded_edit = PrefixEncodedWalAdditionWithLength(encoded);
VersionEdit edit;
Status s = edit.DecodeFrom(encoded);
Status s = edit.DecodeFrom(encoded_edit);
ASSERT_TRUE(s.IsCorruption());
ASSERT_TRUE(s.ToString().find("Error decoding WAL log number") !=
std::string::npos)
@ -345,8 +353,10 @@ TEST_F(VersionEditTest, AddWalDecodeBadLogNumber) {
unsigned char* ptr = reinterpret_cast<unsigned char*>(&c);
*ptr = 128;
encoded.append(1, c);
std::string encoded_edit = PrefixEncodedWalAdditionWithLength(encoded);
VersionEdit edit;
Status s = edit.DecodeFrom(encoded);
Status s = edit.DecodeFrom(encoded_edit);
ASSERT_TRUE(s.IsCorruption());
ASSERT_TRUE(s.ToString().find("Error decoding WAL log number") !=
std::string::npos)
@ -358,14 +368,14 @@ TEST_F(VersionEditTest, AddWalDecodeBadTag) {
constexpr WalNumber kLogNumber = 100;
constexpr uint64_t kSizeInBytes = 100;
std::string encoded_without_tag;
PutVarint32(&encoded_without_tag, Tag::kWalAddition);
PutVarint64(&encoded_without_tag, kLogNumber);
std::string encoded;
PutVarint64(&encoded, kLogNumber);
{
// No tag.
std::string encoded_edit = PrefixEncodedWalAdditionWithLength(encoded);
VersionEdit edit;
Status s = edit.DecodeFrom(encoded_without_tag);
Status s = edit.DecodeFrom(encoded_edit);
ASSERT_TRUE(s.IsCorruption());
ASSERT_TRUE(s.ToString().find("Error decoding tag") != std::string::npos)
<< s.ToString();
@ -373,12 +383,15 @@ TEST_F(VersionEditTest, AddWalDecodeBadTag) {
{
// Only has size tag, no terminate tag.
std::string encoded_with_size = encoded_without_tag;
std::string encoded_with_size = encoded;
PutVarint32(&encoded_with_size,
static_cast<uint32_t>(WalAdditionTag::kSyncedSize));
PutVarint64(&encoded_with_size, kSizeInBytes);
std::string encoded_edit =
PrefixEncodedWalAdditionWithLength(encoded_with_size);
VersionEdit edit;
Status s = edit.DecodeFrom(encoded_with_size);
Status s = edit.DecodeFrom(encoded_edit);
ASSERT_TRUE(s.IsCorruption());
ASSERT_TRUE(s.ToString().find("Error decoding tag") != std::string::npos)
<< s.ToString();
@ -386,11 +399,14 @@ TEST_F(VersionEditTest, AddWalDecodeBadTag) {
{
// Only has terminate tag.
std::string encoded_with_terminate = encoded_without_tag;
std::string encoded_with_terminate = encoded;
PutVarint32(&encoded_with_terminate,
static_cast<uint32_t>(WalAdditionTag::kTerminate));
std::string encoded_edit =
PrefixEncodedWalAdditionWithLength(encoded_with_terminate);
VersionEdit edit;
ASSERT_OK(edit.DecodeFrom(encoded_with_terminate));
ASSERT_OK(edit.DecodeFrom(encoded_edit));
auto& wal_addition = edit.GetWalAdditions()[0];
ASSERT_EQ(wal_addition.GetLogNumber(), kLogNumber);
ASSERT_FALSE(wal_addition.GetMetadata().HasSyncedSize());
@ -401,15 +417,15 @@ TEST_F(VersionEditTest, AddWalDecodeNoSize) {
constexpr WalNumber kLogNumber = 100;
std::string encoded;
PutVarint32(&encoded, Tag::kWalAddition);
PutVarint64(&encoded, kLogNumber);
PutVarint32(&encoded, static_cast<uint32_t>(WalAdditionTag::kSyncedSize));
// No real size after the size tag.
{
// Without terminate tag.
std::string encoded_edit = PrefixEncodedWalAdditionWithLength(encoded);
VersionEdit edit;
Status s = edit.DecodeFrom(encoded);
Status s = edit.DecodeFrom(encoded_edit);
ASSERT_TRUE(s.IsCorruption());
ASSERT_TRUE(s.ToString().find("Error decoding WAL file size") !=
std::string::npos)
@ -419,8 +435,10 @@ TEST_F(VersionEditTest, AddWalDecodeNoSize) {
{
// With terminate tag.
PutVarint32(&encoded, static_cast<uint32_t>(WalAdditionTag::kTerminate));
std::string encoded_edit = PrefixEncodedWalAdditionWithLength(encoded);
VersionEdit edit;
Status s = edit.DecodeFrom(encoded);
Status s = edit.DecodeFrom(encoded_edit);
ASSERT_TRUE(s.IsCorruption());
// The terminate tag is misunderstood as the size.
ASSERT_TRUE(s.ToString().find("Error decoding tag") != std::string::npos)
@ -515,6 +533,62 @@ TEST_F(VersionEditTest, FullHistoryTsLow) {
TestEncodeDecode(edit);
}
// Tests that if RocksDB is downgraded, the new types of VersionEdits
// that have a tag larger than kTagSafeIgnoreMask can be safely ignored.
TEST_F(VersionEditTest, IgnorableTags) {
SyncPoint::GetInstance()->SetCallBack(
"VersionEdit::EncodeTo:IgnoreIgnorableTags", [&](void* arg) {
bool* ignore = static_cast<bool*>(arg);
*ignore = true;
});
SyncPoint::GetInstance()->EnableProcessing();
constexpr uint64_t kPrevLogNumber = 100;
constexpr uint64_t kLogNumber = 200;
constexpr uint64_t kNextFileNumber = 300;
constexpr uint64_t kColumnFamilyId = 400;
VersionEdit edit;
// Add some ignorable entries.
for (int i = 0; i < 2; i++) {
edit.AddWal(i + 1, WalMetadata(i + 2));
}
edit.SetDBId("db_id");
// Add unignorable entries.
edit.SetPrevLogNumber(kPrevLogNumber);
edit.SetLogNumber(kLogNumber);
// Add more ignorable entries.
edit.DeleteWalsBefore(100);
// Add unignorable entry.
edit.SetNextFile(kNextFileNumber);
// Add more ignorable entries.
edit.SetFullHistoryTsLow("ts");
// Add unignorable entry.
edit.SetColumnFamily(kColumnFamilyId);
std::string encoded;
ASSERT_TRUE(edit.EncodeTo(&encoded));
VersionEdit decoded;
ASSERT_OK(decoded.DecodeFrom(encoded));
// Check that all ignorable entries are ignored.
ASSERT_FALSE(decoded.HasDbId());
ASSERT_FALSE(decoded.HasFullHistoryTsLow());
ASSERT_FALSE(decoded.IsWalAddition());
ASSERT_FALSE(decoded.IsWalDeletion());
ASSERT_TRUE(decoded.GetWalAdditions().empty());
ASSERT_TRUE(decoded.GetWalDeletion().IsEmpty());
// Check that unignorable entries are still present.
ASSERT_EQ(edit.GetPrevLogNumber(), kPrevLogNumber);
ASSERT_EQ(edit.GetLogNumber(), kLogNumber);
ASSERT_EQ(edit.GetNextFile(), kNextFileNumber);
ASSERT_EQ(edit.GetColumnFamily(), kColumnFamilyId);
SyncPoint::GetInstance()->DisableProcessing();
}
} // namespace ROCKSDB_NAMESPACE
int main(int argc, char** argv) {

View File

@ -51,6 +51,8 @@ struct OptimisticTransactionDBOptions {
uint32_t occ_lock_buckets = (1 << 20);
};
// Range deletions (including those in `WriteBatch`es passed to `Write()`) are
// incompatible with `OptimisticTransactionDB` and will return a non-OK `Status`
class OptimisticTransactionDB : public StackableDB {
public:
// Open an OptimisticTransactionDB similar to DB::Open().

View File

@ -340,6 +340,17 @@ class TransactionDB : public StackableDB {
// falls back to the un-optimized version of ::Write
return Write(opts, updates);
}
// Transactional `DeleteRange()` is not yet supported.
// However, users who know their deleted range does not conflict with
// anything can still use it via the `Write()` API. In all cases, the
// `Write()` overload specifying `TransactionDBWriteOptimizations` must be
// used and `skip_concurrency_control` must be set. When using either
// WRITE_PREPARED or WRITE_UNPREPARED , `skip_duplicate_key_check` must
// additionally be set.
virtual Status DeleteRange(const WriteOptions&, ColumnFamilyHandle*,
const Slice&, const Slice&) override {
return Status::NotSupported();
}
// Open a TransactionDB similar to DB::Open().
// Internally call PrepareWrap() and WrapDB()
// If the return status is not ok, then dbptr is set to nullptr.

View File

@ -6,7 +6,7 @@
#define ROCKSDB_MAJOR 6
#define ROCKSDB_MINOR 16
#define ROCKSDB_PATCH 0
#define ROCKSDB_PATCH 5
// Do not use these. We made the mistake of declaring macros starting with
// double underscore. Now we have to live with our choice. We'll deprecate these

26
src.mk
View File

@ -255,18 +255,6 @@ LIB_SOURCES = \
utilities/transactions/lock/lock_manager.cc \
utilities/transactions/lock/point/point_lock_tracker.cc \
utilities/transactions/lock/point/point_lock_manager.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/concurrent_tree.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/keyrange.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/lock_request.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/locktree.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/manager.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/range_buffer.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/treenode.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/txnid_set.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/wfg.cc \
utilities/transactions/lock/range/range_tree/lib/standalone_port.cc \
utilities/transactions/lock/range/range_tree/lib/util/dbt.cc \
utilities/transactions/lock/range/range_tree/lib/util/memarena.cc \
utilities/transactions/optimistic_transaction.cc \
utilities/transactions/optimistic_transaction_db_impl.cc \
utilities/transactions/pessimistic_transaction.cc \
@ -298,6 +286,20 @@ LIB_SOURCES_ASM =
LIB_SOURCES_C =
endif
RANGE_TREE_SOURCES =\
utilities/transactions/lock/range/range_tree/lib/locktree/concurrent_tree.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/keyrange.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/lock_request.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/locktree.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/manager.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/range_buffer.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/treenode.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/txnid_set.cc \
utilities/transactions/lock/range/range_tree/lib/locktree/wfg.cc \
utilities/transactions/lock/range/range_tree/lib/standalone_port.cc \
utilities/transactions/lock/range/range_tree/lib/util/dbt.cc \
utilities/transactions/lock/range/range_tree/lib/util/memarena.cc
TOOL_LIB_SOURCES = \
tools/io_tracer_parser_tool.cc \
tools/ldb_cmd.cc \

View File

@ -22,6 +22,9 @@ namespace ROCKSDB_NAMESPACE {
// A Timer class to handle repeated work.
//
// `Start()` and `Shutdown()` are currently not thread-safe. The client must
// serialize calls to these two member functions.
//
// A single timer instance can handle multiple functions via a single thread.
// It is better to leave long running work to a dedicated thread pool.
//

View File

@ -70,6 +70,16 @@ class DummyDB : public StackableDB {
DBOptions GetDBOptions() const override { return DBOptions(options_); }
using StackableDB::GetIntProperty;
bool GetIntProperty(ColumnFamilyHandle*, const Slice& property,
uint64_t* value) override {
if (property == DB::Properties::kMinLogNumberToKeep) {
*value = 1;
return true;
}
return false;
}
Status EnableFileDeletions(bool /*force*/) override {
EXPECT_TRUE(!deletions_enabled_);
deletions_enabled_ = true;

View File

@ -232,32 +232,22 @@ Status CheckpointImpl::CreateCustomCheckpoint(
// this will return live_files prefixed with "/"
s = db_->GetLiveFiles(live_files, &manifest_file_size, flush_memtable);
if (s.ok() && db_options.allow_2pc) {
// If 2PC is enabled, we need to get minimum log number after the flush.
// Need to refetch the live files to recapture the snapshot.
if (!db_->GetIntProperty(DB::Properties::kMinLogNumberToKeep,
&min_log_num)) {
return Status::InvalidArgument(
"2PC enabled but cannot fine the min log number to keep.");
if (!db_->GetIntProperty(DB::Properties::kMinLogNumberToKeep, &min_log_num)) {
return Status::InvalidArgument("cannot get the min log number to keep.");
}
// We need to refetch live files with flush to handle this case:
// A previous 000001.log contains the prepare record of transaction tnx1.
// The current log file is 000002.log, and sequence_number points to this
// file.
// After calling GetLiveFiles(), 000003.log is created.
// Then tnx1 is committed. The commit record is written to 000003.log.
// Now we fetch min_log_num, which will be 3.
// Then only 000002.log and 000003.log will be copied, and 000001.log will
// be skipped. 000003.log contains commit message of tnx1, but we don't
// have respective prepare record for it.
// In order to avoid this situation, we need to force flush to make sure
// all transactions committed before getting min_log_num will be flushed
// to SST files.
// We cannot get min_log_num before calling the GetLiveFiles() for the
// first time, because if we do that, all the logs files will be included,
// far more than needed.
// Between GetLiveFiles and getting min_log_num, flush might happen
// concurrently, so new WAL deletions might be tracked in MANIFEST. If we do
// not get the new MANIFEST size, the deleted WALs might not be reflected in
// the checkpoint's MANIFEST.
//
// If we get min_log_num before the above GetLiveFiles, then there might
// be too many unnecessary WALs to be included in the checkpoint.
//
// Ideally, min_log_num should be got together with manifest_file_size in
// GetLiveFiles atomically. But that needs changes to GetLiveFiles' signature
// which is a public API.
s = db_->GetLiveFiles(live_files, &manifest_file_size, flush_memtable);
}
TEST_SYNC_POINT("CheckpointImpl::CreateCheckpoint:FlushDone");
TEST_SYNC_POINT("CheckpointImpl::CreateCheckpoint:SavedLiveFiles1");
TEST_SYNC_POINT("CheckpointImpl::CreateCheckpoint:SavedLiveFiles2");
@ -385,7 +375,6 @@ Status CheckpointImpl::CreateCustomCheckpoint(
for (size_t i = 0; s.ok() && i < wal_size; ++i) {
if ((live_wal_files[i]->Type() == kAliveLogFile) &&
(!flush_memtable ||
live_wal_files[i]->StartSequence() >= *sequence_number ||
live_wal_files[i]->LogNumber() >= min_log_num)) {
if (i + 1 == wal_size) {
s = copy_file_cb(db_options.wal_dir, live_wal_files[i]->PathName(),

View File

@ -549,7 +549,7 @@ TEST_F(CheckpointTest, CurrentFileModifiedWhileCheckpointing) {
{// Get past the flush in the checkpoint thread before adding any keys to
// the db so the checkpoint thread won't hit the WriteManifest
// syncpoints.
{"DBImpl::GetLiveFiles:1",
{"CheckpointImpl::CreateCheckpoint:FlushDone",
"CheckpointTest::CurrentFileModifiedWhileCheckpointing:PrePut"},
// Roll the manifest during checkpointing right after live files are
// snapshotted.

View File

@ -31,6 +31,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -47,6 +47,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -47,6 +47,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -47,6 +47,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -47,6 +47,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -47,6 +47,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -47,6 +47,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -47,6 +47,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -47,6 +47,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -1,3 +1,49 @@
/*======
This file is part of PerconaFT.
Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
PerconaFT is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License, version 2,
as published by the Free Software Foundation.
PerconaFT is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
PerconaFT is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License, version 3,
as published by the Free Software Foundation.
PerconaFT is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#pragma once
#include <stdio.h> // FILE

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \
@ -139,7 +153,12 @@ typedef struct toku_mutex_aligned {
{ .pmutex = PTHREAD_MUTEX_INITIALIZER, .psi_mutex = nullptr }
#endif // defined(TOKU_PTHREAD_DEBUG)
#else // __FreeBSD__, __linux__, at least
#if defined(__GLIBC__)
#define TOKU_MUTEX_ADAPTIVE PTHREAD_MUTEX_ADAPTIVE_NP
#else
// not all libc (e.g. musl) implement NP (Non-POSIX) attributes
#define TOKU_MUTEX_ADAPTIVE PTHREAD_MUTEX_DEFAULT
#endif
#if defined(TOKU_PTHREAD_DEBUG)
#define TOKU_ADAPTIVE_MUTEX_INITIALIZER \
{ \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -34,6 +34,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -45,6 +45,7 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -32,6 +32,20 @@ Copyright (c) 2006, 2015, Percona and/or its affiliates. All rights reserved.
You should have received a copy of the GNU Affero General Public License
along with PerconaFT. If not, see <http://www.gnu.org/licenses/>.
----------------------------------------
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
======= */
#ident \

View File

@ -46,6 +46,22 @@ class OptimisticTransactionDBImpl : public OptimisticTransactionDB {
const OptimisticTransactionOptions& txn_options,
Transaction* old_txn) override;
// Transactional `DeleteRange()` is not yet supported.
virtual Status DeleteRange(const WriteOptions&, ColumnFamilyHandle*,
const Slice&, const Slice&) override {
return Status::NotSupported();
}
// Range deletions also must not be snuck into `WriteBatch`es as they are
// incompatible with `OptimisticTransactionDB`.
virtual Status Write(const WriteOptions& write_opts,
WriteBatch* batch) override {
if (batch->HasDeleteRange()) {
return Status::NotSupported();
}
return OptimisticTransactionDB::Write(write_opts, batch);
}
size_t GetLockBucketsSize() const { return bucketed_locks_.size(); }
OccValidationPolicy GetValidatePolicy() const { return validate_policy_; }

View File

@ -1033,6 +1033,17 @@ TEST_P(OptimisticTransactionTest, IteratorTest) {
delete txn;
}
TEST_P(OptimisticTransactionTest, DeleteRangeSupportTest) {
// `OptimisticTransactionDB` does not allow range deletion in any API.
ASSERT_TRUE(
txn_db
->DeleteRange(WriteOptions(), txn_db->DefaultColumnFamily(), "a", "b")
.IsNotSupported());
WriteBatch wb;
ASSERT_OK(wb.DeleteRange("a", "b"));
ASSERT_NOK(txn_db->Write(WriteOptions(), &wb));
}
TEST_P(OptimisticTransactionTest, SavepointTest) {
WriteOptions write_options;
ReadOptions read_options, snapshot_read_options;

View File

@ -4835,6 +4835,56 @@ TEST_P(TransactionTest, MergeTest) {
ASSERT_EQ("a,3", value);
}
TEST_P(TransactionTest, DeleteRangeSupportTest) {
// The `DeleteRange()` API is banned everywhere.
ASSERT_TRUE(
db->DeleteRange(WriteOptions(), db->DefaultColumnFamily(), "a", "b")
.IsNotSupported());
// But range deletions can be added via the `Write()` API by specifying the
// proper flags to promise there are no conflicts according to the DB type
// (see `TransactionDB::DeleteRange()` API doc for details).
for (bool skip_concurrency_control : {false, true}) {
for (bool skip_duplicate_key_check : {false, true}) {
ASSERT_OK(db->Put(WriteOptions(), "a", "val"));
WriteBatch wb;
ASSERT_OK(wb.DeleteRange("a", "b"));
TransactionDBWriteOptimizations flags;
flags.skip_concurrency_control = skip_concurrency_control;
flags.skip_duplicate_key_check = skip_duplicate_key_check;
Status s = db->Write(WriteOptions(), flags, &wb);
std::string value;
switch (txn_db_options.write_policy) {
case WRITE_COMMITTED:
if (skip_concurrency_control) {
ASSERT_OK(s);
ASSERT_TRUE(db->Get(ReadOptions(), "a", &value).IsNotFound());
} else {
ASSERT_NOK(s);
ASSERT_OK(db->Get(ReadOptions(), "a", &value));
}
break;
case WRITE_PREPARED:
// Intentional fall-through
case WRITE_UNPREPARED:
if (skip_concurrency_control && skip_duplicate_key_check) {
ASSERT_OK(s);
ASSERT_TRUE(db->Get(ReadOptions(), "a", &value).IsNotFound());
} else {
ASSERT_NOK(s);
ASSERT_OK(db->Get(ReadOptions(), "a", &value));
}
break;
}
// Without any promises from the user, range deletion via other `Write()`
// APIs are still banned.
ASSERT_OK(db->Put(WriteOptions(), "a", "val"));
ASSERT_NOK(db->Write(WriteOptions(), &wb));
ASSERT_OK(db->Get(ReadOptions(), "a", &value));
}
}
}
TEST_P(TransactionTest, DeferSnapshotTest) {
WriteOptions write_options;
ReadOptions read_options;

View File

@ -157,7 +157,9 @@ Status WritePreparedTxnDB::WriteInternal(const WriteOptions& write_options_orig,
// TODO(myabandeh): add an option to allow user skipping this cost
SubBatchCounter counter(*GetCFComparatorMap());
auto s = batch->Iterate(&counter);
assert(s.ok());
if (!s.ok()) {
return s;
}
batch_cnt = counter.BatchCount();
WPRecordTick(TXN_DUPLICATE_KEY_OVERHEAD);
ROCKS_LOG_DETAILS(info_log_, "Duplicate key overhead: %" PRIu64 " batches",