0bd4dcde6b
Summary: **This PR does not affect the functionality of `DB` and write-committed transactions.** `CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed. As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if it is committed. In fact, the implementation of `KeyCommitted()` is as follows: ``` inline bool KeyCommitted(SequenceNumber seq) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr || snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot; } ``` With that being said, we focus on write-prepared/write-unprepared transactions. A few notes: - A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database. - `CompactionIterator` outputs a key as long as the key is uncommitted. Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`. Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone. To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot to determine whether a key is committed or not with minor change to `KeyCommitted()`. ``` inline bool KeyCommitted(SequenceNumber sequence) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr || snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) == SnapshotCheckerResult::kInSnapshot; } ``` As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble for `CompactionIterator`s assertions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35561162 Pulled By: riversand963 fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
75 lines
3.0 KiB
C++
75 lines
3.0 KiB
C++
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
#pragma once
|
|
#include <string>
|
|
#include <utility>
|
|
#include <vector>
|
|
#include "db/range_tombstone_fragmenter.h"
|
|
#include "db/table_properties_collector.h"
|
|
#include "logging/event_logger.h"
|
|
#include "options/cf_options.h"
|
|
#include "rocksdb/comparator.h"
|
|
#include "rocksdb/env.h"
|
|
#include "rocksdb/listener.h"
|
|
#include "rocksdb/options.h"
|
|
#include "rocksdb/status.h"
|
|
#include "rocksdb/table_properties.h"
|
|
#include "rocksdb/types.h"
|
|
#include "table/scoped_arena_iterator.h"
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
struct FileMetaData;
|
|
|
|
class VersionSet;
|
|
class BlobFileAddition;
|
|
class SnapshotChecker;
|
|
class TableCache;
|
|
class TableBuilder;
|
|
class WritableFileWriter;
|
|
class InternalStats;
|
|
class BlobFileCompletionCallback;
|
|
|
|
// Convenience function for NewTableBuilder on the embedded table_factory.
|
|
TableBuilder* NewTableBuilder(const TableBuilderOptions& tboptions,
|
|
WritableFileWriter* file);
|
|
|
|
// Build a Table file from the contents of *iter. The generated file
|
|
// will be named according to number specified in meta. On success, the rest of
|
|
// *meta will be filled with metadata about the generated table.
|
|
// If no data is present in *iter, meta->file_size will be set to
|
|
// zero, and no Table file will be produced.
|
|
//
|
|
// @param column_family_name Name of the column family that is also identified
|
|
// by column_family_id, or empty string if unknown.
|
|
extern Status BuildTable(
|
|
const std::string& dbname, VersionSet* versions,
|
|
const ImmutableDBOptions& db_options, const TableBuilderOptions& tboptions,
|
|
const FileOptions& file_options, TableCache* table_cache,
|
|
InternalIterator* iter,
|
|
std::vector<std::unique_ptr<FragmentedRangeTombstoneIterator>>
|
|
range_del_iters,
|
|
FileMetaData* meta, std::vector<BlobFileAddition>* blob_file_additions,
|
|
std::vector<SequenceNumber> snapshots,
|
|
SequenceNumber earliest_write_conflict_snapshot,
|
|
SequenceNumber job_snapshot, SnapshotChecker* snapshot_checker,
|
|
bool paranoid_file_checks, InternalStats* internal_stats,
|
|
IOStatus* io_status, const std::shared_ptr<IOTracer>& io_tracer,
|
|
BlobFileCreationReason blob_creation_reason,
|
|
EventLogger* event_logger = nullptr, int job_id = 0,
|
|
const Env::IOPriority io_priority = Env::IO_HIGH,
|
|
TableProperties* table_properties = nullptr,
|
|
Env::WriteLifeTimeHint write_hint = Env::WLTH_NOT_SET,
|
|
const std::string* full_history_ts_low = nullptr,
|
|
BlobFileCompletionCallback* blob_callback = nullptr,
|
|
uint64_t* num_input_entries = nullptr,
|
|
uint64_t* memtable_payload_bytes = nullptr,
|
|
uint64_t* memtable_garbage_bytes = nullptr);
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|