de85e4cadf
Summary: The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL. 1. RecoverAfterRestart (kTolerateCorruptedTailRecords) This mocks the current recovery mode. 2. RecoverAfterCleanShutdown (kAbsoluteConsistency) This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes. 3. RecoverPointInTime (kPointInTimeRecovery) This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write. 4. RecoverAfterDisaster (kSkipAnyCorruptRecord) This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible. Test Plan: (1) Run added unit test to cover all levels. (2) Run make check. Reviewers: leveldb, sdong, igor Subscribers: yoshinorim, dhruba Differential Revision: https://reviews.facebook.net/D38487
139 lines
4.8 KiB
C++
139 lines
4.8 KiB
C++
// Copyright (c) 2013, Facebook, Inc. All rights reserved.
|
|
// This source code is licensed under the BSD-style license found in the
|
|
// LICENSE file in the root directory of this source tree. An additional grant
|
|
// of patent rights can be found in the PATENTS file in the same directory.
|
|
//
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
#pragma once
|
|
#include <memory>
|
|
#include <stdint.h>
|
|
|
|
#include "db/log_format.h"
|
|
#include "rocksdb/slice.h"
|
|
#include "rocksdb/status.h"
|
|
|
|
namespace rocksdb {
|
|
|
|
class SequentialFile;
|
|
using std::unique_ptr;
|
|
|
|
namespace log {
|
|
|
|
/**
|
|
* Reader is a general purpose log stream reader implementation. The actual job
|
|
* of reading from the device is implemented by the SequentialFile interface.
|
|
*
|
|
* Please see Writer for details on the file and record layout.
|
|
*/
|
|
class Reader {
|
|
public:
|
|
// Interface for reporting errors.
|
|
class Reporter {
|
|
public:
|
|
virtual ~Reporter();
|
|
|
|
// Some corruption was detected. "size" is the approximate number
|
|
// of bytes dropped due to the corruption.
|
|
virtual void Corruption(size_t bytes, const Status& status) = 0;
|
|
};
|
|
|
|
// Create a reader that will return log records from "*file".
|
|
// "*file" must remain live while this Reader is in use.
|
|
//
|
|
// If "reporter" is non-nullptr, it is notified whenever some data is
|
|
// dropped due to a detected corruption. "*reporter" must remain
|
|
// live while this Reader is in use.
|
|
//
|
|
// If "checksum" is true, verify checksums if available.
|
|
//
|
|
// The Reader will start reading at the first record located at physical
|
|
// position >= initial_offset within the file.
|
|
Reader(unique_ptr<SequentialFile>&& file, Reporter* reporter,
|
|
bool checksum, uint64_t initial_offset);
|
|
|
|
~Reader();
|
|
|
|
// Read the next record into *record. Returns true if read
|
|
// successfully, false if we hit end of the input. May use
|
|
// "*scratch" as temporary storage. The contents filled in *record
|
|
// will only be valid until the next mutating operation on this
|
|
// reader or the next mutation to *scratch.
|
|
bool ReadRecord(Slice* record, std::string* scratch,
|
|
bool report_eof_inconsistency = false);
|
|
|
|
// Returns the physical offset of the last record returned by ReadRecord.
|
|
//
|
|
// Undefined before the first call to ReadRecord.
|
|
uint64_t LastRecordOffset();
|
|
|
|
// returns true if the reader has encountered an eof condition.
|
|
bool IsEOF() {
|
|
return eof_;
|
|
}
|
|
|
|
// when we know more data has been written to the file. we can use this
|
|
// function to force the reader to look again in the file.
|
|
// Also aligns the file position indicator to the start of the next block
|
|
// by reading the rest of the data from the EOF position to the end of the
|
|
// block that was partially read.
|
|
void UnmarkEOF();
|
|
|
|
SequentialFile* file() { return file_.get(); }
|
|
|
|
private:
|
|
const unique_ptr<SequentialFile> file_;
|
|
Reporter* const reporter_;
|
|
bool const checksum_;
|
|
char* const backing_store_;
|
|
Slice buffer_;
|
|
bool eof_; // Last Read() indicated EOF by returning < kBlockSize
|
|
bool read_error_; // Error occurred while reading from file
|
|
|
|
// Offset of the file position indicator within the last block when an
|
|
// EOF was detected.
|
|
size_t eof_offset_;
|
|
|
|
// Offset of the last record returned by ReadRecord.
|
|
uint64_t last_record_offset_;
|
|
// Offset of the first location past the end of buffer_.
|
|
uint64_t end_of_buffer_offset_;
|
|
|
|
// Offset at which to start looking for the first record to return
|
|
uint64_t const initial_offset_;
|
|
|
|
// Extend record types with the following special values
|
|
enum {
|
|
kEof = kMaxRecordType + 1,
|
|
// Returned whenever we find an invalid physical record.
|
|
// Currently there are three situations in which this happens:
|
|
// * The record has an invalid CRC (ReadPhysicalRecord reports a drop)
|
|
// * The record is a 0-length record (No drop is reported)
|
|
// * The record is below constructor's initial_offset (No drop is reported)
|
|
kBadRecord = kMaxRecordType + 2
|
|
};
|
|
|
|
// Skips all blocks that are completely before "initial_offset_".
|
|
//
|
|
// Returns true on success. Handles reporting.
|
|
bool SkipToInitialBlock();
|
|
|
|
// Return type, or one of the preceding special values
|
|
unsigned int ReadPhysicalRecord(Slice* result,
|
|
bool report_eof_inconsistency = false);
|
|
|
|
// Reports dropped bytes to the reporter.
|
|
// buffer_ must be updated to remove the dropped bytes prior to invocation.
|
|
void ReportCorruption(size_t bytes, const char* reason);
|
|
void ReportDrop(size_t bytes, const Status& reason);
|
|
|
|
// No copying allowed
|
|
Reader(const Reader&);
|
|
void operator=(const Reader&);
|
|
};
|
|
|
|
} // namespace log
|
|
} // namespace rocksdb
|