Sync ingested files only if reopen is supported by the FS (#8296)

Summary:
Some file systems (especially distributed FS) do not support reopening a file for writing. The ExternalSstFileIngestionJob calls ReopenWritableFile in order to sync the ingested file, which typically makes sense only on a local file system with a page cache (i.e Posix). So this change tries to sync the ingested file only if ReopenWritableFile doesn't return Status::NotSupported().

Tests:
Add a new unit test in external_sst_file_basic_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8296

Reviewed By: jay-zhuang

Differential Revision: D28420865

Pulled By: anand1976

fbshipit-source-id: 380e7f5ff95324997f7a59864a9ac96ebbd0100c
This commit is contained in:
anand76 2021-05-18 19:32:55 -07:00 committed by Facebook GitHub Bot
parent 60e5af83c1
commit 9d61a0856d
3 changed files with 56 additions and 11 deletions

View File

@ -7,6 +7,7 @@
* Fixed the false-positive alert when recovering from the WAL file. Avoid reporting "SST file is ahead of WAL" on a newly created empty column family, if the previous WAL file is corrupted. * Fixed the false-positive alert when recovering from the WAL file. Avoid reporting "SST file is ahead of WAL" on a newly created empty column family, if the previous WAL file is corrupted.
* Fixed a bug where `GetLiveFiles()` output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use `GetLiveFiles()`, failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and `fail_if_options_file_error == false`. Read-only DBs were impacted when no OPTIONS files existed. * Fixed a bug where `GetLiveFiles()` output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use `GetLiveFiles()`, failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and `fail_if_options_file_error == false`. Read-only DBs were impacted when no OPTIONS files existed.
* Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe(). * Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe().
* In the IngestExternalFile() API, only try to sync the ingested file if the file is linked and the FileSystem/Env supports reopening a writable file.
### Behavior Changes ### Behavior Changes
* Due to the fix of false-postive alert of "SST file is ahead of WAL", all the CFs with no SST file (CF empty) will bypass the consistency check. We fixed a false-positive, but introduced a very rare true-negative which will be triggered in the following conditions: A CF with some delete operations in the last a few queries which will result in an empty CF (those are flushed to SST file and a compaction triggered which combines this file and all other SST files and generates an empty CF, or there is another reason to write a manifest entry for this CF after a flush that generates no SST file from an empty CF). The deletion entries are logged in a WAL and this WAL was corrupted, while the CF's log number points to the next WAL (due to the flush). Therefore, the DB can only recover to the point without these trailing deletions and cause the inconsistent DB status. * Due to the fix of false-postive alert of "SST file is ahead of WAL", all the CFs with no SST file (CF empty) will bypass the consistency check. We fixed a false-positive, but introduced a very rare true-negative which will be triggered in the following conditions: A CF with some delete operations in the last a few queries which will result in an empty CF (those are flushed to SST file and a compaction triggered which combines this file and all other SST files and generates an empty CF, or there is another reason to write a manifest entry for this CF after a flush that generates no SST file from an empty CF). The deletion entries are logged in a WAL and this WAL was corrupted, while the CF's log number points to the next WAL (due to the flush). Therefore, the DB can only recover to the point without these trailing deletions and cause the inconsistent DB status.

View File

@ -1136,6 +1136,41 @@ TEST_F(ExternalSSTFileBasicTest, SyncFailure) {
} }
} }
TEST_F(ExternalSSTFileBasicTest, ReopenNotSupported) {
Options options;
options.create_if_missing = true;
options.env = env_;
SyncPoint::GetInstance()->SetCallBack(
"ExternalSstFileIngestionJob::Prepare:Reopen", [&](void* arg) {
Status* s = static_cast<Status*>(arg);
*s = Status::NotSupported();
});
SyncPoint::GetInstance()->EnableProcessing();
DestroyAndReopen(options);
Options sst_file_writer_options;
sst_file_writer_options.env = env_;
std::unique_ptr<SstFileWriter> sst_file_writer(
new SstFileWriter(EnvOptions(), sst_file_writer_options));
std::string file_name =
sst_files_dir_ + "reopen_not_supported_test_" + ".sst";
ASSERT_OK(sst_file_writer->Open(file_name));
ASSERT_OK(sst_file_writer->Put("bar", "v2"));
ASSERT_OK(sst_file_writer->Finish());
IngestExternalFileOptions ingest_opt;
ingest_opt.move_files = true;
const Snapshot* snapshot = db_->GetSnapshot();
ASSERT_OK(db_->IngestExternalFile({file_name}, ingest_opt));
db_->ReleaseSnapshot(snapshot);
SyncPoint::GetInstance()->DisableProcessing();
SyncPoint::GetInstance()->ClearAllCallBacks();
Destroy(options);
}
TEST_F(ExternalSSTFileBasicTest, VerifyChecksumReadahead) { TEST_F(ExternalSSTFileBasicTest, VerifyChecksumReadahead) {
Options options; Options options;
options.create_if_missing = true; options.create_if_missing = true;

View File

@ -109,19 +109,28 @@ Status ExternalSstFileIngestionJob::Prepare(
// directory before ingest the file. For integrity of RocksDB we need // directory before ingest the file. For integrity of RocksDB we need
// to sync the file. // to sync the file.
std::unique_ptr<FSWritableFile> file_to_sync; std::unique_ptr<FSWritableFile> file_to_sync;
status = fs_->ReopenWritableFile(path_inside_db, env_options_, Status s = fs_->ReopenWritableFile(path_inside_db, env_options_,
&file_to_sync, nullptr); &file_to_sync, nullptr);
TEST_SYNC_POINT_CALLBACK("ExternalSstFileIngestionJob::Prepare:Reopen",
&s);
// Some file systems (especially remote/distributed) don't support
// reopening a file for writing and don't require reopening and
// syncing the file. Ignore the NotSupported error in that case.
if (!s.IsNotSupported()) {
status = s;
if (status.ok()) { if (status.ok()) {
TEST_SYNC_POINT( TEST_SYNC_POINT(
"ExternalSstFileIngestionJob::BeforeSyncIngestedFile"); "ExternalSstFileIngestionJob::BeforeSyncIngestedFile");
status = SyncIngestedFile(file_to_sync.get()); status = SyncIngestedFile(file_to_sync.get());
TEST_SYNC_POINT("ExternalSstFileIngestionJob::AfterSyncIngestedFile"); TEST_SYNC_POINT(
"ExternalSstFileIngestionJob::AfterSyncIngestedFile");
if (!status.ok()) { if (!status.ok()) {
ROCKS_LOG_WARN(db_options_.info_log, ROCKS_LOG_WARN(db_options_.info_log,
"Failed to sync ingested file %s: %s", "Failed to sync ingested file %s: %s",
path_inside_db.c_str(), status.ToString().c_str()); path_inside_db.c_str(), status.ToString().c_str());
} }
} }
}
} else if (status.IsNotSupported() && } else if (status.IsNotSupported() &&
ingestion_options_.failed_move_fall_back_to_copy) { ingestion_options_.failed_move_fall_back_to_copy) {
// Original file is on a different FS, use copy instead of hard linking. // Original file is on a different FS, use copy instead of hard linking.