update HISTORY.md and bump version for 6.19.4

Fix the false positive alert of CF consistency check in WAL recovery (#8207 )
Summary: In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue. Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here. Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207 Test Plan: added unit test, make crash_test Reviewed By: riversand963 Differential Revision: D27898839 Pulled By: zhichao-cao fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
2021-04-23 15:42:04 -07:00 · 2021-04-23 15:39:35 -07:00 · 2021-04-23 15:37:41 -07:00 · 2021-04-20 15:24:25 -07:00 · 2021-04-19 19:57:42 -07:00 · 2021-04-19 19:56:59 -07:00
34 changed files with 513 additions and 311 deletions
--- a/HISTORY.md
+++ b/HISTORY.md
@ -1,4 +1,24 @@
 # Rocksdb Change Log
+## 6.19.4 (04/23/2021)
+### Bug Fixes
+* Fixed a bug where ingested files were written with incorrect boundary key metadata. In rare cases this could have led to a level's files being wrongly ordered and queries for the boundary keys returning wrong results.
+* Fixed the false-positive alert when recovering from the WAL file. Avoid reporting "SST file is ahead of WAL" on a newly created empty column family, if the previous WAL file is corrupted.
+
+### Behavior Changes
+* Due to the fix of false-postive alert of "SST file is ahead of WAL", all the CFs with no SST file (CF empty) will bypass the consistency check. We fixed a false-positive, but introduced a very rare true-negative which will be triggered in the following conditions: A CF with some delete operations in the last a few queries which will result in an empty CF (those are flushed to SST file and a compaction triggered which combines this file and all other SST files and generates an empty CF, or there is another reason to write a manifest entry for this CF after a flush that generates no SST file from an empty CF). The deletion entries are logged in a WAL and this WAL was corrupted, while the CF's log number points to the next WAL (due to the flush). Therefore, the DB can only recover to the point without these trailing deletions and cause the inconsistent DB status.
+
+## 6.19.3 (04/19/2021)
+### Bug Fixes
+* Fixed a bug in handling file rename error in distributed/network file systems when the server succeeds but client returns error. The bug can cause CURRENT file to point to non-existing MANIFEST file, thus DB cannot be opened.
+
+## 6.19.2 (04/08/2021)
+### Bug Fixes
+* Fixed a backward iteration bug with partitioned filter enabled: not including the prefix of the last key of the previous filter partition in current filter partition can cause wrong iteration result.
+
+## 6.19.1 (04/01/2021)
+### Bug Fixes
+* Fixed crash (divide by zero) when compression dictionary is applied to a file containing only range tombstones.
+
 ## 6.19.0 (03/21/2021)
 ### Bug Fixes
 * Fixed the truncation error found in APIs/tools when dumping block-based SST files in a human-readable format. After fix, the block-based table can be fully dumped as a readable file.
--- a/10
+++ b/10
@ -501,6 +501,12 @@ ifeq ($(USE_FOLLY_DISTRIBUTED_MUTEX),1)
  LIB_OBJECTS += $(patsubst %.cpp, $(OBJ_DIR)/%.o, $(FOLLY_SOURCES))
 endif

+# range_tree is not compatible with non GNU libc on ppc64
+# see https://jira.percona.com/browse/PS-7559
+ifneq ($(PPC_LIBC_IS_GNU),0)
+  LIB_OBJECTS += $(patsubst %.cc, $(OBJ_DIR)/%.o, $(RANGE_TREE_SOURCES))
+endif
+
 GTEST = $(OBJ_DIR)/$(GTEST_DIR)/gtest/gtest-all.o
 TESTUTIL = $(OBJ_DIR)/test_util/testutil.o
 TESTHARNESS = $(OBJ_DIR)/test_util/testharness.o $(TESTUTIL) $(GTEST)
@ -2183,8 +2189,8 @@ SNAPPY_DOWNLOAD_BASE ?= https://github.com/google/snappy/archive
 LZ4_VER ?= 1.9.3
 LZ4_SHA256 ?= 030644df4611007ff7dc962d981f390361e6c97a34e5cbc393ddfbe019ffe2c1
 LZ4_DOWNLOAD_BASE ?= https://github.com/lz4/lz4/archive
-ZSTD_VER ?= 1.4.7
-ZSTD_SHA256 ?= 085500c8d0b9c83afbc1dc0d8b4889336ad019eba930c5d6a9c6c86c20c769c8
+ZSTD_VER ?= 1.4.9
+ZSTD_SHA256 ?= acf714d98e3db7b876e5b540cbf6dee298f60eb3c0723104f6d3f065cd60d6a8
 ZSTD_DOWNLOAD_BASE ?= https://github.com/facebook/zstd/archive
 CURL_SSL_OPTS ?= --tlsv1

--- a/16
+++ b/16
@ -87,10 +87,12 @@ ROCKSDB_PREPROCESSOR_FLAGS = [

    # Added missing flags from output of build_detect_platform
    "-DROCKSDB_BACKTRACE",
+]

-    # Directories with files for #include
-    "-I" + REPO_PATH + "include/",
-    "-I" + REPO_PATH,
+# Directories with files for #include
+ROCKSDB_INCLUDE_PATHS = [
+    "",
+    "include",
 ]

 ROCKSDB_ARCH_PREPROCESSOR_FLAGS = {
@ -428,6 +430,7 @@ cpp_library(
    os_deps = ROCKSDB_OS_DEPS,
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = [],
    external_deps = ROCKSDB_EXTERNAL_DEPS,
    link_whole = False,
@ -735,6 +738,7 @@ cpp_library(
    os_deps = ROCKSDB_OS_DEPS,
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = [],
    external_deps = ROCKSDB_EXTERNAL_DEPS,
    link_whole = True,
@ -758,6 +762,7 @@ cpp_library(
    os_deps = ROCKSDB_OS_DEPS,
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = [":rocksdb_lib"],
    external_deps = ROCKSDB_EXTERNAL_DEPS + [
        ("googletest", None, "gtest"),
@ -779,6 +784,7 @@ cpp_library(
    os_deps = ROCKSDB_OS_DEPS,
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = [":rocksdb_lib"],
    external_deps = ROCKSDB_EXTERNAL_DEPS,
    link_whole = False,
@ -806,6 +812,7 @@ cpp_library(
    os_deps = ROCKSDB_OS_DEPS,
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = ROCKSDB_LIB_DEPS,
    external_deps = ROCKSDB_EXTERNAL_DEPS,
 )
@ -817,6 +824,7 @@ cpp_binary(
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    compiler_flags = ROCKSDB_COMPILER_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = [":rocksdb_test_lib"],
 ) if not is_opt_mode else None

@ -838,6 +846,7 @@ cpp_library(
    os_deps = ROCKSDB_OS_DEPS,
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = [":rocksdb_test_lib"],
    external_deps = ROCKSDB_EXTERNAL_DEPS,
    link_whole = False,
@ -2118,6 +2127,7 @@ ROCKS_TESTS = [
        os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
        compiler_flags = ROCKSDB_COMPILER_FLAGS + extra_compiler_flags,
        preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+        include_paths = ROCKSDB_INCLUDE_PATHS,
        deps = [":rocksdb_test_lib"] + extra_deps,
        external_deps = ROCKSDB_EXTERNAL_DEPS + [
            ("googletest", None, "gtest"),
--- a/buckifier/buckify_rocksdb.py
+++ b/buckifier/buckify_rocksdb.py
@ -141,11 +141,15 @@ def generate_targets(repo_path, deps_map):
    TARGETS.add_library(
        "rocksdb_lib",
        src_mk["LIB_SOURCES"] +
+        # always add range_tree, it's only excluded on ppc64, which we don't use internally
+        src_mk["RANGE_TREE_SOURCES"] +
        src_mk["TOOL_LIB_SOURCES"])
    # rocksdb_whole_archive_lib
    TARGETS.add_library(
        "rocksdb_whole_archive_lib",
        src_mk["LIB_SOURCES"] +
+        # always add range_tree, it's only excluded on ppc64, which we don't use internally
+        src_mk["RANGE_TREE_SOURCES"] +
        src_mk["TOOL_LIB_SOURCES"],
        deps=None,
        headers=None,
--- a/buckifier/targets_builder.py
+++ b/buckifier/targets_builder.py
@ -87,6 +87,7 @@ cpp_binary(
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    compiler_flags = ROCKSDB_COMPILER_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = [":rocksdb_test_lib"],
 ) if not is_opt_mode else None

--- a/buckifier/targets_cfg.py
+++ b/buckifier/targets_cfg.py
@ -94,10 +94,12 @@ ROCKSDB_PREPROCESSOR_FLAGS = [

    # Added missing flags from output of build_detect_platform
    "-DROCKSDB_BACKTRACE",
+]

-    # Directories with files for #include
-    "-I" + REPO_PATH + "include/",
-    "-I" + REPO_PATH,
+# Directories with files for #include
+ROCKSDB_INCLUDE_PATHS = [
+    "",
+    "include",
 ]

 ROCKSDB_ARCH_PREPROCESSOR_FLAGS = {{
@ -145,6 +147,7 @@ cpp_library(
    os_deps = ROCKSDB_OS_DEPS,
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = [{deps}],
    external_deps = ROCKSDB_EXTERNAL_DEPS{extra_external_deps},
    link_whole = {link_whole},
@ -161,6 +164,7 @@ cpp_library(
    os_deps = ROCKSDB_OS_DEPS,
    os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = ROCKSDB_LIB_DEPS,
    external_deps = ROCKSDB_EXTERNAL_DEPS,
 )
@ -173,6 +177,7 @@ cpp_binary(
    arch_preprocessor_flags = ROCKSDB_ARCH_PREPROCESSOR_FLAGS,
    compiler_flags = ROCKSDB_COMPILER_FLAGS,
    preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+    include_paths = ROCKSDB_INCLUDE_PATHS,
    deps = [{deps}],
    external_deps = ROCKSDB_EXTERNAL_DEPS,
 )
@ -203,6 +208,7 @@ ROCKS_TESTS = [
        os_preprocessor_flags = ROCKSDB_OS_PREPROCESSOR_FLAGS,
        compiler_flags = ROCKSDB_COMPILER_FLAGS + extra_compiler_flags,
        preprocessor_flags = ROCKSDB_PREPROCESSOR_FLAGS,
+        include_paths = ROCKSDB_INCLUDE_PATHS,
        deps = [":rocksdb_test_lib"] + extra_deps,
        external_deps = ROCKSDB_EXTERNAL_DEPS + [
            ("googletest", None, "gtest"),
--- a/build_tools/build_detect_platform
+++ b/build_tools/build_detect_platform
@ -668,6 +668,23 @@ else
  fi
 fi

+if test -n "`echo $TARGET_ARCHITECTURE | grep ^ppc64`"; then
+  # check for GNU libc on ppc64
+  $CXX -x c++ - -o /dev/null 2>/dev/null <<EOF
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <gnu/libc-version.h>
+
+    int main(int argc, char *argv[]) {
+      printf("GNU libc version: %s\n", gnu_get_libc_version());
+      return 0;
+    }
+EOF
+  if [ "$?" != 0 ]; then
+      PPC_LIBC_IS_GNU=0
+  fi
+fi
+
 if test "$TRY_SSE_ETC"; then
  # The USE_SSE flag now means "attempt to compile with widely-available
  # Intel architecture extensions utilized by specific optimizations in the
@ -861,3 +878,6 @@ echo "LUA_PATH=$LUA_PATH" >> "$OUTPUT"
 if test -n "$USE_FOLLY_DISTRIBUTED_MUTEX"; then
  echo "USE_FOLLY_DISTRIBUTED_MUTEX=$USE_FOLLY_DISTRIBUTED_MUTEX" >> "$OUTPUT"
 fi
+if test -n "$PPC_LIBC_IS_GNU"; then
+  echo "PPC_LIBC_IS_GNU=$PPC_LIBC_IS_GNU" >> "$OUTPUT"
+fi
--- a/db/db_bloom_filter_test.cc
+++ b/db/db_bloom_filter_test.cc
@ -7,6 +7,9 @@
 // Use of this source code is governed by a BSD-style license that can be
 // found in the LICENSE file. See the AUTHORS file for names of contributors.

+#include <iomanip>
+#include <sstream>
+
 #include "db/db_test_util.h"
 #include "options/options_helper.h"
 #include "port/stack_trace.h"
@ -2117,6 +2120,54 @@ TEST_F(DBBloomFilterTest, DynamicBloomFilterOptions) {
  }
 }

+TEST_F(DBBloomFilterTest, SeekForPrevWithPartitionedFilters) {
+  Options options = CurrentOptions();
+  constexpr size_t kNumKeys = 10000;
+  static_assert(kNumKeys <= 10000, "kNumKeys have to be <= 10000");
+  options.memtable_factory.reset(new SpecialSkipListFactory(kNumKeys + 10));
+  options.create_if_missing = true;
+  constexpr size_t kPrefixLength = 4;
+  options.prefix_extractor.reset(NewFixedPrefixTransform(kPrefixLength));
+  options.compression = kNoCompression;
+  BlockBasedTableOptions bbto;
+  bbto.filter_policy.reset(NewBloomFilterPolicy(50));
+  bbto.index_shortening =
+      BlockBasedTableOptions::IndexShorteningMode::kNoShortening;
+  bbto.block_size = 128;
+  bbto.metadata_block_size = 128;
+  bbto.partition_filters = true;
+  bbto.index_type = BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
+  options.table_factory.reset(NewBlockBasedTableFactory(bbto));
+  DestroyAndReopen(options);
+
+  const std::string value(64, '\0');
+
+  WriteOptions write_opts;
+  write_opts.disableWAL = true;
+  for (size_t i = 0; i < kNumKeys; ++i) {
+    std::ostringstream oss;
+    oss << std::setfill('0') << std::setw(4) << std::fixed << i;
+    ASSERT_OK(db_->Put(write_opts, oss.str(), value));
+  }
+  ASSERT_OK(Flush());
+
+  ReadOptions read_opts;
+  // Use legacy, implicit prefix seek
+  read_opts.total_order_seek = false;
+  read_opts.auto_prefix_mode = false;
+  std::unique_ptr<Iterator> it(db_->NewIterator(read_opts));
+  for (size_t i = 0; i < kNumKeys; ++i) {
+    // Seek with a key after each one added but with same prefix. One will
+    // surely cross a partition boundary.
+    std::ostringstream oss;
+    oss << std::setfill('0') << std::setw(4) << std::fixed << i << "a";
+    it->SeekForPrev(oss.str());
+    ASSERT_OK(it->status());
+    ASSERT_TRUE(it->Valid());
+  }
+  it.reset();
+}
+
 #endif  // ROCKSDB_LITE

 }  // namespace ROCKSDB_NAMESPACE
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@ -260,18 +260,16 @@ Status DBImpl::FlushMemTableToOutputFile(
      // be pessimistic and try write to a new MANIFEST.
      // TODO: distinguish between MANIFEST write and CURRENT renaming
      if (!versions_->io_status().ok()) {
-        if (total_log_size_ > 0) {
-          // If the WAL is empty, we use different error reason
-          error_handler_.SetBGError(io_s,
-                                    BackgroundErrorReason::kManifestWrite);
-        } else {
-          error_handler_.SetBGError(io_s,
-                                    BackgroundErrorReason::kManifestWriteNoWAL);
-        }
-      } else if (total_log_size_ > 0 || !log_io_s.ok()) {
-        error_handler_.SetBGError(io_s, BackgroundErrorReason::kFlush);
+        // If WAL sync is successful (either WAL size is 0 or there is no IO
+        // error), all the Manifest write will be map to soft error.
+        // TODO: kManifestWriteNoWAL and kFlushNoWAL are misleading. Refactor is
+        // needed.
+        error_handler_.SetBGError(io_s,
+                                  BackgroundErrorReason::kManifestWriteNoWAL);
      } else {
-        // If the WAL is empty, we use different error reason
+        // If WAL sync is successful (either WAL size is 0 or there is no IO
+        // error), all the other SST file write errors will be set as
+        // kFlushNoWAL.
        error_handler_.SetBGError(io_s, BackgroundErrorReason::kFlushNoWAL);
      }
    } else {
@ -687,18 +685,16 @@ Status DBImpl::AtomicFlushMemTablesToOutputFiles(
      // be pessimistic and try write to a new MANIFEST.
      // TODO: distinguish between MANIFEST write and CURRENT renaming
      if (!versions_->io_status().ok()) {
-        if (total_log_size_ > 0) {
-          // If the WAL is empty, we use different error reason
-          error_handler_.SetBGError(io_s,
-                                    BackgroundErrorReason::kManifestWrite);
-        } else {
-          error_handler_.SetBGError(io_s,
-                                    BackgroundErrorReason::kManifestWriteNoWAL);
-        }
-      } else if (total_log_size_ > 0) {
-        error_handler_.SetBGError(io_s, BackgroundErrorReason::kFlush);
+        // If WAL sync is successful (either WAL size is 0 or there is no IO
+        // error), all the Manifest write will be map to soft error.
+        // TODO: kManifestWriteNoWAL and kFlushNoWAL are misleading. Refactor
+        // is needed.
+        error_handler_.SetBGError(io_s,
+                                  BackgroundErrorReason::kManifestWriteNoWAL);
      } else {
-        // If the WAL is empty, we use different error reason
+        // If WAL sync is successful (either WAL size is 0 or there is no IO
+        // error), all the other SST file write errors will be set as
+        // kFlushNoWAL.
        error_handler_.SetBGError(io_s, BackgroundErrorReason::kFlushNoWAL);
      }
    } else {
@ -2567,6 +2563,8 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) {

  LogBuffer log_buffer(InfoLogLevel::INFO_LEVEL,
                       immutable_db_options_.info_log.get());
+  TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:Start:1");
+  TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:Start:2");
  {
    InstrumentedMutexLock l(&mutex_);
    assert(bg_flush_scheduled_);
--- a/db/db_impl/db_impl_files.cc
+++ b/db/db_impl/db_impl_files.cc
@ -937,7 +937,7 @@ Status DBImpl::DeleteUnreferencedSstFiles() {
    return s;
  }

-  if (largest_file_number > next_file_number) {
+  if (largest_file_number >= next_file_number) {
    versions_->next_file_number_.store(largest_file_number + 1);
  }

--- a/db/db_impl/db_impl_open.cc
+++ b/db/db_impl/db_impl_open.cc
@ -283,6 +283,9 @@ Status DBImpl::NewDB(std::vector<std::string>* new_filenames) {
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Creating manifest 1 \n");
  const std::string manifest = DescriptorFileName(dbname_, 1);
  {
+    if (fs_->FileExists(manifest, IOOptions(), nullptr).ok()) {
+      fs_->DeleteFile(manifest, IOOptions(), nullptr).PermitUncheckedError();
+    }
    std::unique_ptr<FSWritableFile> file;
    FileOptions file_options = fs_->OptimizeForManifestWrite(file_options_);
    s = NewWritableFile(fs_.get(), manifest, &file, file_options);
@ -312,7 +315,7 @@ Status DBImpl::NewDB(std::vector<std::string>* new_filenames) {
          manifest.substr(manifest.find_last_of("/\\") + 1));
    }
  } else {
-    fs_->DeleteFile(manifest, IOOptions(), nullptr);
+    fs_->DeleteFile(manifest, IOOptions(), nullptr).PermitUncheckedError();
  }
  return s;
 }
@ -1132,11 +1135,29 @@ Status DBImpl::RecoverLogFiles(const std::vector<uint64_t>& wal_numbers,
       immutable_db_options_.wal_recovery_mode ==
           WALRecoveryMode::kTolerateCorruptedTailRecords)) {
    for (auto cfd : *versions_->GetColumnFamilySet()) {
-      if (cfd->GetLogNumber() > corrupted_wal_number) {
+      // One special case cause cfd->GetLogNumber() > corrupted_wal_number but
+      // the CF is still consistent: If a new column family is created during
+      // the flush and the WAL sync fails at the same time, the new CF points to
+      // the new WAL but the old WAL is curropted. Since the new CF is empty, it
+      // is still consistent. We add the check of CF sst file size to avoid the
+      // false positive alert.
+
+      // Note that, the check of (cfd->GetLiveSstFilesSize() > 0) may leads to
+      // the ignorance of a very rare inconsistency case caused in data
+      // canclation. One CF is empty due to KV deletion. But those operations
+      // are in the WAL. If the WAL is corrupted, the status of this CF might
+      // not be consistent with others. However, the consistency check will be
+      // bypassed due to empty CF.
+      // TODO: a better and complete implementation is needed to ensure strict
+      // consistency check in WAL recovery including hanlding the tailing
+      // issues.
+      if (cfd->GetLogNumber() > corrupted_wal_number &&
+          cfd->GetLiveSstFilesSize() > 0) {
        ROCKS_LOG_ERROR(immutable_db_options_.info_log,
                        "Column family inconsistency: SST file contains data"
                        " beyond the point of corruption.");
-        return Status::Corruption("SST file is ahead of WALs");
+        return Status::Corruption("SST file is ahead of WALs in CF " +
+                                  cfd->GetName());
      }
    }
  }
--- a/db/db_range_del_test.cc
+++ b/db/db_range_del_test.cc
@ -73,6 +73,15 @@ TEST_F(DBRangeDelTest, FlushOutputHasOnlyRangeTombstones) {
  } while (ChangeOptions(kRangeDelSkipConfigs));
 }

+TEST_F(DBRangeDelTest, DictionaryCompressionWithOnlyRangeTombstones) {
+  Options opts = CurrentOptions();
+  opts.compression_opts.max_dict_bytes = 16384;
+  Reopen(opts);
+  ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), "dr1",
+                             "dr2"));
+  ASSERT_OK(db_->Flush(FlushOptions()));
+}
+
 TEST_F(DBRangeDelTest, CompactionOutputHasOnlyRangeTombstone) {
  do {
    Options opts = CurrentOptions();
--- a/db/db_test2.cc
+++ b/db/db_test2.cc
@ -5428,6 +5428,98 @@ TEST_F(DBTest2, AutoPrefixMode1) {
    ASSERT_EQ("a1", iterator->key().ToString());
  }
 }
+
+class RenameCurrentTest : public DBTestBase,
+                          public testing::WithParamInterface<std::string> {
+ public:
+  RenameCurrentTest()
+      : DBTestBase("rename_current_test", /*env_do_fsync=*/true),
+        sync_point_(GetParam()) {}
+
+  ~RenameCurrentTest() override {}
+
+  void SetUp() override {
+    env_->no_file_overwrite_.store(true, std::memory_order_release);
+  }
+
+  void TearDown() override {
+    env_->no_file_overwrite_.store(false, std::memory_order_release);
+  }
+
+  void SetupSyncPoints() {
+    SyncPoint::GetInstance()->DisableProcessing();
+    SyncPoint::GetInstance()->SetCallBack(sync_point_, [&](void* arg) {
+      Status* s = reinterpret_cast<Status*>(arg);
+      assert(s);
+      *s = Status::IOError("Injected IO error.");
+    });
+  }
+
+  const std::string sync_point_;
+};
+
+INSTANTIATE_TEST_CASE_P(DistributedFS, RenameCurrentTest,
+                        ::testing::Values("SetCurrentFile:BeforeRename",
+                                          "SetCurrentFile:AfterRename"));
+
+TEST_P(RenameCurrentTest, Open) {
+  Destroy(last_options_);
+  Options options = GetDefaultOptions();
+  options.create_if_missing = true;
+  SetupSyncPoints();
+  SyncPoint::GetInstance()->EnableProcessing();
+  Status s = TryReopen(options);
+  ASSERT_NOK(s);
+
+  SyncPoint::GetInstance()->DisableProcessing();
+  Reopen(options);
+}
+
+TEST_P(RenameCurrentTest, Flush) {
+  Destroy(last_options_);
+  Options options = GetDefaultOptions();
+  options.max_manifest_file_size = 1;
+  options.create_if_missing = true;
+  Reopen(options);
+  ASSERT_OK(Put("key", "value"));
+  SetupSyncPoints();
+  SyncPoint::GetInstance()->EnableProcessing();
+  ASSERT_NOK(Flush());
+
+  ASSERT_NOK(Put("foo", "value"));
+
+  SyncPoint::GetInstance()->DisableProcessing();
+  Reopen(options);
+  ASSERT_EQ("value", Get("key"));
+  ASSERT_EQ("NOT_FOUND", Get("foo"));
+}
+
+TEST_P(RenameCurrentTest, Compaction) {
+  Destroy(last_options_);
+  Options options = GetDefaultOptions();
+  options.max_manifest_file_size = 1;
+  options.create_if_missing = true;
+  Reopen(options);
+  ASSERT_OK(Put("a", "a_value"));
+  ASSERT_OK(Put("c", "c_value"));
+  ASSERT_OK(Flush());
+
+  ASSERT_OK(Put("b", "b_value"));
+  ASSERT_OK(Put("d", "d_value"));
+  ASSERT_OK(Flush());
+
+  SetupSyncPoints();
+  SyncPoint::GetInstance()->EnableProcessing();
+  ASSERT_NOK(db_->CompactRange(CompactRangeOptions(), /*begin=*/nullptr,
+                               /*end=*/nullptr));
+
+  ASSERT_NOK(Put("foo", "value"));
+
+  SyncPoint::GetInstance()->DisableProcessing();
+  Reopen(options);
+  ASSERT_EQ("NOT_FOUND", Get("foo"));
+  ASSERT_EQ("d_value", Get("d"));
+}
 #endif  // ROCKSDB_LITE

 // WAL recovery mode is WALRecoveryMode::kPointInTimeRecovery.
@ -5455,6 +5547,35 @@ TEST_F(DBTest2, PointInTimeRecoveryWithIOErrorWhileReadingWal) {
  Status s = TryReopen(options);
  ASSERT_TRUE(s.IsIOError());
 }
+
+TEST_F(DBTest2, PointInTimeRecoveryWithSyncFailureInCFCreation) {
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
+      {{"DBImpl::BackgroundCallFlush:Start:1",
+        "PointInTimeRecoveryWithSyncFailureInCFCreation:1"},
+       {"PointInTimeRecoveryWithSyncFailureInCFCreation:2",
+        "DBImpl::BackgroundCallFlush:Start:2"}});
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
+
+  CreateColumnFamilies({"test1"}, Options());
+  ASSERT_OK(Put("foo", "bar"));
+
+  // Creating a CF when a flush is going on, log is synced but the
+  // closed log file is not synced and corrupted.
+  port::Thread flush_thread([&]() { ASSERT_NOK(Flush()); });
+  TEST_SYNC_POINT("PointInTimeRecoveryWithSyncFailureInCFCreation:1");
+  CreateColumnFamilies({"test2"}, Options());
+  env_->corrupt_in_sync_ = true;
+  TEST_SYNC_POINT("PointInTimeRecoveryWithSyncFailureInCFCreation:2");
+  flush_thread.join();
+  env_->corrupt_in_sync_ = false;
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
+
+  // Reopening the DB should not corrupt anything
+  Options options = CurrentOptions();
+  options.wal_recovery_mode = WALRecoveryMode::kPointInTimeRecovery;
+  ReopenWithColumnFamilies({"default", "test1", "test2"}, options);
+}
+
 }  // namespace ROCKSDB_NAMESPACE

 #ifdef ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS
--- a/db/db_test_util.cc
+++ b/db/db_test_util.cc
@ -44,6 +44,7 @@ SpecialEnv::SpecialEnv(Env* base, bool time_elapse_only_sleep)
  manifest_sync_error_.store(false, std::memory_order_release);
  manifest_write_error_.store(false, std::memory_order_release);
  log_write_error_.store(false, std::memory_order_release);
+  no_file_overwrite_.store(false, std::memory_order_release);
  random_file_open_counter_.store(0, std::memory_order_relaxed);
  delete_count_.store(0, std::memory_order_relaxed);
  num_open_wal_file_.store(0);
--- a/db/db_test_util.h
+++ b/db/db_test_util.h
@ -393,6 +393,10 @@ class SpecialEnv : public EnvWrapper {
      Status Flush() override { return base_->Flush(); }
      Status Sync() override {
        ++env_->sync_counter_;
+        if (env_->corrupt_in_sync_) {
+          Append(std::string(33000, ' '));
+          return Status::IOError("Ingested Sync Failure");
+        }
        if (env_->skip_fsync_) {
          return Status::OK();
        } else {
@ -440,6 +444,11 @@ class SpecialEnv : public EnvWrapper {
      std::unique_ptr<WritableFile> base_;
    };

+    if (no_file_overwrite_.load(std::memory_order_acquire) &&
+        target()->FileExists(f).ok()) {
+      return Status::NotSupported("SpecialEnv::no_file_overwrite_ is true.");
+    }
+
    if (non_writeable_rate_.load(std::memory_order_acquire) > 0) {
      uint32_t random_number;
      {
@ -687,6 +696,9 @@ class SpecialEnv : public EnvWrapper {
  // Slow down every log write, in micro-seconds.
  std::atomic<int> log_write_slowdown_;

+  // If true, returns Status::NotSupported for file overwrite.
+  std::atomic<bool> no_file_overwrite_;
+
  // Number of WAL files that are still open for write.
  std::atomic<int> num_open_wal_file_;

@ -709,6 +721,9 @@ class SpecialEnv : public EnvWrapper {
  // If true, all fsync to files and directories are skipped.
  bool skip_fsync_ = false;

+  // If true, ingest the corruption to file during sync.
+  bool corrupt_in_sync_ = false;
+
  std::atomic<uint32_t> non_writeable_rate_;

  std::atomic<uint32_t> new_writable_count_;
--- a/db/error_handler_fs_test.cc
+++ b/db/error_handler_fs_test.cc
@ -216,7 +216,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWriteRetryableError) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
  s = dbfull()->Resume();
@ -242,7 +242,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWriteRetryableError) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
  s = dbfull()->Resume();
@ -256,7 +256,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWriteRetryableError) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
  s = dbfull()->Resume();
@ -292,7 +292,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWriteFileScopeError) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
  s = dbfull()->Resume();
@ -306,7 +306,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWriteFileScopeError) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
  s = dbfull()->Resume();
@ -320,7 +320,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWriteFileScopeError) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
  s = dbfull()->Resume();
@ -340,7 +340,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWriteFileScopeError) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
  s = dbfull()->Resume();
@ -649,7 +649,7 @@ TEST_F(DBErrorHandlingFSTest, ManifestWriteRetryableError) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->ClearAllCallBacks();
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
@ -695,7 +695,7 @@ TEST_F(DBErrorHandlingFSTest, ManifestWriteFileScopeError) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->ClearAllCallBacks();
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
@ -1698,7 +1698,7 @@ TEST_F(DBErrorHandlingFSTest, MultiDBVariousErrors) {
 // to soft error and trigger auto resume. During auto resume, SwitchMemtable
 // is disabled to avoid small SST tables. Write can still be applied before
 // the bg error is cleaned unless the memtable is full.
-TEST_F(DBErrorHandlingFSTest, FLushWritNoWALRetryableeErrorAutoRecover1) {
+TEST_F(DBErrorHandlingFSTest, FLushWritNoWALRetryableErrorAutoRecover1) {
  // Activate the FS before the first resume
  std::shared_ptr<ErrorHandlerFSListener> listener(
      new ErrorHandlerFSListener());
@ -1768,7 +1768,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWritNoWALRetryableeErrorAutoRecover1) {
  Destroy(options);
 }

-TEST_F(DBErrorHandlingFSTest, FLushWritNoWALRetryableeErrorAutoRecover2) {
+TEST_F(DBErrorHandlingFSTest, FLushWritNoWALRetryableErrorAutoRecover2) {
  // Activate the FS before the first resume
  std::shared_ptr<ErrorHandlerFSListener> listener(
      new ErrorHandlerFSListener());
@ -1810,14 +1810,14 @@ TEST_F(DBErrorHandlingFSTest, FLushWritNoWALRetryableeErrorAutoRecover2) {
                   ERROR_HANDLER_BG_RETRYABLE_IO_ERROR_COUNT));
  ASSERT_EQ(1, options.statistics->getAndResetTickerCount(
                   ERROR_HANDLER_AUTORESUME_COUNT));
-  ASSERT_EQ(1, options.statistics->getAndResetTickerCount(
+  ASSERT_LE(0, options.statistics->getAndResetTickerCount(
                   ERROR_HANDLER_AUTORESUME_RETRY_TOTAL_COUNT));
-  ASSERT_EQ(1, options.statistics->getAndResetTickerCount(
+  ASSERT_LE(0, options.statistics->getAndResetTickerCount(
                   ERROR_HANDLER_AUTORESUME_SUCCESS_COUNT));
  HistogramData autoresume_retry;
  options.statistics->histogramData(ERROR_HANDLER_AUTORESUME_RETRY_COUNT,
                                    &autoresume_retry);
-  ASSERT_EQ(autoresume_retry.max, 1);
+  ASSERT_GE(autoresume_retry.max, 0);
  ASSERT_OK(Put(Key(2), "val2", wo));
  s = Flush();
  // Since auto resume is successful, the bg error is cleaned, flush will
@ -1827,56 +1827,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWritNoWALRetryableeErrorAutoRecover2) {
  Destroy(options);
 }

-TEST_F(DBErrorHandlingFSTest, DISABLED_FLushWritRetryableeErrorAutoRecover1) {
-  // Fail the first resume and make the second resume successful
-  std::shared_ptr<ErrorHandlerFSListener> listener(
-      new ErrorHandlerFSListener());
-  Options options = GetDefaultOptions();
-  options.env = fault_env_.get();
-  options.create_if_missing = true;
-  options.listeners.emplace_back(listener);
-  options.max_bgerror_resume_count = 2;
-  options.bgerror_resume_retry_interval = 100000;  // 0.1 second
-  Status s;
-
-  listener->EnableAutoRecovery(false);
-  DestroyAndReopen(options);
-
-  IOStatus error_msg = IOStatus::IOError("Retryable IO Error");
-  error_msg.SetRetryable(true);
-
-  ASSERT_OK(Put(Key(1), "val1"));
-  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
-      {{"RecoverFromRetryableBGIOError:BeforeWait0",
-        "FLushWritRetryableeErrorAutoRecover1:0"},
-       {"FLushWritRetryableeErrorAutoRecover1:1",
-        "RecoverFromRetryableBGIOError:BeforeWait1"},
-       {"RecoverFromRetryableBGIOError:RecoverSuccess",
-        "FLushWritRetryableeErrorAutoRecover1:2"}});
-  SyncPoint::GetInstance()->SetCallBack(
-      "BuildTable:BeforeFinishBuildTable",
-      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
-  SyncPoint::GetInstance()->EnableProcessing();
-  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover1:0");
-  fault_fs_->SetFilesystemActive(true);
-  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover1:1");
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover1:2");
-  SyncPoint::GetInstance()->DisableProcessing();
-
-  ASSERT_EQ("val1", Get(Key(1)));
-  Reopen(options);
-  ASSERT_EQ("val1", Get(Key(1)));
-  ASSERT_OK(Put(Key(2), "val2"));
-  ASSERT_OK(Flush());
-  ASSERT_EQ("val2", Get(Key(2)));
-
-  Destroy(options);
-}
-
-TEST_F(DBErrorHandlingFSTest, FLushWritRetryableeErrorAutoRecover2) {
+TEST_F(DBErrorHandlingFSTest, FLushWritRetryableErrorAutoRecover1) {
  // Activate the FS before the first resume
  std::shared_ptr<ErrorHandlerFSListener> listener(
      new ErrorHandlerFSListener());
@ -1901,7 +1852,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWritRetryableeErrorAutoRecover2) {

  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  SyncPoint::GetInstance()->DisableProcessing();
  fault_fs_->SetFilesystemActive(true);
  ASSERT_EQ(listener->WaitForRecovery(5000000), true);
@ -1916,7 +1867,7 @@ TEST_F(DBErrorHandlingFSTest, FLushWritRetryableeErrorAutoRecover2) {
  Destroy(options);
 }

-TEST_F(DBErrorHandlingFSTest, FLushWritRetryableeErrorAutoRecover3) {
+TEST_F(DBErrorHandlingFSTest, FLushWritRetryableErrorAutoRecover2) {
  // Fail all the resume and let user to resume
  std::shared_ptr<ErrorHandlerFSListener> listener(
      new ErrorHandlerFSListener());
@ -1936,18 +1887,18 @@ TEST_F(DBErrorHandlingFSTest, FLushWritRetryableeErrorAutoRecover3) {

  ASSERT_OK(Put(Key(1), "val1"));
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
-      {{"FLushWritRetryableeErrorAutoRecover3:0",
+      {{"FLushWritRetryableeErrorAutoRecover2:0",
        "RecoverFromRetryableBGIOError:BeforeStart"},
       {"RecoverFromRetryableBGIOError:LoopOut",
-        "FLushWritRetryableeErrorAutoRecover3:1"}});
+        "FLushWritRetryableeErrorAutoRecover2:1"}});
  SyncPoint::GetInstance()->SetCallBack(
      "BuildTable:BeforeFinishBuildTable",
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover3:0");
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover3:1");
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
+  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover2:0");
+  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover2:1");
  fault_fs_->SetFilesystemActive(true);
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
  SyncPoint::GetInstance()->DisableProcessing();
@ -1965,173 +1916,6 @@ TEST_F(DBErrorHandlingFSTest, FLushWritRetryableeErrorAutoRecover3) {
  Destroy(options);
 }

-TEST_F(DBErrorHandlingFSTest, DISABLED_FLushWritRetryableeErrorAutoRecover4) {
-  // Fail the first resume and does not do resume second time because
-  // the IO error severity is Fatal Error and not Retryable.
-  std::shared_ptr<ErrorHandlerFSListener> listener(
-      new ErrorHandlerFSListener());
-  Options options = GetDefaultOptions();
-  options.env = fault_env_.get();
-  options.create_if_missing = true;
-  options.listeners.emplace_back(listener);
-  options.max_bgerror_resume_count = 2;
-  options.bgerror_resume_retry_interval = 10;  // 0.1 second
-  Status s;
-
-  listener->EnableAutoRecovery(false);
-  DestroyAndReopen(options);
-
-  IOStatus error_msg = IOStatus::IOError("Retryable IO Error");
-  error_msg.SetRetryable(true);
-  IOStatus nr_msg = IOStatus::IOError("No Retryable Fatal IO Error");
-  nr_msg.SetRetryable(false);
-
-  ASSERT_OK(Put(Key(1), "val1"));
-  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
-      {{"RecoverFromRetryableBGIOError:BeforeStart",
-        "FLushWritRetryableeErrorAutoRecover4:0"},
-       {"FLushWritRetryableeErrorAutoRecover4:2",
-        "RecoverFromRetryableBGIOError:RecoverFail0"}});
-  SyncPoint::GetInstance()->SetCallBack(
-      "BuildTable:BeforeFinishBuildTable",
-      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
-  SyncPoint::GetInstance()->SetCallBack(
-      "RecoverFromRetryableBGIOError:BeforeResume1",
-      [&](void*) { fault_fs_->SetFilesystemActive(false, nr_msg); });
-
-  SyncPoint::GetInstance()->EnableProcessing();
-  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover4:0");
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover4:2");
-  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
-  SyncPoint::GetInstance()->DisableProcessing();
-  fault_fs_->SetFilesystemActive(true);
-  // Even the FS is recoverd, due to the Fatal Error in bg_error_ the resume
-  // and flush will all fail.
-  ASSERT_EQ("val1", Get(Key(1)));
-  ASSERT_NOK(dbfull()->Resume());
-  ASSERT_EQ("val1", Get(Key(1)));
-  ASSERT_OK(Put(Key(2), "val2"));
-  ASSERT_NOK(Flush());
-  ASSERT_EQ("NOT_FOUND", Get(Key(2)));
-
-  Reopen(options);
-  ASSERT_EQ("val1", Get(Key(1)));
-  ASSERT_OK(Put(Key(2), "val2"));
-  ASSERT_OK(Flush());
-  ASSERT_EQ("val2", Get(Key(2)));
-
-  Destroy(options);
-}
-
-TEST_F(DBErrorHandlingFSTest, DISABLED_FLushWritRetryableeErrorAutoRecover5) {
-  // During the resume, call DB->CLose, make sure the resume thread exist
-  // before close continues. Due to the shutdown, the resume is not successful
-  // and the FS does not become active, so close status is still IO error
-  std::shared_ptr<ErrorHandlerFSListener> listener(
-      new ErrorHandlerFSListener());
-  Options options = GetDefaultOptions();
-  options.env = fault_env_.get();
-  options.create_if_missing = true;
-  options.listeners.emplace_back(listener);
-  options.max_bgerror_resume_count = 2;
-  options.bgerror_resume_retry_interval = 10;  // 0.1 second
-  Status s;
-
-  listener->EnableAutoRecovery(false);
-  DestroyAndReopen(options);
-
-  IOStatus error_msg = IOStatus::IOError("Retryable IO Error");
-  error_msg.SetRetryable(true);
-
-  ASSERT_OK(Put(Key(1), "val1"));
-  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
-      {{"RecoverFromRetryableBGIOError:BeforeStart",
-        "FLushWritRetryableeErrorAutoRecover5:0"}});
-  SyncPoint::GetInstance()->SetCallBack(
-      "BuildTable:BeforeFinishBuildTable",
-      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
-  SyncPoint::GetInstance()->EnableProcessing();
-  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover5:0");
-  // The first resume will cause recovery_error and its severity is the
-  // Fatal error
-  s = dbfull()->Close();
-  ASSERT_NOK(s);
-  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
-  SyncPoint::GetInstance()->DisableProcessing();
-  fault_fs_->SetFilesystemActive(true);
-
-  Reopen(options);
-  ASSERT_NE("val1", Get(Key(1)));
-  ASSERT_OK(Put(Key(2), "val2"));
-  s = Flush();
-  ASSERT_OK(s);
-  ASSERT_EQ("val2", Get(Key(2)));
-
-  Destroy(options);
-}
-
-TEST_F(DBErrorHandlingFSTest, FLushWritRetryableeErrorAutoRecover6) {
-  // During the resume, call DB->CLose, make sure the resume thread exist
-  // before close continues. Due to the shutdown, the resume is not successful
-  // and the FS does not become active, so close status is still IO error
-  std::shared_ptr<ErrorHandlerFSListener> listener(
-      new ErrorHandlerFSListener());
-  Options options = GetDefaultOptions();
-  options.env = fault_env_.get();
-  options.create_if_missing = true;
-  options.listeners.emplace_back(listener);
-  options.max_bgerror_resume_count = 2;
-  options.bgerror_resume_retry_interval = 10;  // 0.1 second
-  Status s;
-
-  listener->EnableAutoRecovery(false);
-  DestroyAndReopen(options);
-
-  IOStatus error_msg = IOStatus::IOError("Retryable IO Error");
-  error_msg.SetRetryable(true);
-
-  ASSERT_OK(Put(Key(1), "val1"));
-  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
-      {{"FLushWritRetryableeErrorAutoRecover6:0",
-        "RecoverFromRetryableBGIOError:BeforeStart"},
-       {"RecoverFromRetryableBGIOError:BeforeWait0",
-        "FLushWritRetryableeErrorAutoRecover6:1"},
-       {"FLushWritRetryableeErrorAutoRecover6:2",
-        "RecoverFromRetryableBGIOError:BeforeWait1"},
-       {"RecoverFromRetryableBGIOError:AfterWait0",
-        "FLushWritRetryableeErrorAutoRecover6:3"}});
-  SyncPoint::GetInstance()->SetCallBack(
-      "BuildTable:BeforeFinishBuildTable",
-      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
-  SyncPoint::GetInstance()->EnableProcessing();
-  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover6:0");
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover6:1");
-  fault_fs_->SetFilesystemActive(true);
-  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover6:2");
-  TEST_SYNC_POINT("FLushWritRetryableeErrorAutoRecover6:3");
-  // The first resume will cause recovery_error and its severity is the
-  // Fatal error
-  s = dbfull()->Close();
-  ASSERT_OK(s);
-  SyncPoint::GetInstance()->DisableProcessing();
-
-  Reopen(options);
-  ASSERT_EQ("val1", Get(Key(1)));
-  ASSERT_OK(Put(Key(2), "val2"));
-  s = Flush();
-  ASSERT_OK(s);
-  ASSERT_EQ("val2", Get(Key(2)));
-
-  Destroy(options);
-}
-
 TEST_F(DBErrorHandlingFSTest, ManifestWriteRetryableErrorAutoRecover) {
  // Fail the first resume and let the second resume be successful
  std::shared_ptr<ErrorHandlerFSListener> listener(
@ -2168,7 +1952,7 @@ TEST_F(DBErrorHandlingFSTest, ManifestWriteRetryableErrorAutoRecover) {
      [&](void*) { fault_fs_->SetFilesystemActive(false, error_msg); });
  SyncPoint::GetInstance()->EnableProcessing();
  s = Flush();
-  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kHardError);
+  ASSERT_EQ(s.severity(), ROCKSDB_NAMESPACE::Status::Severity::kSoftError);
  TEST_SYNC_POINT("ManifestWriteRetryableErrorAutoRecover:0");
  fault_fs_->SetFilesystemActive(true);
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
--- a/db/external_sst_file_basic_test.cc
+++ b/db/external_sst_file_basic_test.cc
@ -1542,6 +1542,44 @@ TEST_F(ExternalSSTFileBasicTest, OverlappingFiles) {
  ASSERT_EQ(2, NumTableFilesAtLevel(0));
 }

+TEST_F(ExternalSSTFileBasicTest, IngestFileAfterDBPut) {
+  // Repro https://github.com/facebook/rocksdb/issues/6245.
+  // Flush three files to L0. Ingest one more file to trigger L0->L1 compaction
+  // via trivial move. The bug happened when L1 files were incorrectly sorted
+  // resulting in an old value for "k" returned by `Get()`.
+  Options options = CurrentOptions();
+
+  ASSERT_OK(Put("k", "a"));
+  Flush();
+  ASSERT_OK(Put("k", "a"));
+  Flush();
+  ASSERT_OK(Put("k", "a"));
+  Flush();
+  SstFileWriter sst_file_writer(EnvOptions(), options);
+
+  // Current file size should be 0 after sst_file_writer init and before open a
+  // file.
+  ASSERT_EQ(sst_file_writer.FileSize(), 0);
+
+  std::string file1 = sst_files_dir_ + "file1.sst";
+  ASSERT_OK(sst_file_writer.Open(file1));
+  ASSERT_OK(sst_file_writer.Put("k", "b"));
+
+  ExternalSstFileInfo file1_info;
+  Status s = sst_file_writer.Finish(&file1_info);
+  ASSERT_OK(s) << s.ToString();
+
+  // Current file size should be non-zero after success write.
+  ASSERT_GT(sst_file_writer.FileSize(), 0);
+
+  IngestExternalFileOptions ifo;
+  s = db_->IngestExternalFile({file1}, ifo);
+  ASSERT_OK(s);
+  ASSERT_OK(dbfull()->TEST_WaitForCompact());
+
+  ASSERT_EQ(Get("k"), "b");
+}
+
 INSTANTIATE_TEST_CASE_P(ExternalSSTFileBasicTest, ExternalSSTFileBasicTest,
                        testing::Values(std::make_tuple(true, true),
                                        std::make_tuple(true, false),
--- a/db/external_sst_file_ingestion_job.cc
+++ b/db/external_sst_file_ingestion_job.cc
@ -368,9 +368,32 @@ Status ExternalSstFileIngestionJob::Run() {
          super_version, force_global_seqno, cfd_->ioptions()->compaction_style,
          last_seqno, &f, &assigned_seqno);
    }
+
+    // Modify the smallest/largest internal key to include the sequence number
+    // that we just learned. Only overwrite sequence number zero. There could
+    // be a nonzero sequence number already to indicate a range tombstone's
+    // exclusive endpoint.
+    ParsedInternalKey smallest_parsed, largest_parsed;
+    if (status.ok()) {
+      status = ParseInternalKey(*f.smallest_internal_key.rep(),
+                                &smallest_parsed, false /* log_err_key */);
+    }
+    if (status.ok()) {
+      status = ParseInternalKey(*f.largest_internal_key.rep(), &largest_parsed,
+                                false /* log_err_key */);
+    }
    if (!status.ok()) {
      return status;
    }
+    if (smallest_parsed.sequence == 0) {
+      UpdateInternalKey(f.smallest_internal_key.rep(), assigned_seqno,
+                        smallest_parsed.type);
+    }
+    if (largest_parsed.sequence == 0) {
+      UpdateInternalKey(f.largest_internal_key.rep(), assigned_seqno,
+                        largest_parsed.type);
+    }
+
    status = AssignGlobalSeqnoForIngestedFile(&f, assigned_seqno);
    TEST_SYNC_POINT_CALLBACK("ExternalSstFileIngestionJob::Run",
                             &assigned_seqno);
--- a/db/version_set.cc
+++ b/db/version_set.cc
@ -4083,6 +4083,7 @@ Status VersionSet::ProcessManifestWrites(
  uint64_t new_manifest_file_size = 0;
  Status s;
  IOStatus io_s;
+  IOStatus manifest_io_status;
  {
    FileOptions opt_file_opts = fs_->OptimizeForManifestWrite(file_options_);
    mu->Unlock();
@ -4134,6 +4135,7 @@ Status VersionSet::ProcessManifestWrites(
        s = WriteCurrentStateToManifest(curr_state, wal_additions,
                                        descriptor_log_.get(), io_s);
      } else {
+        manifest_io_status = io_s;
        s = io_s;
      }
    }
@ -4171,11 +4173,13 @@ Status VersionSet::ProcessManifestWrites(
        io_s = descriptor_log_->AddRecord(record);
        if (!io_s.ok()) {
          s = io_s;
+          manifest_io_status = io_s;
          break;
        }
      }
      if (s.ok()) {
        io_s = SyncManifest(db_options_, descriptor_log_->file());
+        manifest_io_status = io_s;
        TEST_SYNC_POINT_CALLBACK(
            "VersionSet::ProcessManifestWrites:AfterSyncManifest", &io_s);
      }
@ -4188,6 +4192,9 @@ Status VersionSet::ProcessManifestWrites(

    // If we just created a new descriptor file, install it by writing a
    // new CURRENT file that points to it.
+    if (s.ok()) {
+      assert(manifest_io_status.ok());
+    }
    if (s.ok() && new_descriptor_log) {
      io_s = SetCurrentFile(fs_.get(), dbname_, pending_manifest_file_number_,
                            db_directory);
@ -4303,11 +4310,41 @@ Status VersionSet::ProcessManifestWrites(
    for (auto v : versions) {
      delete v;
    }
+    if (manifest_io_status.ok()) {
+      manifest_file_number_ = pending_manifest_file_number_;
+      manifest_file_size_ = new_manifest_file_size;
+    }
    // If manifest append failed for whatever reason, the file could be
    // corrupted. So we need to force the next version update to start a
    // new manifest file.
    descriptor_log_.reset();
-    if (new_descriptor_log) {
+    // If manifest operations failed, then we know the CURRENT file still
+    // points to the original MANIFEST. Therefore, we can safely delete the
+    // new MANIFEST.
+    // If manifest operations succeeded, and we are here, then it is possible
+    // that renaming tmp file to CURRENT failed.
+    //
+    // On local POSIX-compliant FS, the CURRENT must point to the original
+    // MANIFEST. We can delete the new MANIFEST for simplicity, but we can also
+    // keep it. Future recovery will ignore this MANIFEST. It's also ok for the
+    // process not to crash and continue using the db. Any future LogAndApply()
+    // call will switch to a new MANIFEST and update CURRENT, still ignoring
+    // this one.
+    //
+    // On non-local FS, it is
+    // possible that the rename operation succeeded on the server (remote)
+    // side, but the client somehow returns a non-ok status to RocksDB. Note
+    // that this does not violate atomicity. Should we delete the new MANIFEST
+    // successfully, a subsequent recovery attempt will likely see the CURRENT
+    // pointing to the new MANIFEST, thus fail. We will not be able to open the
+    // DB again. Therefore, if manifest operations succeed, we should keep the
+    // the new MANIFEST. If the process proceeds, any future LogAndApply() call
+    // will switch to a new MANIFEST and update CURRENT. If user tries to
+    // re-open the DB,
+    // a) CURRENT points to the new MANIFEST, and the new MANIFEST is present.
+    // b) CURRENT points to the original MANIFEST, and the original MANIFEST
+    //    also exists.
+    if (new_descriptor_log && !manifest_io_status.ok()) {
      ROCKS_LOG_INFO(db_options_->info_log,
                     "Deleting manifest %" PRIu64 " current manifest %" PRIu64
                     "\n",
--- a/env/env_hdfs.cc
+++ b/env/env_hdfs.cc
@ -213,6 +213,8 @@ class HdfsWritableFile: public WritableFile {
    }
  }

+  using WritableFile::Append;
+
  // If the file was successfully created, then this returns true.
  // Otherwise returns false.
  bool isValid() {
--- a/env/fs_posix.cc
+++ b/env/fs_posix.cc
@ -620,9 +620,10 @@ class PosixFileSystem : public FileSystem {
      }
    }

-    const auto pre_read_errno = errno;  // errno may be modified by readdir
+    // reset errno before calling readdir()
+    errno = 0;
    struct dirent* entry;
-    while ((entry = readdir(d)) != nullptr && errno == pre_read_errno) {
+    while ((entry = readdir(d)) != nullptr) {
      // filter out '.' and '..' directory entries
      // which appear only on some platforms
      const bool ignore =
@ -631,19 +632,20 @@ class PosixFileSystem : public FileSystem {
      if (!ignore) {
        result->push_back(entry->d_name);
      }
+      errno = 0;  // reset errno if readdir() success
    }

    // always attempt to close the dir
    const auto pre_close_errno = errno;  // errno may be modified by closedir
    const int close_result = closedir(d);

-    if (pre_close_errno != pre_read_errno) {
-      // error occured during readdir
+    if (pre_close_errno != 0) {
+      // error occurred during readdir
      return IOError("While readdir", dir, pre_close_errno);
    }

    if (close_result != 0) {
-      // error occured during closedir
+      // error occurred during closedir
      return IOError("While closedir", dir, errno);
    }

--- a/file/filename.cc
+++ b/file/filename.cc
@ -383,10 +383,12 @@ IOStatus SetCurrentFile(FileSystem* fs, const std::string& dbname,
  contents.remove_prefix(dbname.size() + 1);
  std::string tmp = TempFileName(dbname, descriptor_number);
  IOStatus s = WriteStringToFile(fs, contents.ToString() + "\n", tmp, true);
+  TEST_SYNC_POINT_CALLBACK("SetCurrentFile:BeforeRename", &s);
  if (s.ok()) {
    TEST_KILL_RANDOM("SetCurrentFile:0", rocksdb_kill_odds * REDUCE_ODDS2);
    s = fs->RenameFile(tmp, CurrentFileName(dbname), IOOptions(), nullptr);
    TEST_KILL_RANDOM("SetCurrentFile:1", rocksdb_kill_odds * REDUCE_ODDS2);
+    TEST_SYNC_POINT_CALLBACK("SetCurrentFile:AfterRename", &s);
  }
  if (s.ok()) {
    if (directory_to_fsync != nullptr) {
--- a/include/rocksdb/version.h
+++ b/include/rocksdb/version.h
@ -11,7 +11,7 @@

 #define ROCKSDB_MAJOR 6
 #define ROCKSDB_MINOR 19
-#define ROCKSDB_PATCH 0
+#define ROCKSDB_PATCH 4

 // Do not use these. We made the mistake of declaring macros starting with
 // double underscore. Now we have to live with our choice. We'll deprecate these
--- a/java/src/test/java/org/rocksdb/RocksDBTest.java
+++ b/java/src/test/java/org/rocksdb/RocksDBTest.java
@ -1459,8 +1459,8 @@ public class RocksDBTest {
        assertThat(livefiles.manifestFileSize).isEqualTo(57);
        assertThat(livefiles.files.size()).isEqualTo(3);
        assertThat(livefiles.files.get(0)).isEqualTo("/CURRENT");
-        assertThat(livefiles.files.get(1)).isEqualTo("/MANIFEST-000003");
-        assertThat(livefiles.files.get(2)).isEqualTo("/OPTIONS-000006");
+        assertThat(livefiles.files.get(1)).isEqualTo("/MANIFEST-000004");
+        assertThat(livefiles.files.get(2)).isEqualTo("/OPTIONS-000007");
      }
    }
  }
--- a/port/win/env_win.cc
+++ b/port/win/env_win.cc
@ -680,7 +680,8 @@ IOStatus WinFileSystem::GetChildren(const std::string& dir,
    // which appear only on some platforms
    const bool ignore =
        ((data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0) &&
-        (strcmp(data.cFileName, ".") == 0 || strcmp(data.cFileName, "..") == 0);
+        (RX_FNCMP(data.cFileName, ".") == 0 ||
+         RX_FNCMP(data.cFileName, "..") == 0);
    if (!ignore) {
      auto x = RX_FILESTRING(data.cFileName, RX_FNLEN(data.cFileName));
      result->push_back(FN_TO_RX(x));
--- a/port/win/port_win.h
+++ b/port/win/port_win.h
@ -355,6 +355,7 @@ extern void SetCpuPriority(ThreadId id, CpuPriority priority);
 #define RX_FILESTRING std::wstring
 #define RX_FN(a) ROCKSDB_NAMESPACE::port::utf8_to_utf16(a)
 #define FN_TO_RX(a) ROCKSDB_NAMESPACE::port::utf16_to_utf8(a)
+#define RX_FNCMP(a, b) ::wcscmp(a, RX_FN(b).c_str())
 #define RX_FNLEN(a) ::wcslen(a)

 #define RX_DeleteFile DeleteFileW
@ -379,6 +380,7 @@ extern void SetCpuPriority(ThreadId id, CpuPriority priority);
 #define RX_FILESTRING std::string
 #define RX_FN(a) a
 #define FN_TO_RX(a) a
+#define RX_FNCMP(a, b) strcmp(a, b)
 #define RX_FNLEN(a) strlen(a)

 #define RX_DeleteFile DeleteFileA
--- a/src.mk
+++ b/src.mk
@ -258,20 +258,6 @@ LIB_SOURCES =                                                   \
  utilities/transactions/lock/lock_manager.cc                   \
  utilities/transactions/lock/point/point_lock_tracker.cc       \
  utilities/transactions/lock/point/point_lock_manager.cc       \
-  utilities/transactions/lock/range/range_tree/lib/locktree/concurrent_tree.cc \
-  utilities/transactions/lock/range/range_tree/lib/locktree/keyrange.cc        \
-  utilities/transactions/lock/range/range_tree/lib/locktree/lock_request.cc    \
-  utilities/transactions/lock/range/range_tree/lib/locktree/locktree.cc        \
-  utilities/transactions/lock/range/range_tree/lib/locktree/manager.cc         \
-  utilities/transactions/lock/range/range_tree/lib/locktree/range_buffer.cc    \
-  utilities/transactions/lock/range/range_tree/lib/locktree/treenode.cc        \
-  utilities/transactions/lock/range/range_tree/lib/locktree/txnid_set.cc       \
-  utilities/transactions/lock/range/range_tree/lib/locktree/wfg.cc             \
-  utilities/transactions/lock/range/range_tree/lib/standalone_port.cc          \
-  utilities/transactions/lock/range/range_tree/lib/util/dbt.cc                 \
-  utilities/transactions/lock/range/range_tree/lib/util/memarena.cc            \
-  utilities/transactions/lock/range/range_tree/range_tree_lock_manager.cc      \
-  utilities/transactions/lock/range/range_tree/range_tree_lock_tracker.cc      \
  utilities/transactions/optimistic_transaction.cc              \
  utilities/transactions/optimistic_transaction_db_impl.cc      \
  utilities/transactions/pessimistic_transaction.cc             \
@ -303,6 +289,22 @@ LIB_SOURCES_ASM =
 LIB_SOURCES_C =
 endif

+RANGE_TREE_SOURCES =\
+  utilities/transactions/lock/range/range_tree/lib/locktree/concurrent_tree.cc \
+  utilities/transactions/lock/range/range_tree/lib/locktree/keyrange.cc        \
+  utilities/transactions/lock/range/range_tree/lib/locktree/lock_request.cc    \
+  utilities/transactions/lock/range/range_tree/lib/locktree/locktree.cc        \
+  utilities/transactions/lock/range/range_tree/lib/locktree/manager.cc         \
+  utilities/transactions/lock/range/range_tree/lib/locktree/range_buffer.cc    \
+  utilities/transactions/lock/range/range_tree/lib/locktree/treenode.cc        \
+  utilities/transactions/lock/range/range_tree/lib/locktree/txnid_set.cc       \
+  utilities/transactions/lock/range/range_tree/lib/locktree/wfg.cc             \
+  utilities/transactions/lock/range/range_tree/lib/standalone_port.cc          \
+  utilities/transactions/lock/range/range_tree/lib/util/dbt.cc                 \
+  utilities/transactions/lock/range/range_tree/lib/util/memarena.cc            \
+  utilities/transactions/lock/range/range_tree/range_tree_lock_manager.cc      \
+  utilities/transactions/lock/range/range_tree/range_tree_lock_tracker.cc
+
 TOOL_LIB_SOURCES =                                              \
  tools/io_tracer_parser_tool.cc                                \
  tools/ldb_cmd.cc                                              \
--- a/table/block_based/block_based_table_builder.cc
+++ b/table/block_based/block_based_table_builder.cc
@ -1645,6 +1645,11 @@ void BlockBasedTableBuilder::EnterUnbuffered() {
                                  ? r->compression_opts.zstd_max_train_bytes
                                  : r->compression_opts.max_dict_bytes;
  const size_t kNumBlocksBuffered = r->data_block_and_keys_buffers.size();
+  if (kNumBlocksBuffered == 0) {
+    // The below code is neither safe nor necessary for handling zero data
+    // blocks.
+    return;
+  }

  // Abstract algebra teaches us that a finite cyclic group (such as the
  // additive group of integers modulo N) can be generated by a number that is
--- a/table/block_based/full_filter_block.cc
+++ b/table/block_based/full_filter_block.cc
@ -22,6 +22,7 @@ FullFilterBlockBuilder::FullFilterBlockBuilder(
      whole_key_filtering_(whole_key_filtering),
      last_whole_key_recorded_(false),
      last_prefix_recorded_(false),
+      last_key_in_domain_(false),
      num_added_(0) {
  assert(filter_bits_builder != nullptr);
  filter_bits_builder_.reset(filter_bits_builder);
@ -30,6 +31,15 @@ FullFilterBlockBuilder::FullFilterBlockBuilder(
 void FullFilterBlockBuilder::Add(const Slice& key_without_ts) {
  const bool add_prefix =
      prefix_extractor_ && prefix_extractor_->InDomain(key_without_ts);
+
+  if (!last_prefix_recorded_ && last_key_in_domain_) {
+    // We can reach here when a new filter partition starts in partitioned
+    // filter. The last prefix in the previous partition should be added if
+    // necessary regardless of key_without_ts, to support prefix SeekForPrev.
+    AddKey(last_prefix_str_);
+    last_prefix_recorded_ = true;
+  }
+
  if (whole_key_filtering_) {
    if (!add_prefix) {
      AddKey(key_without_ts);
@ -49,7 +59,10 @@ void FullFilterBlockBuilder::Add(const Slice& key_without_ts) {
    }
  }
  if (add_prefix) {
+    last_key_in_domain_ = true;
    AddPrefix(key_without_ts);
+  } else {
+    last_key_in_domain_ = false;
  }
 }

@ -61,6 +74,7 @@ inline void FullFilterBlockBuilder::AddKey(const Slice& key) {

 // Add prefix to filter if needed
 void FullFilterBlockBuilder::AddPrefix(const Slice& key) {
+  assert(prefix_extractor_ && prefix_extractor_->InDomain(key));
  Slice prefix = prefix_extractor_->Transform(key);
  if (whole_key_filtering_) {
    // if both whole_key and prefix are added to bloom then we will have whole
--- a/table/block_based/full_filter_block.h
+++ b/table/block_based/full_filter_block.h
@ -61,6 +61,7 @@ class FullFilterBlockBuilder : public FilterBlockBuilder {
  virtual void Reset();
  void AddPrefix(const Slice& key);
  const SliceTransform* prefix_extractor() { return prefix_extractor_; }
+  const std::string& last_prefix_str() const { return last_prefix_str_; }

 private:
  // important: all of these might point to invalid addresses
@ -72,10 +73,14 @@ class FullFilterBlockBuilder : public FilterBlockBuilder {
  std::string last_whole_key_str_;
  bool last_prefix_recorded_;
  std::string last_prefix_str_;
+  // Whether prefix_extractor_->InDomain(last_whole_key_) is true.
+  // Used in partitioned filters so that the last prefix from the previous
+  // filter partition will be added to the current partition if
+  // last_key_in_domain_ is true, regardless of the current key.
+  bool last_key_in_domain_;

  uint32_t num_added_;
  std::unique_ptr<const char[]> filter_data_;
-
 };

 // A FilterBlockReader is used to parse filter from SST table.
--- a/table/block_based/partitioned_filter_block.cc
+++ b/table/block_based/partitioned_filter_block.cc
@ -73,13 +73,16 @@ void PartitionedFilterBlockBuilder::MaybeCutAFilterBlock(
  }
  filter_gc.push_back(std::unique_ptr<const char[]>(nullptr));

-  // Add the prefix of the next key before finishing the partition. This hack,
-  // fixes a bug with format_verison=3 where seeking for the prefix would lead
-  // us to the previous partition.
-  const bool add_prefix =
+  // Add the prefix of the next key before finishing the partition without
+  // updating last_prefix_str_. This hack, fixes a bug with format_verison=3
+  // where seeking for the prefix would lead us to the previous partition.
+  const bool maybe_add_prefix =
      next_key && prefix_extractor() && prefix_extractor()->InDomain(*next_key);
-  if (add_prefix) {
-    FullFilterBlockBuilder::AddPrefix(*next_key);
+  if (maybe_add_prefix) {
+    const Slice next_key_prefix = prefix_extractor()->Transform(*next_key);
+    if (next_key_prefix.compare(last_prefix_str()) != 0) {
+      AddKey(next_key_prefix);
+    }
  }

  Slice filter = filter_bits_builder_->Finish(&filter_gc.back());
--- a/table/table_test.cc
+++ b/table/table_test.cc
@ -3990,8 +3990,7 @@ class TestPrefixExtractor : public ROCKSDB_NAMESPACE::SliceTransform {
  }

  bool InDomain(const ROCKSDB_NAMESPACE::Slice& src) const override {
-    assert(IsValid(src));
-    return true;
+    return IsValid(src);
  }

  bool InRange(const ROCKSDB_NAMESPACE::Slice& /*dst*/) const override {
--- a/utilities/backupable/backupable_db_test.cc
+++ b/utilities/backupable/backupable_db_test.cc
@ -2618,19 +2618,19 @@ TEST_F(BackupableDBTest, GarbageCollectionBeforeBackup) {
  OpenDBAndBackupEngine(true);

  ASSERT_OK(backup_chroot_env_->CreateDirIfMissing(backupdir_ + "/shared"));
-  std::string file_five = backupdir_ + "/shared/000008.sst";
+  std::string file_five = backupdir_ + "/shared/000009.sst";
  std::string file_five_contents = "I'm not really a sst file";
-  // this depends on the fact that 00008.sst is the first file created by the DB
+  // this depends on the fact that 00009.sst is the first file created by the DB
  ASSERT_OK(file_manager_->WriteToFile(file_five, file_five_contents));

  FillDB(db_.get(), 0, 100);
-  // backup overwrites file 000008.sst
+  // backup overwrites file 000009.sst
  ASSERT_OK(backup_engine_->CreateNewBackup(db_.get(), true));

  std::string new_file_five_contents;
  ASSERT_OK(ReadFileToString(backup_chroot_env_.get(), file_five,
                             &new_file_five_contents));
-  // file 000008.sst was overwritten
+  // file 000009.sst was overwritten
  ASSERT_TRUE(new_file_five_contents != file_five_contents);

  CloseDBAndBackupEngine();
--- a/utilities/fault_injection_fs.h
+++ b/utilities/fault_injection_fs.h
@ -22,7 +22,7 @@
 #include <string>

 #include "file/filename.h"
-#include "include/rocksdb/file_system.h"
+#include "rocksdb/file_system.h"
 #include "util/mutexlock.h"
 #include "util/random.h"
 #include "util/thread_local.h"
Author	SHA1	Message	Date
Zhichao Cao	fd8247b1be	update HISTORY.md and bump version for 6.19.4	2021-04-23 15:42:04 -07:00
Zhichao Cao	fe352574b4	Fix the false positive alert of CF consistency check in WAL recovery (#8207 ) Summary: In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue. Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here. Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207 Test Plan: added unit test, make crash_test Reviewed By: riversand963 Differential Revision: D27898839 Pulled By: zhichao-cao fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e	2021-04-23 15:39:35 -07:00
Andrew Kryczka	6fbe42a6fd	Fix seqno in ingested file boundary key metadata (#8209 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/6245. Adapted from https://github.com/facebook/rocksdb/issues/8201 and https://github.com/facebook/rocksdb/issues/8205. Previously we were writing the ingested file's smallest/largest internal keys with sequence number zero, or `kMaxSequenceNumber` in case of range tombstone. The former (sequence number zero) is incorrect and can lead to files being incorrectly ordered. The fix in this PR is to overwrite boundary keys that have sequence number zero with the ingested file's assigned sequence number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8209 Test Plan: repro unit test Reviewed By: riversand963 Differential Revision: D27885678 Pulled By: ajkr fbshipit-source-id: 4a9f2c6efdfff81c3a9923e915ea88b250ee7b6a	2021-04-23 15:37:41 -07:00
Andrew Gallagher	e6c47cd6cc	Cleanup include (#8208 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8208 Make include of "file_system.h" use the same include path as everywhere else. Reviewed By: riversand963, akankshamahajan15 Differential Revision: D27881606 fbshipit-source-id: fc1e076229fde21041a813c655ce017b5070c8b3	2021-04-20 15:24:25 -07:00
Yanqin Jin	645c445978	Bump version	2021-04-19 19:57:42 -07:00
Yanqin Jin	f952de5be2	Handle rename() failure in non-local FS (#8192 ) Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4	2021-04-19 19:56:59 -07:00
Adam Retter	4362985805	Fix Windows strcmp for Unicode (#8190 ) Summary: The code for strcmp that was present does work when compiled for Windows unicode file paths. Needs backporting to: * 6.17.fb * 6.18.fb * 6.19.fb Pull Request resolved: https://github.com/facebook/rocksdb/pull/8190 Reviewed By: akankshamahajan15 Differential Revision: D27765588 Pulled By: jay-zhuang fbshipit-source-id: 89f8a5ac61fd7edc758340dfd335b0a5f96dae6e	2021-04-19 08:41:06 -07:00
Yanqin Jin	85cdad116b	Update history and bump version	2021-04-08 11:26:13 -07:00
Yanqin Jin	07e3794972	Fix a bug for SeekForPrev with partitioned filter and prefix (#8137 ) Summary: According to https://github.com/facebook/rocksdb/issues/5907, each filter partition "should include the bloom of the prefix of the last key in the previous partition" so that SeekForPrev() in prefix mode can return correct result. The prefix of the last key in the previous partition does not necessarily have the same prefix as the first key in the current partition. Regardless of the first key in current partition, the prefix of the last key in the previous partition should be added. The existing code, however, does not follow this. Furthermore, there is another issue: when finishing current filter partition, `FullFilterBlockBuilder::AddPrefix()` is called for the first key in next filter partition, which effectively overwrites `last_prefix_str_` prematurely. Consequently, when the filter block builder proceeds to the next partition, `last_prefix_str_` will be the prefix of its first key, leaving no way of adding the bloom of the prefix of the last key of the previous partition. Prefix extractor is FixedLength.2. ``` [ filter part 1 ] [ filter part 2 ] abc d ``` When SeekForPrev("abcd"), checking the filter partition will land on filter part 2 because "abcd" > "abc" but smaller than "d". If the filter in filter part 2 happens to return false for the test for "ab", then SeekForPrev("abcd") will build incorrect iterator tree in non-total-order mode. Also fix a unit test which starts to fail following this PR. `InDomain` should not fail due to assertion error when checking on an arbitrary key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8137 Test Plan: ``` make check ``` Without this fix, the following command will fail pretty soon. ``` ./db_stress --acquire_snapshot_one_in=10000 --avoid_flush_during_recovery=0 \ --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 \ --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=17 \ --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 \ --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 \ --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 \ --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox \ --db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_blob_files=0 \ --enable_compaction_filter=0 --enable_pipelined_write=1 --file_checksum_impl=big --flush_one_in=1000000 \ --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=2 --ingest_external_file_one_in=0 \ --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \ --log2_keys_per_lock=10 --long_running_snapshots=1 --mark_for_compaction_one_file_in=0 \ --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000000 --max_key_len=3 \ --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 \ --max_write_buffer_size_to_maintain=8388608 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False \ --nooverwritepercent=0 --open_files=500000 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=0 --pause_background_one_in=1000000 \ --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --read_fault_one_in=0 --read_only=0 \ --readpercent=45 --recycle_log_file_num=0 --reopen=20 --secondary_catch_up_one_in=0 \ --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 \ --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False \ --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 \ --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_blob_db=0 --use_block_based_filter=0 \ --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 \ --use_multiget=0 --use_ribbon_filter=0 --use_txn=0 --user_timestamp_size=8 --verify_checksum=1 \ --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 \ --write_dbid_to_manifest=1 --writepercent=35 ``` Reviewed By: pdillinger Differential Revision: D27553054 Pulled By: riversand963 fbshipit-source-id: 60e391e4a2d8d98a9a3172ec5d6176b90ec3de98	2021-04-08 11:21:00 -07:00
Adam Retter	1624f20934	Update ZStd. Fixes an issue with Make 3.82 (#8155 ) Summary: The previous version of ZStd doesn't build correctly with Make 3.82. Updating it resolves the issue. jay-zhuang This also needs to be cherry-picked to: 1. 6.17.fb 2. 6.18.fb 3. 6.19.fb Pull Request resolved: https://github.com/facebook/rocksdb/pull/8155 Reviewed By: riversand963 Differential Revision: D27596460 Pulled By: ajkr fbshipit-source-id: ac8492245e6273f54efcc1587346a797a91c9441	2021-04-08 09:04:45 -07:00
Andrew Gallagher	1e96a70be4	Use `include_paths` instead of raw `-I` in TARGETS (#8143 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8143 The latter assume the location of the compile root, which can break if the build root changes. Switch to the slightly more intelligent `include_paths`, which should provide the same functionality, but do with independent of include root. Reviewed By: riversand963 Differential Revision: D27535869 fbshipit-source-id: 0129e47c0ce23e08528c9139114a591c14866fa8	2021-04-05 14:38:18 -07:00
Andrew Kryczka	efe9d5c823	update HISTORY.md and bump version for 6.19.1	2021-04-01 10:01:18 -07:00
Andrew Kryczka	a265ac75ab	Fix compression dictionary sampling with dedicated range tombstone SSTs (#8141 ) Summary: Return early in case there are zero data blocks when `BlockBasedTableBuilder::EnterUnbuffered()` is called. This crash can only be triggered by applying dictionary compression to SST files that contain only range tombstones. It cannot be triggered by a low buffer limit alone since we only consider entering unbuffered mode after buffering a data block causing the limit to be breached, or `Finish()`ing the file. It also cannot be triggered by a totally empty file because those go through `Abandon()` rather than `Finish()` so unbuffered mode is never entered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8141 Test Plan: added a unit test that repro'd the "Floating point exception" Reviewed By: riversand963 Differential Revision: D27495640 Pulled By: ajkr fbshipit-source-id: a463cfba476919dc5c5c380800a75a86c31ffa23	2021-04-01 10:00:27 -07:00
Adam Retter	9435b2e959	range_tree requires GNU libc on ppc64 (#8070 ) Summary: If the platform is ppc64 and the libc is not GNU libc, then we exclude the range_tree from compilation. See https://jira.percona.com/browse/PS-7559 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8070 Reviewed By: jay-zhuang Differential Revision: D27246004 Pulled By: mrambacher fbshipit-source-id: 59d8433242ce7ce608988341becb4f83312445f5	2021-03-30 16:04:52 -07:00
Zhichao Cao	0f8c041ea7	Remove disabled tests (#8123 ) Summary: Remove disabled tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/8123 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27367066 Pulled By: zhichao-cao fbshipit-source-id: 71fa1d492d9b0144decff0a1d0e0ef25c0ecc4ba	2021-03-28 23:06:08 -07:00
Jay Zhuang	9da6019f9d	Avoid checking errno on success call (#8119 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8119 Reviewed By: sushilpa Differential Revision: D27365407 Pulled By: jay-zhuang fbshipit-source-id: 327c09bf76834ce0be4287680640adc8b88bcec2	2021-03-28 23:05:56 -07:00
Zhichao Cao	d5e2462946	Fix flush no wal IO error bug (#8107 ) Summary: There is bug in the current code base introduced in https://github.com/facebook/rocksdb/issues/8049 , we still set the SST file write IO Error only case as hard error. Fix it by removing the logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8107 Test Plan: make check, error_handler_fs_test Reviewed By: anand1976 Differential Revision: D27321422 Pulled By: zhichao-cao fbshipit-source-id: c014afc1553ca66b655e3bbf9d0bf6eb417ccf94	2021-03-25 21:52:27 -07:00
Zhichao Cao	5b3ebdc3d1	Added append with checksum handoff API to hdfs (#8084 ) Summary: Added append with checksum handoff API to hdfs Pull Request resolved: https://github.com/facebook/rocksdb/pull/8084 Test Plan: make check Reviewed By: ajkr Differential Revision: D27237823 Pulled By: zhichao-cao fbshipit-source-id: 93b38db23b1811a6daa049afb89240089ec6f67c	2021-03-23 17:25:42 -07:00