Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9686
According to https://www.cplusplus.com/reference/deque/deque/back/,
"
The container is accessed (neither the const nor the non-const versions modify the container).
The last element is potentially accessed or modified by the caller. Concurrently accessing or modifying other elements is safe.
"
Also according to https://www.cplusplus.com/reference/deque/deque/pop_front/,
"
The container is modified.
The first element is modified. Concurrently accessing or modifying other elements is safe (although see iterator validity above).
"
In RocksDB, we never pop the last element of `DBImpl::alive_log_files_`. We have been
exploiting this fact and the above two properties when ensuring correctness when
`DBImpl::alive_log_files_` may be accessed concurrently. Specifically, it can be accessed
in the write path when db mutex is released. Sometimes, the log_mute_ is held. It can also be accessed in `FindObsoleteFiles()`
when db mutex is always held. It can also be accessed
during recovery when db mutex is also held.
Given the fact that we never pop the last element of alive_log_files_, we currently do not
acquire additional locks when accessing it in `WriteToWAL()` as follows
```
alive_log_files_.back().AddSize(log_entry.size());
```
This is problematic.
Check source code of deque.h
```
back() _GLIBCXX_NOEXCEPT
{
__glibcxx_requires_nonempty();
...
}
pop_front() _GLIBCXX_NOEXCEPT
{
...
if (this->_M_impl._M_start._M_cur
!= this->_M_impl._M_start._M_last - 1)
{
...
++this->_M_impl._M_start._M_cur;
}
...
}
```
`back()` will actually call `__glibcxx_requires_nonempty()` first.
If `__glibcxx_requires_nonempty()` is enabled and not an empty macro,
it will call `empty()`
```
bool empty() {
return this->_M_impl._M_finish == this->_M_impl._M_start;
}
```
You can see that it will access `this->_M_impl._M_start`, racing with `pop_front()`.
Therefore, TSAN will actually catch the bug in this case.
To be able to use TSAN on our library and unit tests, we should always coordinate
concurrent accesses to STL containers properly.
We need to pass information about db mutex and log mutex into `WriteToWAL()`, otherwise
it's impossible to know which mutex to acquire inside the function.
To fix this, we can catch the tail of `alive_log_files_` by reference, so that we do not have to call `back()` in `WriteToWAL()`.
Reviewed By: pdillinger
Differential Revision: D34780309
fbshipit-source-id: 1def9821f0c437f2736c6a26445d75890377889b
Summary:
There was a mistake that incorrectly cast SstPartitionerFactory (missed shared pointer). It worked for database (correct cast), but not for family. Trying to set it in family has caused Access violation.
I have also added test and improved it. Older version was passing even without sst partitioner which is weird, because on Level1 we had two SST files with same key "aaaa1". I was not sure if it is a new feature and changed it to overlaping keys "aaaa0" - "aaaa2" overlaps "aaaa1".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9622
Reviewed By: ajkr
Differential Revision: D34871968
Pulled By: pdillinger
fbshipit-source-id: a08009766da49fc198692a610e8beb19caf737e6
Summary:
https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition which was waiting for the background thread to clean-up the compaction.
make sure we only unschedule the task when it's scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659
Reviewed By: ajkr
Differential Revision: D34651820
Pulled By: jay-zhuang
fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f
Summary:
Timer crash when multiple DB instances doing heavy DB open and close
operations concurrently. Which is caused by adding a timer task with
smaller timestamp than the current running task. Fix it by moving the
getting new task timestamp part within timer mutex protection.
And other fixes:
- Disallow adding duplicated function name to timer
- Fix a minor memory leak in timer when a running task is cancelled
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656
Reviewed By: ajkr
Differential Revision: D34626296
Pulled By: jay-zhuang
fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b
Summary:
As disscussed in (https://github.com/facebook/rocksdb/issues/9223), Here added a new API named DB::OpenAndTrimHistory, this API will open DB and trim data to the timestamp specofied by **trim_ts** (The data with newer timestamp than specified trim bound will be removed). This API should only be used at a timestamp-enabled db instance recovery.
And this PR implemented a new iterator named HistoryTrimmingIterator to support trimming history with a new API named DB::OpenAndTrimHistory. HistoryTrimmingIterator wrapped around the underlying InternalITerator such that keys whose timestamps newer than **trim_ts** should not be returned to the compaction iterator while **trim_ts** is not null.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410
Reviewed By: ltamasi
Differential Revision: D34410207
Pulled By: riversand963
fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236
Summary:
Provide support for Async Read and Poll in Posix file system using IOUring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9578
Test Plan: In progress
Reviewed By: anand1976
Differential Revision: D34690256
Pulled By: akankshamahajan15
fbshipit-source-id: 291cbd1380a3cb904b726c34c0560d1b2ce44a2e
Summary:
Change the `MemPurge` code to address a failure during a crash test reported in https://github.com/facebook/rocksdb/issues/8958.
### Details and results of the crash investigation:
These failures happened in a specific scenario where the list of immutable tables was composed of 2 or more memtables, and the last memtable was the output of a previous `Mempurge` operation. Because the `PickMemtablesToFlush` function included a sorting of the memtables (previous PR related to the Mempurge project), and because the `VersionEdit` of the flush class is piggybacked onto a single one of these memtables, the `VersionEdit` was not properly selected and applied to the `VersionSet` of the DB. Since the `VersionSet` was not edited properly, the database was losing track of the SST file created during the flush process, which was subsequently deleted (and as you can expect, caused the tests to crash).
The following command consistently failed, which was quite convenient to investigate the issue:
`$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done`
### Solution proposed
The memtables are no longer sorted based on their `memtableID` in the `PickMemtablesToFlush` function. Additionally, the `next_log_number` of the memtable created as an output of the `Mempurge` function now takes in the correct value (the log number of the first memtable being mempurged). Finally, the VersionEdit object of the flush class now takes the maximum `next_log_number` of the stack of memtables being flushed, which doesnt change anything when Mempurge is `off` but becomes necessary when Mempurge is `on`.
### Testing of the solution
The following command no longer fails:
``$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done``
Additionally, I ran `db_crashtest` (`whitebox` and `blackbox`) for 2.5 hours with MemPurge on and did not observe any crash.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9671
Reviewed By: pdillinger
Differential Revision: D34697424
Pulled By: bjlemaire
fbshipit-source-id: d1ab675b361904351ac81a35c184030e52222874
Summary:
Fixes https://github.com/facebook/rocksdb/issues/9560. Only use popcnt intrinsic when HAVE_SSE42 is set. Also avoid setting it based on compiler test in portable builds because such test will pass on MSVC even without proper arch flags (ref: https://devblogs.microsoft.com/oldnewthing/20201026-00/?p=104397).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9680
Test Plan: verified the combinations of -DPORTABLE and -DFORCE_SSE42 produce expected compiler flags on Linux. Verified MSVC build using PORTABLE=1 (in CircleCI) does not set HAVE_SSE42.
Reviewed By: pdillinger
Differential Revision: D34739033
Pulled By: ajkr
fbshipit-source-id: d10456f3392945fc3e59430a1777840f7b60b276
Summary:
Integrate the streaming compress/uncompress API into WAL compression.
The streaming compression object is stored in the log_writer along with a reusable output buffer to store the compressed buffer(s).
The streaming uncompress object is stored in the log_reader along with a reusable output buffer to store the uncompressed buffer(s).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9642
Test Plan:
Added unit tests to verify different scenarios - large buffers, split compressed buffers, etc.
Future optimizations:
The overhead for small records is quite high, so it makes sense to compress only buffers above a certain threshold and use a separate record type to indicate that those records are compressed.
Reviewed By: anand1976
Differential Revision: D34709167
Pulled By: sidroyc
fbshipit-source-id: a37a3cd1301adff6152fb3fcd23726106af07dd4
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629
Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are
locked upon first operation that writes the key or has the intention of writing. For example,
`PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`,
`PessimisticTransaction::SingleDelete()` will write to or delete a key, while
`PessimisticTransaction::GetForUpdate()` is used by application to indicate
to RocksDB that the transaction has the intention of performing write operation later
in the same transaction.
Pessimistic transactions support two-phase commit (2PC). A transaction can be
`Prepared()`'ed and then `Commit()`. The prepare phase is similar to a promise: once
`Prepare()` succeeds, the transaction has acquired the necessary resources to commit.
The resources include locks, persistence of WAL, etc.
Write-committed transaction is the default pessimistic transaction implementation. In
RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare
section. `Commit()` will write a commit marker to the WAL and then write data to the
memtables. While writing to the memtables, different keys in the transaction's write batch
will be assigned different sequence numbers in ascending order.
Until commit/rollback, the transaction holds locks on the keys so that no other transaction
can write to the same keys. Furthermore, the keys' sequence numbers represent the order
in which they are committed and should be made visible. This is convenient for us to
implement support for user-defined timestamps.
Since column families with and without timestamps can co-exist in the same database,
a transaction may or may not involve timestamps. Based on this observation, we add two
optional members to each `PessimisticTransaction`, `read_timestamp_` and
`commit_timestamp_`. If no key in the transaction's write batch has timestamp, then
setting these two variables do not have any effect. For the rest of this commit, we discuss
only the cases when these two variables are meaningful.
read_timestamp_ is used mainly for validation, and should be set before first call to
`GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls
`TryLock()` that can verify if another transaction has written the same key since
`read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed
written the same key, then validation fails, and RocksDB allows this transaction to
refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()`
with a different timestamp to read, but the result of the read should not be used to
determine data that will be written later.
commit_timestamp_ must be set after finishing writing and before transaction commit.
This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after
prepare phase succeeds.
We currently require that the commit timestamp be chosen after all keys are locked. This
means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used
by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`,
`PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`,
etc. will return non-ok status because they specify timestamps before locking the keys.
Users are also prompted to use the `Transaction` APIs when they receive the non-ok status.
Reviewed By: ltamasi
Differential Revision: D31822445
fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43
Summary:
Remove redundant assignment code for member `state` in the constructor of `ImmutableDBOptions`.
There are two identical and redundant statements `stats = statistics.get();` in lines 740 and 748 of the code.
This commit removed the line 740.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9665
Reviewed By: ajkr
Differential Revision: D34686649
Pulled By: riversand963
fbshipit-source-id: 8f246ece382b6845528f4e2c843ce09bb66b2b0f
Summary:
The shared SstFileManager in db_stress can create background
work that races with TestCheckpoint such that DestroyDir fails because
of file rename while it is running. Analogous to change already made
for TestBackupRestore
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9673
Test Plan:
make blackbox_crash_test for a while with
checkpoint_one_in=100
Reviewed By: ajkr
Differential Revision: D34702215
Pulled By: pdillinger
fbshipit-source-id: ac3e166efa28cba6c6f4b9b391e799394603ebfd
Summary:
- Make `compression_per_level` dynamical changeable with `SetOptions`;
- Fix a bug that `compression_per_level` is not used for flush;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D34700749
Pulled By: jay-zhuang
fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c
Summary:
SMB mounts do not support hard links. The ENOTSUP error code is
returned, which should be interpreted by PosixFileSystem as
IOStatus::NotSupported().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9657
Reviewed By: mrambacher
Differential Revision: D34634783
Pulled By: anand1976
fbshipit-source-id: 0d57f5b2e6118e4c20e9ed1a293327428c3aecac
Summary:
In preparation for more support for file Temperatures in BackupEngine,
this change does some test refactoring:
* Move DBTest2::BackupFileTemperature test to
BackupEngineTest::FileTemperatures, with some updates to make it work
in the new home. This test will soon be expanded for deeper backup work.
* Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
support sharing because of above moved test, but split off the "no link"
part to the test needing it.
* Use custom FileSystems in backupable_db_test rather than custom Envs,
because going through Env file interfaces doesn't support temperatures.
* Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name
parameter to FsyncWithDirOptions, which was required because this
limitation caused a crash only after moving to higher fidelity of
FileSystem interface (vs. LegacyDirectoryWrapper throwing away some
parameter details)
* `backupable_options_` -> `engine_options_` as part of the ongoing
work to get rid of the obsolete "backupable" naming.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655
Test Plan: test code updates only
Reviewed By: jay-zhuang
Differential Revision: D34622183
Pulled By: pdillinger
fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
Summary:
**Context:**
`DBLogicalBlockSizeCacheTest.CreateColumnFamilies` is flaky on a rare occurrence of assertion failure below
```
db/db_logical_block_size_cache_test.cc:210
Expected equality of these values:
1
cache_->GetRefCount(cf_path_0_)
Which is: 2
```
Root-cause: `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));` in the test may not successfully decrease the ref count of `cf_path_0_` since the decreasing only happens in the clean-up of `ColumnFamilyData` when `ColumnFamilyData` has no referencing to it, which may not be true when `db->DestroyColumnFamilyHandle(cfs[0])` is called since background work such as `DumpStats()` can hold reference to that `ColumnFamilyData` (suggested and repro-d by ajkr ). Similar case `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));`.
See following for a deterministic repro:
```
diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc
index 196b428a3..4e7a834c4 100644
--- a/db/db_impl/db_impl.cc
+++ b/db/db_impl/db_impl.cc
@@ -956,10 +956,16 @@ void DBImpl::DumpStats() {
// near-atomically.
// Get a ref before unlocking
cfd->Ref();
+ if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
+ TEST_SYNC_POINT("DBImpl::DumpStats:PostCFDRef");
+ }
{
InstrumentedMutexUnlock u(&mutex_);
cfd->internal_stats()->CollectCacheEntryStats(/*foreground=*/false);
}
+ if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
+ TEST_SYNC_POINT("DBImpl::DumpStats::PreCFDUnrefAndTryDelete");
+ }
cfd->UnrefAndTryDelete();
}
}
diff --git a/db/db_logical_block_size_cache_test.cc b/db/db_logical_block_size_cache_test.cc
index 1057871c9..c3872c036 100644
--- a/db/db_logical_block_size_cache_test.cc
+++ b/db/db_logical_block_size_cache_test.cc
@@ -9,6 +9,7 @@
#include "env/io_posix.h"
#include "rocksdb/db.h"
#include "rocksdb/env.h"
+#include "test_util/sync_point.h"
namespace ROCKSDB_NAMESPACE {
class EnvWithCustomLogicalBlockSizeCache : public EnvWrapper {
@@ -183,6 +184,15 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
ASSERT_EQ(1, cache_->GetRefCount(dbname_));
std::vector<ColumnFamilyHandle*> cfs;
+ ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
+ ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
+ {{"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH",
+ "DBImpl::DumpStats:StartRunning"},
+ {"DBImpl::DumpStats:PostCFDRef",
+ "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH"},
+ {"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
+ "PostFinishCheckingRef",
+ "DBImpl::DumpStats::PreCFDUnrefAndTryDelete"}});
ASSERT_OK(db->CreateColumnFamilies(cf_options, {"cf1", "cf2"}, &cfs));
ASSERT_EQ(2, cache_->Size());
ASSERT_TRUE(cache_->Contains(dbname_));
@@ -190,7 +200,7 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
ASSERT_TRUE(cache_->Contains(cf_path_0_));
ASSERT_EQ(2, cache_->GetRefCount(cf_path_0_));
}
// Delete one handle will not drop cache because another handle is still
// referencing cf_path_0_.
+ TEST_SYNC_POINT(
+ "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH");
+ TEST_SYNC_POINT(
+ "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH");
ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));
ASSERT_EQ(2, cache_->Size());
ASSERT_TRUE(cache_->Contains(dbname_));
@@ -209,16 +221,20 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
ASSERT_TRUE(cache_->Contains(cf_path_0_));
// Will fail
ASSERT_EQ(1, cache_->GetRefCount(cf_path_0_));
// Delete the last handle will drop cache.
ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));
ASSERT_EQ(1, cache_->Size());
ASSERT_TRUE(cache_->Contains(dbname_));
// Will fail
ASSERT_EQ(1, cache_->GetRefCount(dbname_));
+ TEST_SYNC_POINT(
+ "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
+ "PostFinishCheckingRef");
delete db;
ASSERT_EQ(0, cache_->Size());
ASSERT_OK(DestroyDB(dbname_, options,
{{"cf1", cf_options}, {"cf2", cf_options}}));
+ ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
}
```
**Summary**
- Removed the flaky assertion
- Clarified the comments for the test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9516
Test Plan:
- CI
- Monitor for future flakiness
Reviewed By: ajkr
Differential Revision: D34055232
Pulled By: hx235
fbshipit-source-id: 9bf83ae5fa88bf6fc829876494d4692082e4c357
Summary:
**Context/Summary:**
As requested, `BlockBasedTableOptions::detect_filter_construct_corruption` can now be dynamically configured using `DB::SetOptions` after this PR
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9654
Test Plan: - New unit test
Reviewed By: pdillinger
Differential Revision: D34622609
Pulled By: hx235
fbshipit-source-id: c06773ef3d029e6bf1724d3a72dffd37a8ec66d9
Summary:
The UniqueIdVerifier constructor currently calls ReopenWritableFile on
the FileSystem, which might not be supported. Instead of relying on
reopening the unique IDs file for writing, create a new file and copy
the original contents.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9649
Test Plan: Run db_stress
Reviewed By: pdillinger
Differential Revision: D34572307
Pulled By: anand1976
fbshipit-source-id: 3a777908582d79dae57488d4278bad126774f698
Summary:
Improve the CI build speed:
- split the macos tests to 2 parallel jobs
- split tsan tests to 2 parallel jobs
- move non-shm tests to nightly build
- slow jobs use lager machine
- fast jobs use smaller machine
- add microbench to no-test jobs
- add run-microbench to nightly build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9605
Reviewed By: riversand963
Differential Revision: D34358982
Pulled By: jay-zhuang
fbshipit-source-id: d5091b3f4ef6d25c5c37920fb614f3342ee60e4a
Summary:
This bug affects use cases that meet the following conditions
- (has only the default column family or disables WAL) and
- has at least one event listener
- atomic flush is NOT affected.
If the above conditions meet, then RocksDB can release the db mutex before picking all the
existing memtables to flush. In the meantime, a snapshot can be created and db's sequence
number can still be incremented. The upcoming flush will ignore this snapshot.
A later read using this snapshot can return incorrect result.
To fix this issue, we call the listeners callbacks after picking the memtables so that we avoid
creating snapshots during this interval.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9648
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D34555456
Pulled By: riversand963
fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee
Summary:
Certain STLs use raw pointers and ADL does not work for them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9608
Reviewed By: ajkr
Differential Revision: D34583012
Pulled By: riversand963
fbshipit-source-id: 7de6bbc8a080c3e7243ce0d758fe83f1663168aa
Summary:
The plain data length may not be big enough if the compression actually expands data. So use deflateBound() to get the upper limit on the compressed output before deflate().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9572
Reviewed By: riversand963
Differential Revision: D34326475
Pulled By: ajkr
fbshipit-source-id: 4b679cb7a83a62782a127785b4d5eb9aa4646449
Summary:
PR https://github.com/facebook/rocksdb/issues/9557 introduced a race condition between manual compaction
foreground thread and background compaction thread.
This PR adds the ability to really unschedule manual compaction from
thread-pool queue by differentiate tag name for manual compaction and
other tasks.
Also fix an issue that db `close()` didn't cancel the manual compaction thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9625
Test Plan: unittest not hang
Reviewed By: ajkr
Differential Revision: D34410811
Pulled By: jay-zhuang
fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c
Summary:
Update the signature of Poll and ReadAsync APIs in filesystem.
Instead of unique_ptr, void** will be passed as io_handle and the delete function.
io_handle and delete function should be provided by underlying
FileSystem and its lifetime will be maintained by RocksDB. io_handle
will be deleted by RocksDB once callback is made to update the results or Poll is
called to get the results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9623
Test Plan: Add a new unit test.
Reviewed By: anand1976
Differential Revision: D34403529
Pulled By: akankshamahajan15
fbshipit-source-id: ea185a5f4c7bec334631e4f781ea7ba4135645f0
Summary:
BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it for preparing 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454
Test Plan: Run all existing tests.
Reviewed By: ajkr
Differential Revision: D33805827
fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d
Summary:
Fix g++ -march=native detection and reenable s390x in travis
This PR fixes s390x assembler messages:
```
Error: invalid switch -march=z14
Error: unrecognized option -march=z14
```
The s390x travis build was failing with gcc-7 because the assembler on
ubuntu 16.04 is too old to recognize the z14 model so it doesn't work
with -march=native on a z14 machine. It fixes the check for the
-march=native flag so that the assembler will get called and correctly
fail on ubuntu 16.04 which will cause the build to fall back to
-march=z196 which works.
The other changes are needed so builds work more consistently on
s390x:
1. Set make parallelism to 1 for s390x: The default was 4 previously
but I saw frequent internal compiler errors on travis probably due to
low resources. The `platform_dependent` job works more consistently
but is roughly 10 minutes slower although it varies.
2. Remove status_checked jobs, as we are relying on CircleCI for
these now and do not really need platform coverage on them.
Fixes https://github.com/facebook/rocksdb/issues/9524
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9631
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D34553989
Pulled By: pdillinger
fbshipit-source-id: a6e3a7276446721c4c0bebc4ed217c2ca2b53f11
Summary:
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.12.5 to 1.13.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/releases">nokogiri's releases</a>.</em></p>
<blockquote>
<h2>1.13.3 / 2022-02-21</h2>
<h3>Fixed</h3>
<ul>
<li>[CRuby] Revert a HTML4 parser bug in libxml 2.9.13 (introduced in Nokogiri v1.13.2). The bug causes libxml2's HTML4 parser to fail to recover when encountering a bare <code><</code> character in some contexts. This version of Nokogiri restores the earlier behavior, which is to recover from the parse error and treat the <code><</code> as normal character data (which will be serialized as <code>&lt;</code> in a text node). The bug (and the fix) is only relevant when the <code>RECOVER</code> parse option is set, as it is by default. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2461">https://github.com/facebook/rocksdb/issues/2461</a>]</li>
</ul>
<hr />
<p>SHA256 checksums:</p>
<pre><code>025a4e333f6f903072a919f5f75b03a8f70e4969dab4280375b73f9d8ff8d2c0 nokogiri-1.13.3-aarch64-linux.gem
b9cb59c6a6da8cf4dbee5dbb569c7cc95a6741392e69053544e0f40b15ab9ad5 nokogiri-1.13.3-arm64-darwin.gem
e55d18cee64c19d51d35ad80634e465dbcdd46ac4233cb42c1e410307244ebae nokogiri-1.13.3-java.gem
53e2d68116cd00a873406b8bdb90c78a6f10e00df7ddf917a639ac137719b67b nokogiri-1.13.3-x64-mingw-ucrt.gem
b5f39ebb662a1be7d1c61f8f0a2a683f1bb11690a6f00a99a1aa23a071f80145 nokogiri-1.13.3-x64-mingw32.gem
7c0de5863aace4bbbc73c4766cf084d1f0b7a495591e46d1666200cede404432 nokogiri-1.13.3-x86-linux.gem
675cc3e7d7cca0d6790047a062cd3aa3eab59e3cb9b19374c34f98bade588c66 nokogiri-1.13.3-x86-mingw32.gem
f445596a5a76941a9d1980747535ab50d3399d1b46c32989bc26b7dd988ee498 nokogiri-1.13.3-x86_64-darwin.gem
3f6340661c2a283b337d227ea224f859623775b2f5c09a6bf197b786563958df nokogiri-1.13.3-x86_64-linux.gem
bf1b1bceff910abb0b7ad825535951101a0361b859c2ad1be155c010081ecbdc nokogiri-1.13.3.gem
</code></pre>
<h2>1.13.2 / 2022-02-21</h2>
<h3>Security</h3>
<ul>
<li>[CRuby] Vendored libxml2 is updated from 2.9.12 to 2.9.13. This update addresses <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-23308">CVE-2022-23308</a>.</li>
<li>[CRuby] Vendored libxslt is updated from 1.1.34 to 1.1.35. This update addresses <a href="https://nvd.nist.gov/vuln/detail/CVE-2021-30560">CVE-2021-30560</a>.</li>
</ul>
<p>Please see <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-fq42-c5rg-92c2">GHSA-fq42-c5rg-92c2</a> for more information about these CVEs.</p>
<h3>Dependencies</h3>
<ul>
<li>[CRuby] Vendored libxml2 is updated from 2.9.12 to 2.9.13. Full changelog is available at <a href="https://download.gnome.org/sources/libxml2/2.9/libxml2-2.9.13.news">https://download.gnome.org/sources/libxml2/2.9/libxml2-2.9.13.news</a></li>
<li>[CRuby] Vendored libxslt is updated from 1.1.34 to 1.1.35. Full changelog is available at <a href="https://download.gnome.org/sources/libxslt/1.1/libxslt-1.1.35.news">https://download.gnome.org/sources/libxslt/1.1/libxslt-1.1.35.news</a></li>
</ul>
<hr />
<p>SHA256 checksums:</p>
<pre><code>63469a9bb56a21c62fbaea58d15f54f8f167ff6fde51c5c2262072f939926fdd nokogiri-1.13.2-aarch64-linux.gem
2986617f982f645c06f22515b721e6d2613dd69493e5c41ddd03c4830c3b3065 nokogiri-1.13.2-arm64-darwin.gem
aca1d66206740b29d0d586b1d049116adcb31e6cdd7c4dd3a96eb77da215a0c4 nokogiri-1.13.2-java.gem
b9e4eea1a200d9a927a5bc7d662c427e128779cba0098ea49ddbdb3ffc3ddaec nokogiri-1.13.2-x64-mingw-ucrt.gem
48d5493fec495867c5516a908a068c1387a1d17c5aeca6a1c98c089d9d9fdcf8 nokogiri-1.13.2-x64-mingw32.gem
62034d7aaaa83fbfcb8876273cc5551489396841a66230d3200b67919ef76cf9 nokogiri-1.13.2-x86-linux.gem
e07237b82394017c2bfec73c637317ee7dbfb56e92546151666abec551e46d1d nokogiri-1.13.2-x86-mingw32.gem
</tr></table>
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md">nokogiri's changelog</a>.</em></p>
<blockquote>
<h2>1.13.3 / 2022-02-21</h2>
<h3>Fixed</h3>
<ul>
<li>[CRuby] Revert a HTML4 parser bug in libxml 2.9.13 (introduced in Nokogiri v1.13.2). The bug causes libxml2's HTML4 parser to fail to recover when encountering a bare <code><</code> character in some contexts. This version of Nokogiri restores the earlier behavior, which is to recover from the parse error and treat the <code><</code> as normal character data (which will be serialized as <code>&lt;</code> in a text node). The bug (and the fix) is only relevant when the <code>RECOVER</code> parse option is set, as it is by default. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2461">https://github.com/facebook/rocksdb/issues/2461</a>]</li>
</ul>
<h2>1.13.2 / 2022-02-21</h2>
<h3>Security</h3>
<ul>
<li>[CRuby] Vendored libxml2 is updated from 2.9.12 to 2.9.13. This update addresses <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-23308">CVE-2022-23308</a>.</li>
<li>[CRuby] Vendored libxslt is updated from 1.1.34 to 1.1.35. This update addresses <a href="https://nvd.nist.gov/vuln/detail/CVE-2021-30560">CVE-2021-30560</a>.</li>
</ul>
<p>Please see <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-fq42-c5rg-92c2">GHSA-fq42-c5rg-92c2</a> for more information about these CVEs.</p>
<h3>Dependencies</h3>
<ul>
<li>[CRuby] Vendored libxml2 is updated from 2.9.12 to 2.9.13. Full changelog is available at <a href="https://download.gnome.org/sources/libxml2/2.9/libxml2-2.9.13.news">https://download.gnome.org/sources/libxml2/2.9/libxml2-2.9.13.news</a></li>
<li>[CRuby] Vendored libxslt is updated from 1.1.34 to 1.1.35. Full changelog is available at <a href="https://download.gnome.org/sources/libxslt/1.1/libxslt-1.1.35.news">https://download.gnome.org/sources/libxslt/1.1/libxslt-1.1.35.news</a></li>
</ul>
<h2>1.13.1 / 2022-01-13</h2>
<h3>Fixed</h3>
<ul>
<li>Fix <code>Nokogiri::XSLT.quote_params</code> regression in v1.13.0 that raised an exception when non-string stylesheet parameters were passed. Non-string parameters (e.g., integers and symbols) are now explicitly supported and both keys and values will be stringified with <code>#to_s</code>. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2418">https://github.com/facebook/rocksdb/issues/2418</a>]</li>
<li>Fix CSS selector query regression in v1.13.0 that raised an <code>Nokogiri::XML::XPath::SyntaxError</code> when parsing XPath attributes mixed into the CSS query. Although this mash-up of XPath and CSS syntax previously worked unintentionally, it is now an officially supported feature and is documented as such. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2419">https://github.com/facebook/rocksdb/issues/2419</a>]</li>
</ul>
<h2>1.13.0 / 2022-01-06</h2>
<h3>Notes</h3>
<h4>Ruby</h4>
<p>This release introduces native gem support for Ruby 3.1. Please note that Windows users should use the <code>x64-mingw-ucrt</code> platform gem for Ruby 3.1, and <code>x64-mingw32</code> for Ruby 2.6–3.0 (see <a href="https://rubyinstaller.org/2021/12/31/rubyinstaller-3.1.0-1-released.html">RubyInstaller 3.1.0 release notes</a>).</p>
<p>This release ends support for:</p>
<ul>
<li>Ruby 2.5, for which <a href="https://www.ruby-lang.org/en/downloads/branches/">official support ended 2021-03-31</a>.</li>
<li>JRuby 9.2, which is a Ruby 2.5-compatible release.</li>
</ul>
<h4>Faster, more reliable installation: Native Gem for ARM64 Linux</h4>
<p>This version of Nokogiri ships experimental native gem support for the <code>aarch64-linux</code> platform, which should support AWS Graviton and other ARM Linux platforms. We don't yet have CI running for this platform, and so we're interested in hearing back from y'all whether this is working, and what problems you're seeing. Please send us feedback here: <a href="https://github.com/sparklemotion/nokogiri/discussions/2359">Feedback: Have you used the <code>aarch64-linux</code> native gem?</a></p>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="7d74cedf27"><code>7d74ced</code></a> version bump to v1.13.3</li>
<li><a href="5970fd95c8"><code>5970fd9</code></a> fix: revert libxml2 regression with HTML4 recovery</li>
<li><a href="49b86631b7"><code>49b8663</code></a> version bump to v1.13.2</li>
<li><a href="4729133787"><code>4729133</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2457">https://github.com/facebook/rocksdb/issues/2457</a> from sparklemotion/flavorjones-libxml-2.9.13-v1.13.x</li>
<li><a href="379f757ef5"><code>379f757</code></a> dev(package): work around gnome mirrors with expired certs</li>
<li><a href="95cf66ca9f"><code>95cf66c</code></a> dep: upgrade libxml2 2.9.12 → 2.9.13</li>
<li><a href="d37dd02ea5"><code>d37dd02</code></a> dep: upgrade libxslt 1.1.34 → 1.1.35</li>
<li><a href="59a93986ec"><code>59a9398</code></a> dep: upgrade mini_portile 2.7 to 2.8</li>
<li><a href="e885463285"><code>e885463</code></a> dev(package): handle either .tar.gz or .tar.xz archive names</li>
<li><a href="7957c7b009"><code>7957c7b</code></a> style: rubocop</li>
<li>Additional commits viewable in <a href="https://github.com/sparklemotion/nokogiri/compare/v1.12.5...v1.13.3">compare view</a></li>
</ul>
</details>
<br />
[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=nokogiri&package-manager=bundler&previous-version=1.12.5&new-version=1.13.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `dependabot rebase` will rebase this PR
- `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `dependabot merge` will merge this PR after your CI passes on it
- `dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `dependabot cancel merge` will cancel a previously requested merge and block automerging
- `dependabot reopen` will reopen this PR if it is closed
- `dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- `dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
- `dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
- `dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
- `dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/facebook/rocksdb/network/alerts).
</details>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9636
Reviewed By: anand1976
Differential Revision: D34556272
Pulled By: jay-zhuang
fbshipit-source-id: 76aa7e92ca3fcf5d34c53091b94bfe5b0af7b55d
Summary:
We should use the released clang-format version instead of the one from
dev branch. Otherwise the format report could be inconsistent with local
development env and CI. e.g.: https://github.com/facebook/rocksdb/issues/9644
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9646
Test Plan: CI
Reviewed By: riversand963
Differential Revision: D34554065
Pulled By: jay-zhuang
fbshipit-source-id: b841bc400becb4272be18c803eb03a7a1172da6f
Summary:
Related to: https://github.com/facebook/rocksdb/pull/9215
* Adds build_detect_platform support for RISCV on Linux (at least on SiFive Unmatched platforms)
This still leaves some linking issues on RISCV remaining (e.g. when building `db_test`):
```
/usr/bin/ld: ./librocksdb_debug.a(memtable.o): in function `__gnu_cxx::new_allocator<char>::deallocate(char*, unsigned long)':
/usr/include/c++/10/ext/new_allocator.h:133: undefined reference to `__atomic_compare_exchange_1'
/usr/bin/ld: ./librocksdb_debug.a(memtable.o): in function `std::__atomic_base<bool>::compare_exchange_weak(bool&, bool, std::memory_order, std::memory_order)':
/usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
/usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
/usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
/usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
/usr/bin/ld: ./librocksdb_debug.a(memtable.o):/usr/include/c++/10/bits/atomic_base.h:464: more undefined references to `__atomic_compare_exchange_1' follow
/usr/bin/ld: ./librocksdb_debug.a(db_impl.o): in function `rocksdb::DBImpl::NewIteratorImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyData*, unsigned long, rocksdb::ReadCallback*, bool, bool)':
/home/adamretter/rocksdb/db/db_impl/db_impl.cc:3019: undefined reference to `__atomic_exchange_1'
/usr/bin/ld: ./librocksdb_debug.a(write_thread.o): in function `rocksdb::WriteThread::Writer::CreateMutex()':
/home/adamretter/rocksdb/./db/write_thread.h:205: undefined reference to `__atomic_compare_exchange_1'
/usr/bin/ld: ./librocksdb_debug.a(write_thread.o): in function `rocksdb::WriteThread::SetState(rocksdb::WriteThread::Writer*, unsigned char)':
/home/adamretter/rocksdb/db/write_thread.cc:222: undefined reference to `__atomic_compare_exchange_1'
collect2: error: ld returned 1 exit status
make: *** [Makefile:1449: db_test] Error 1
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9366
Reviewed By: jay-zhuang
Differential Revision: D34377664
Pulled By: mrambacher
fbshipit-source-id: c86f9d0cd1cb0c18de72b06f1bf5847f23f51118
Summary:
In crash test with fault injection, we were seeing stack traces like the following:
```
https://github.com/facebook/rocksdb/issues/3 0x00007f75f763c533 in __GI___assert_fail (assertion=assertion@entry=0x1c5b2a0 "end_offset >= start_offset", file=file@entry=0x1c580a0 "table/block_based/block_based_table_reader.cc", line=line@entry=3245,
function=function@entry=0x1c60e60 "virtual uint64_t rocksdb::BlockBasedTable::ApproximateSize(const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::TableReaderCaller)") at assert.c:101
https://github.com/facebook/rocksdb/issues/4 0x00000000010ea9b4 in rocksdb::BlockBasedTable::ApproximateSize (this=<optimized out>, start=..., end=..., caller=<optimized out>) at table/block_based/block_based_table_reader.cc:3224
https://github.com/facebook/rocksdb/issues/5 0x0000000000be61fb in rocksdb::TableCache::ApproximateSize (this=0x60f0000161b0, start=..., end=..., fd=..., caller=caller@entry=rocksdb::kCompaction, internal_comparator=..., prefix_extractor=...) at db/table_cache.cc:719
https://github.com/facebook/rocksdb/issues/6 0x0000000000c3eaec in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, v=<optimized out>, f=..., start=..., end=..., caller=<optimized out>) at ./db/version_set.h:850
https://github.com/facebook/rocksdb/issues/7 0x0000000000c6ebc3 in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, options=..., v=v@entry=0x621000047500, start=..., end=..., start_level=start_level@entry=0, end_level=<optimized out>, caller=<optimized out>)
at db/version_set.cc:5657
https://github.com/facebook/rocksdb/issues/8 0x000000000166e894 in rocksdb::CompactionJob::GenSubcompactionBoundaries (this=<optimized out>) at ./include/rocksdb/options.h:1869
https://github.com/facebook/rocksdb/issues/9 0x000000000168c526 in rocksdb::CompactionJob::Prepare (this=this@entry=0x7f75f3ffcf00) at db/compaction/compaction_job.cc:546
```
The problem occurred in `ApproximateSize()` when the index `Seek()` for the first `ApproximateDataOffsetOf()` encountered an I/O error, while the second `Seek()` did not. In the old code that scenario caused `start_offset == data_size` , thus it was easy to trip the assertion that `end_offset >= start_offset`.
The fix is to set `start_offset == 0` when the first index `Seek()` fails, and `end_offset == data_size` when the second index `Seek()` fails. I doubt these give an "on average correct" answer for how this function is used, but I/O errors in index seeks are hopefully rare, it looked consistent with what was already there, and it was easier to calculate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9615
Test Plan:
run the repro command for a while and stopped seeing coredumps -
```
$ while ! ./db_stress --block_size=128 --cache_size=32768 --clear_column_family_one_in=0 --column_families=1 --continuous_verification_interval=0 --db=/dev/shm/rocksdb_crashtest --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --index_type=2 --iterpercent=10 --kill_random_test=18887 --max_key=1000000 --max_bytes_for_level_base=2048576 --nooverwritepercent=1 --open_files=-1 --open_read_fault_one_in=32 --ops_per_thread=1000000 --prefixpercent=5 --read_fault_one_in=0 --readpercent=45 --reopen=0 --skip_verifydb=1 --subcompactions=2 --target_file_size_base=524288 --test_batches_snapshots=0 --value_size_mult=32 --write_buffer_size=524288 --writepercent=35 ; do : ; done
```
Reviewed By: pdillinger
Differential Revision: D34383069
Pulled By: ajkr
fbshipit-source-id: fac26c3b20ea962e75387515ba5f2724dc48719f
Summary:
We found a case of cacheline bouncing due to writers locking/unlocking `mutex_` and readers accessing `block_cache_tracer_`. We discovered it only after the issue was fixed by https://github.com/facebook/rocksdb/issues/9462 shifting the `DBImpl` members such that `mutex_` and `block_cache_tracer_` were naturally placed in separate cachelines in our regression testing setup. This PR forces the cacheline alignment of `mutex_` so we don't accidentally reintroduce the problem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9637
Reviewed By: riversand963
Differential Revision: D34502233
Pulled By: ajkr
fbshipit-source-id: 46aa313b7fe83e80c3de254e332b6fb242434c07
Summary:
**Context:**
As part of https://github.com/facebook/rocksdb/pull/6949, file deletion is disabled for faulty database on the IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed](e66199d848 (diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R396)). Before re-enabling file deletion, it `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `version_` in the recovery process.
However, this is not necessarily true due to `s = error_handler_.ClearBGError();` happening before that assertion can unblock some foreground thread by [`EventHelpers::NotifyOnErrorRecoveryEnd()`](3122cb4358/db/error_handler.cc (L552-L553)) as part of the `ClearBGError()`. That foreground thread can do whatever it wants including closing/reopening the db and clean up that same `versions_`.
As a consequence, `assert(versions_->io_status().ok());`, will access `io_status()` of a nullptr and test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to [reopen the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494), where [`versions_` gets reset to nullptr](https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl.cc?fbclid=IwAR2uRhwBiPKgmE9q_6CM2mzbfwjoRgsGpXOrHruSJUDcAKc9rYZtVSvKdOY#L678) as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok()); ` gets excuted in the background thread, then we can see error like
```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```
**Summary:**
- I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up foreground, which I think is right before we LOG `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
- As the context, the orignal https://github.com/facebook/rocksdb/pull/3997 introducing `DBImpl::Resume()` calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` while the later https://github.com/facebook/rocksdb/pull/6949 distances these two calls a bit.
- And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);` at least syntax-wise since these two functions are orthogonal. And it also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which basically is resetting some state variables.
- In addition, to preserve the previous behavior of https://github.com/facebook/rocksdb/pull/6949 where status of re-enabling file deletion is not taken account into the general status of resuming the db, I separated `enable_file_deletion_s` from the general `s`
- In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` more clear, I separated it into its own if-block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9496
Test Plan:
- Manually reproduce the assertion failure in`DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting sleep like below so that it's more likely for `assert(versions_->io_status().ok());` to execute after [reopening the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494) in the foreground (i.e, testing) thread
```
sleep(1);
assert(versions_->io_status().ok());
```
`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
#0 rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext) /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421
https://github.com/facebook/rocksdb/issues/1 rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
https://github.com/facebook/rocksdb/issues/2 rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError() /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
https://github.com/facebook/rocksdb/issues/3 rocksdb/error_handler_fs_test()
```
- The assertion failure does not happen with PR
`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
`[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms) `
Reviewed By: riversand963, anand1976
Differential Revision: D33990099
Pulled By: hx235
fbshipit-source-id: 2e0259a471fa8892ff177da91b3e1c0792dd7bab