Summary:
PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure
`OnFlushCompleted()` is called after committing flush results to
MANIFEST. However, `flush_jobs_info_` is not updated in atomic
flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`.
This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D29913720
Pulled By: riversand963
fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
https://github.com/facebook/rocksdb/issues/8440 ,
https://github.com/facebook/rocksdb/issues/8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.
Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the
unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.
The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605
Test Plan: Ran `make check` and stress tests locally.
Reviewed By: pdillinger
Differential Revision: D30051035
Pulled By: ltamasi
fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
Summary:
ObjectLibrary is shared between multiple DB instances, the
Register() could have race condition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8574
Test Plan: pass the failed test
Reviewed By: ajkr
Differential Revision: D29855096
Pulled By: jay-zhuang
fbshipit-source-id: 541eed0bd495d2c963d858d81e7eabf1ba16153c
Summary:
There is an extra " in options.h (`"index block""`)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8550
Test Plan: None
Reviewed By: jay-zhuang
Differential Revision: D29746077
Pulled By: autopear
fbshipit-source-id: 2e5117296e5414b7c7440d990926bc1e567a0b4f
Summary:
When DB Stress enables write failure in reopen, WAL files are also created with a wrapper writalbe file which buffers write until fsync. However, crash test currently expects all writes to WAL is persistent. This is at odd with the unsynced bytes dropped. To work it around temporarily, we disable WAL write failure for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8548
Test Plan: Run db_stress. Manual printf to make sure only WAL files are skipped.
Reviewed By: jay-zhuang
Differential Revision: D29745095
fbshipit-source-id: 1879dd2c01abad7879ca243ee94570ec47c347f3
Summary:
Otherwise the build may report warning about missing
`benchmark.h` for some targets, the error won't break the build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8523
Test Plan:
`make blackbox_ubsan_crash_test` on a machine without
benchmark lib installed.
Reviewed By: pdillinger
Differential Revision: D29682478
Pulled By: jay-zhuang
fbshipit-source-id: e1261fbcda46bc6bd3cd39b7b03b7f78927d0430
Summary:
Some URIs for creating instances (ala SecondaryCache) use complex URIs like (cache://name;prop=value). These URIs were treated as name-value properties. With this change, if the URI does not contain an "id=XX" setting, it will be treated as a single string value (and not an ID and map of name-value properties).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8547
Reviewed By: anand1976
Differential Revision: D29741386
Pulled By: mrambacher
fbshipit-source-id: 0621f62bec3a6699a7b66c7c0b5634b2856653aa
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.
This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)
This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538
Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.
Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.
Before (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0)
rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0)
rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323
Notice the 5s P100 write time.
After (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0)
rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918
P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0)
rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832
No substantial difference.
Reviewed By: siying
Differential Revision: D29738847
Pulled By: pdillinger
fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
Summary:
Right now, db_bench with seekrandom and multiple DB setup creates iterator for all DBs just to query one of them. It's different from most real workloads. Fix it by only creating iterators that will be queried.
Also fix a bug that DBs are not destroyed in multi-DB mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7818
Test Plan: Run db_bench with single/multiDB X using/not using tailing iterator with ASAN build, and validate the behavior is expected.
Reviewed By: ajkr
Differential Revision: D25720226
fbshipit-source-id: c2ff7ff7120e5ba64287a30b057c5d29b2cbe20b
Summary:
Add `experiemental_allow_mempurge` flag support for `db_stress` and `db_crashtest.py`, with a `false` default value.
I succesfully tested locally both `whitebox` and `blackbox` crash tests with `experiemental_allow_mempurge` flag set as true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8545
Reviewed By: akankshamahajan15
Differential Revision: D29734513
Pulled By: bjlemaire
fbshipit-source-id: 24316c0eccf6caf409e95c035f31d822c66714ae
Summary:
Made the EncryptionProvider and BlockCipher classes inherit from Customizable. Added/fixed the CreateFromString method to these classes to create instances from builtin or registered classes. Added tests to verify that instances can be registered and retrieved as appropriate.
Added the ability to configure the builtin (CTR, ROT13) classes from configurable properties. Added the appropriate tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8354
Reviewed By: zhichao-cao
Differential Revision: D29558949
Pulled By: mrambacher
fbshipit-source-id: c20286b32d179777e060f51a58943e9b0cf81d04
Summary:
In https://github.com/facebook/rocksdb/issues/8539 I accidentally only checked for GCC TSAN, which is
what I tested locally, while CircleCI and FB CI use clang TSAN. Related:
other existing code like in stack_trace.cc only check for clang TSAN.
I've now standardized these to the GCC convention in port/lang.h, so now
#ifdef __SANITIZE_THREAD__
can check for any TSAN (assuming lang.h include)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8543
Test Plan:
Put an assert(false) in slice_test and look for the NOTE
about "signal-unsafe call", both GCC and clang. Eventually, CircleCI
TSAN in https://github.com/facebook/rocksdb/issues/8538
Reviewed By: zhichao-cao
Differential Revision: D29728483
Pulled By: pdillinger
fbshipit-source-id: 8a3b8015c2ed48078214c3ee17146a2c3f11c9f7
Summary:
In this PR, `mempurge` is made compatible with the Write Ahead Log: in case of recovery, the DB is now capable of recovering the data that was "mempurged" and kept in the `imm()` list of immutable memtables.
The twist was to add a uint64_t to the `memtable` struct to store the number of the earliest log file containing entries from the `memtable`. When a `Flush` operation is replaced with a `MemPurge`, the `VersionEdit` (which usually contains the new min log file number to pick up for recovery and the level 0 file path of the newly created SST file) is no longer appended to the manifest log, and every time the `deleteWal` method is called, a check is made on the list of immutable memtables.
This PR also includes a unit test that verifies that no data is lost upon Reopening of the database when the mempurge feature is activated. This extensive unit test includes two column families, with valid data contained in the imm() at time of "crash"/reopening (recovery).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8528
Reviewed By: pdillinger
Differential Revision: D29701097
Pulled By: bjlemaire
fbshipit-source-id: 072a900fb6ccc1edcf5eef6caf88f3060238edf9
Summary:
Some bits are mutated and read while holding a lock, other
immutable bits (esp. secondary cache compatibility) can be read by
arbitrary threads without holding a lock. AFAIK, this doesn't cause an
issue on any architecture we care about, because you will get some
legitimate version of the value that includes the initialization, as
long as synchronization guarantees the initialization happens before the
read.
I've only seen this in https://github.com/facebook/rocksdb/issues/8538 so far, but it should be fixed regardless.
Otherwise, we'll surely get these false reports again some time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8539
Test Plan: some local TSAN test runs and in CircleCI
Reviewed By: zhichao-cao
Differential Revision: D29720262
Pulled By: pdillinger
fbshipit-source-id: 365fd7e565577c648815161f71b339bcb5ce12d5
Summary:
Some cmake and test configuration are set in pre-steps
enviroment variables. Add the missing steps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8524
Test Plan: CI pass
Reviewed By: siying
Differential Revision: D29682731
Pulled By: jay-zhuang
fbshipit-source-id: afda1acf6a7b76989db450442b0b27f387388b9d
Summary:
The removed function in this PR, just only have declared and dose not have any reference used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8508
Reviewed By: mrambacher
Differential Revision: D29649033
Pulled By: jay-zhuang
fbshipit-source-id: df98143b73d6c184a2a60c9f7ea2548a065ee35d
Summary:
This PR is for https://github.com/facebook/rocksdb/issues/8453
We need to update `s = biter.status();` when `biter.status().IsIncomplete()` is true. By doing this, can fix the problem in issue.
Besides, we still need to update `db_statistics` in `get_context.ReportCounters()` before return back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8485
Reviewed By: jay-zhuang
Differential Revision: D29604835
Pulled By: ajkr
fbshipit-source-id: c7f2f1cd058223ce1b507ec05d57cf264b9c9710
Summary:
Fixed a few MSVC (VCToolsVersion=14.0) build errors and warnings
* `DEFINE_string` is a macro and VC compiler complains that it cannot put [ifdef-inside-define](https://stackoverflow.com/questions/5586429/ifdef-inside-define)
* `sleep()` is not a recognizable function. Use `FLAGS_env->SleepForMicroseconds` instead
* Define precise type in comparison to avoid mismatch warning
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8519
Reviewed By: jay-zhuang
Differential Revision: D29683086
fbshipit-source-id: 8c80941472089f8daba84ae29597e75e603850e4
Summary:
Bumps [addressable](https://github.com/sporkmonger/addressable) from 2.7.0 to 2.8.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/sporkmonger/addressable/blob/main/CHANGELOG.md">addressable's changelog</a>.</em></p>
<blockquote>
<h1>Addressable 2.8.0</h1>
<ul>
<li>fixes ReDoS vulnerability in Addressable::Template#match</li>
<li>no longer replaces <code>+</code> with spaces in queries for non-http(s) schemes</li>
<li>fixed encoding ipv6 literals</li>
<li>the <code>:compacted</code> flag for <code>normalized_query</code> now dedupes parameters</li>
<li>fix broken <code>escape_component</code> alias</li>
<li>dropping support for Ruby 2.0 and 2.1</li>
<li>adding Ruby 3.0 compatibility for development tasks</li>
<li>drop support for <code>rack-mount</code> and remove Addressable::Template#generate</li>
<li>performance improvements</li>
<li>switch CI/CD to GitHub Actions</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="6469a232c0"><code>6469a23</code></a> Updating gemspec again</li>
<li><a href="24336385de"><code>2433638</code></a> Merge branch 'main' of github.com:sporkmonger/addressable into main</li>
<li><a href="e9c76b8897"><code>e9c76b8</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sporkmonger/addressable/issues/378">https://github.com/facebook/rocksdb/issues/378</a> from ashmaroli/flat-map</li>
<li><a href="56c5cf7ece"><code>56c5cf7</code></a> Update the gemspec</li>
<li><a href="c1fed1ca0a"><code>c1fed1c</code></a> Require a non-vulnerable rake</li>
<li><a href="0d8a3127e3"><code>0d8a312</code></a> Adding note about ReDoS vulnerability</li>
<li><a href="89c76130ce"><code>89c7613</code></a> Merge branch 'template-regexp' into main</li>
<li><a href="cf8884f815"><code>cf8884f</code></a> Note about alias fix</li>
<li><a href="bb03f7112e"><code>bb03f71</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sporkmonger/addressable/issues/371">https://github.com/facebook/rocksdb/issues/371</a> from charleystran/add_missing_encode_component_doc_entry</li>
<li><a href="6d1d8094a6"><code>6d1d809</code></a> Adding note about :compacted normalization</li>
<li>Additional commits viewable in <a href="https://github.com/sporkmonger/addressable/compare/addressable-2.7.0...addressable-2.8.0">compare view</a></li>
</ul>
</details>
<br />
[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=addressable&package-manager=bundler&previous-version=2.7.0&new-version=2.8.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `dependabot rebase` will rebase this PR
- `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `dependabot merge` will merge this PR after your CI passes on it
- `dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `dependabot cancel merge` will cancel a previously requested merge and block automerging
- `dependabot reopen` will reopen this PR if it is closed
- `dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- `dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
- `dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
- `dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
- `dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/facebook/rocksdb/network/alerts).
</details>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8515
Reviewed By: jay-zhuang
Differential Revision: D29668988
Pulled By: ajkr
fbshipit-source-id: c4b7abd4a879a7b562cb8ba745088dba6644f503
Summary:
MyRocks apparently uses valgrind to check for unreachable
unfreed data, which is stricter than our valgrind checks. Internal ref:
D29257815
This patch adds valgrind support to STATIC_AVOID_DESTRUCTION so that it's
not reported with those stricter checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8503
Test Plan:
make valgrind_test
Also, with modified VALGRIND_OPTS (see Makefile), more kinds of
failures seen before than after this commit.
Reviewed By: ajkr, yizhang82
Differential Revision: D29597784
Pulled By: pdillinger
fbshipit-source-id: 360de157a176aec4d1be99ca20d160ecd47c0873
Summary:
1. Fix printing of stats when there are no writes (wamp=0). Previously had a div0 error
2. Added multireadrandom command as a valid target
3. Added ability to pass additional command line options to db_bench. Now can say things like benchmark.sh readrandom --mmap_read and the option will be passed to db_bench.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8346
Reviewed By: zhichao-cao
Differential Revision: D29500436
Pulled By: mrambacher
fbshipit-source-id: 54e90708aae9133be3a903e35efdf8f8abbd86fa
Summary:
The MemPurge output status can either be an Abort if the mempurge is aborted due to the new_mem memtable reaching more than the target capacity (currently 60%), or for other reasons. As a result, in the log, we want to differentiate between an abort status, which in this PR only leads to a ROCKS_LOG_INFO, and any other status, which in this PR leads to a ROCKS_LOG_WARN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8514
Reviewed By: pdillinger
Differential Revision: D29662446
Pulled By: bjlemaire
fbshipit-source-id: c9bec8e238ebc7ecb14fbbddf580e6887e281c16
Summary:
When db is open as secondary, there are basically 2 step process:
1) Collect column families from wal log
2) Apply changes to Memtable
In case primary db is TransactionDB instance, wal log will contain some additional data, like noop, etc. ColumnFamilyCollector doesn't implement methods to handle these, so it fails to open a wal log written by TransactionDB. (Everything works fine with standard DB::Open).
Memtable recovery process knows how to handle such wal logs, so only missing piece seems to be ColumnFamilyCollector.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8456
Reviewed By: ajkr
Differential Revision: D29455945
Pulled By: mrambacher
fbshipit-source-id: 5b29560fcbc008e17e95d0dc4b07558f3d63e26f
Summary:
In ```DBImpl::WriteImpl()```, we call ```PreprocessWrite()``` which, among other things, checks the BG error and returns it set. This return status is later on passed to ```WriteStatusCheck()```, which calls ```SetBGError()```. This results in a spurious call, and info logs, on every user write request. We should avoid passing the ```PreprocessWrite()``` return status to ```WriteStatusCheck()```, as the former would have called ```SetBGError()``` already if it encountered any new errors, such as error when creating a new WAL file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8511
Test Plan: Run existing tests
Reviewed By: zhichao-cao
Differential Revision: D29639917
Pulled By: anand1976
fbshipit-source-id: 19234163969e1645dbeb273712aaf5cd9ea2b182
Summary:
In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`.
The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505
Reviewed By: pdillinger
Differential Revision: D29619283
Pulled By: bjlemaire
fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95
Summary:
Add google benchmark for microbench.
Add ribbon_bench for benchmark ribbon filter vs. other filters.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8493
Test Plan:
added test to CI
To run the benchmark on devhost:
Install benchmark: `$ sudo dnf install google-benchmark-devel`
Build and run:
`$ ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make microbench`
or with cmake:
`$ mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_BENCHMARK=1 && make microbench`
Reviewed By: pdillinger
Differential Revision: D29589649
Pulled By: jay-zhuang
fbshipit-source-id: 8fed13b562bef4472f161ecacec1ab6b18911dff
Summary:
… small overwritten files.
If a file is overwritten with renamed and the parent path is not synced, FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() will delete the file. However, RocksDB relies on file renaming to be atomic no matter whether the parent directory is synced or not, and the current behavior breaks the assumption and caused some false positive: https://github.com/facebook/rocksdb/pull/8489
Since the atomic renaming is used in CURRENT files, to fix the problem, in FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync(), we recover the state of overwritten file if the file is small.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8501
Test Plan: Run stress test for a while and see it doesn't break.
Reviewed By: anand1976
Differential Revision: D29594384
fbshipit-source-id: 589b5c2f0a9d2aca53752d7bdb0231efa5b3ae92
Summary:
Various tests had disabled valgrind due to it slowing down and timing
out (as is the case right now) the CI runs. Where a test was disabled with no comment,
I assumed slowness was the cause. For these tests that were slow under
valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them
behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475
Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly
Reviewed By: jay-zhuang
Differential Revision: D29504843
Pulled By: ajkr
fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7
Summary:
Previously, the following command:
```USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j$(nproc) analyze```
was raising an error/warning the new_mem could potentially be a `nullptr`. This error appeared due to code changes from https://github.com/facebook/rocksdb/issues/8454, including an if-statement containing "`... && new_mem != nullptr && ...`", which made the analyzer believe that past this `if`-statement, a `new_mem==nullptr` was a possible scenario.
This code patch simply introduces `assert`s and removes this condition in the `if`-statement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8492
Reviewed By: jay-zhuang
Differential Revision: D29571275
Pulled By: bjlemaire
fbshipit-source-id: 75d72246b70ebbbae7dea11ccb5778686d8bcbea
Summary:
```TestFSWritableFile``` buffers data in ```Append``` in order to simulate unsynced data loss on crash. This is only required for buffered IO and should be disabled for direct IO. Otherwise, it causes crash tests to assert on the buffer address alignment - ```db_stress: env/io_posix.cc:1194: virtual rocksdb::IOStatus rocksdb::PosixWritableFile::Append(const rocksdb::Slice&, const rocksdb::IOOptions&, rocksdb::IODebugContext*): Assertion `IsSectorAligned(data.data(), GetRequiredBufferAlignment())' failed.```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8490
Reviewed By: zhichao-cao
Differential Revision: D29565080
Pulled By: anand1976
fbshipit-source-id: 682831fd66ed3b9597caa74fc453e22dfaf9b973
Summary:
Write and metadata error injection during DB open was enabled in https://github.com/facebook/rocksdb/issues/8474. This causes crash tests to fail very frequently due to another fault injection feature that deletes files created after the last dir sync during DB open. In real life, a similar failure would happen if the FS returns error on the CURRENT file rename, but the rename actually succeeded and got partially persisted (dir entry for the old CURRENT file got removed, but the entry for the new one is not persisted). Temporarily disable the fault injection feature until we figure out the likelihood of this bug happening and the proper way to fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8489
Test Plan: Stress test can open the DB successfully
Reviewed By: siying
Differential Revision: D29564516
Pulled By: anand1976
fbshipit-source-id: ffd1650715ea3c5bf7131936b0ca6fcf66f4e14e
Summary:
Inject read failures in DB reopen, just as what we do for metadata writes and writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476
Test Plan: Some manual tests and make sure failures are triggered.
Reviewed By: anand1976
Differential Revision: D29507283
fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c
Summary:
We ended up using a different approach for tracking the amount of
garbage in blob files (see e.g. https://github.com/facebook/rocksdb/pull/8450),
so the ability to apply only a range of table property collectors is
now unnecessary. The patch reverts this part of
https://github.com/facebook/rocksdb/pull/8298 while keeping the cleanup done
in that PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8465
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D29399921
Pulled By: ltamasi
fbshipit-source-id: af64816c357d0829b9d7ba8ca1477038138f6f0a
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
Summary:
add the injest_error_severity to control if it is a retryable IO Error or a fatal or unrecoverable error. Use a flag to indicate, if fatal error comes, the flag is set and db is stopped (but not corrupted).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8479
Test Plan: run ./db_stress --reopen=0 --read_fault_one_in=1000 --write_fault_one_in=5 --disable_wal=true --write_buffer_size=3000000 -writepercent=5 -readpercent=50 --injest_error_severity=2 --column_families=1, make check
Reviewed By: anand1976
Differential Revision: D29524271
Pulled By: zhichao-cao
fbshipit-source-id: 1aa9fb9b5655b0adba6f5ad12005ca8c074c795b