Summary:
Use DB session ids in SST table properties to make cache keys
stable across DB re-open and copy / move / restore / etc.
These new cache keys are currently only enabled when FileSystem does not
provide GetUniqueId. For now, they are typically larger, so slightly
less efficient.
Relevant to https://github.com/facebook/rocksdb/issues/7405
This change has a minor regression in PersistentCache functionality:
metaindex blocks are no longer cached in PersistentCache. Table properties
blocks already were not but ideally should be. I didn't spent effort to
fix & test these issues because we don't believe PersistentCache is used much
if at all and expect SecondaryCache to replace it. (Though PRs are welcome.)
FIXME: there is more to be fixed for stable cache keys on external SST files
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8659
Test Plan:
new unit test added, which fails when disabling new
functionality
Reviewed By: zhichao-cao
Differential Revision: D30297705
Pulled By: pdillinger
fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a
Summary:
Previously, the `MemPurge` sampling function was assessing whether a random entry from a memtable was garbage or not by simply querying the given memtable (see https://github.com/facebook/rocksdb/issues/8628 for more details).
In this diff, I am updating the sampling function by querying not only the memtable the entry was drawn from, but also all subsequent memtables that have a greater memtable ID.
I also added the size of the value for KV entries in the payload/useful payload estimates (which was also one of the reasons why sampling was not as good as mempurging all the time in terms of L0 SST files reduction).
Once these changes were made, I was able to clean obsolete objects and functions from the `MemtableList` struct, and did a bit of cleanup everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8656
Reviewed By: pdillinger
Differential Revision: D30288583
Pulled By: bjlemaire
fbshipit-source-id: 7646a545ec56f4715949daa59ab5eee74540feb3
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.
User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
Reviewed By: ajkr
Differential Revision: D30266329
Pulled By: autopear
fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
Reviewed By: pdillinger
Differential Revision: D30149315
Pulled By: bjlemaire
fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
Summary:
The patch attempts to deflake `DBTestXactLogIterator.TransactionLogIteratorCorruptedLog`
by disabling file deletions while retrieving the list of WAL files and truncating the first WAL file.
This is to prevent the `PurgeObsoleteFiles` call triggered by `GetSortedWalFiles` from
invalidating the result of `GetSortedWalFiles`. The patch also cleans up the test case a bit
and changes it to using `test::TruncateFile` instead of calling the `truncate` syscall directly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8627
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D30147002
Pulled By: ltamasi
fbshipit-source-id: db11072a4ad8900a2f859cb5294e22b1888c23f6
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310
Test Plan: Add several unit tests.
Reviewed By: ajkr
Differential Revision: D28493792
fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
Summary:
`CompareKeyContext::operator()` on the trunk has a bug: when comparing
column family IDs, `lhs` is used for both sides of the comparison. This
results in the `KeyContext`s getting sorted solely based on key, which
in turn means that keys with the same column family do not necessarily
form a single range in the sorted list. This violates an assumption of the
batched `MultiGet` logic, leading to the same column family
showing up multiple times in the list of `MultiGetColumnFamilyData`.
The end result is the code attempting to check out the thread-local
`SuperVersion` for the same CF multiple times, causing an
assertion violation in debug builds and memory corruption/crash in
release builds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D30169182
Pulled By: ltamasi
fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
- Added Options/Configurable/Object Registration for TTL and Cassandra variants
- Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option
- Added tests for new functionality
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481
Reviewed By: zhichao-cao
Differential Revision: D30136050
Pulled By: mrambacher
fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
Summary:
We've been seeing occasional crashes on CI while inserting into the
vectors in `ObsoleteFilesTest.DeleteObsoleteOptionsFile`. The crashes
don't reproduce locally (could be either a race or an object lifecycle
issue) but the good news is that the vectors in question are not really
used for anything meaningful by the test. (The assertion about the sizes
of the two vectors being equal is guaranteed to hold, since the two sync
points where they are populated are right after each other.) The patch
simply removes the vectors from the test, alongside the associated
callbacks and sync points.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8624
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D30118485
Pulled By: ltamasi
fbshipit-source-id: 0a4c3d06584e84cd2b1dcc212d274fa1b89cb647
Summary:
Previously we attempted to rename "LOG" to "LOG.old.*" without checking
its existence first. "LOG" had no reason to exist in a new DB.
Errors in renaming a non-existent "LOG" were swallowed via
`PermitUncheckedError()` so things worked. However the storage service's
error monitoring was detecting all these benign rename failures. So it
is better to fix it. Also with this PR we can now distinguish rename failure
for other reasons and return them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622
Test Plan: new unit test
Reviewed By: akankshamahajan15
Differential Revision: D30115189
Pulled By: ajkr
fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170
Summary:
PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure
`OnFlushCompleted()` is called after committing flush results to
MANIFEST. However, `flush_jobs_info_` is not updated in atomic
flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`.
This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D29913720
Pulled By: riversand963
fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda
Summary:
Insert warm blocks (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561
Test Plan: Added unit test
Reviewed By: anand1976
Differential Revision: D29773411
Pulled By: akankshamahajan15
fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66
Summary:
The db_stress crash was caused by a call to `IsFlushPending()` made by a stats function which triggered an `assert([false])`, which I didn't plan when I created the `trigger_flush` bool. It turns out that this bool variable is not useful: I created it because I thought the `imm_flush_needed` atomic bool would actually trigger a flush.
It turns out that this bool is only checked in `IsFlushPending` - this is its only use - and a flush is triggered by either a background thread checking on the imm array, or by an explicit call to `SchedulePendingFlush` which creates a flush request, that is then added to a flush request queue.
In this PR, I reverted the MemtableList::Add function to what it was before my changes.
I tested the fix by running the exact command line that deterministically triggered the assert error (see below), which confirmed that this is where the error was coming from.
I also run `db_crashtest.py whitebox` and `blackbox` for a couple hours locally before committing this PR.
Experiment run:
```./db_stress --acquire_snapshot_one_in=0 --allow_concurrent_memtable_write=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=76.90653425292307 --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_ttl=2 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --enable_compaction_filter=1 --enable_pipelined_write=0 --expected_values_path=/dev/shm/rocksdb/rocksdb_crashtest_expected --experimental_allow_mempurge=1 --experimental_mempurge_policy=kAlternate --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --long_running_snapshots=1 --mark_for_compaction_one_file_in=10 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=-1 --prefixpercent=0 --progress_reports=0 --read_fault_one_in=0 --readpercent=60 --recycle_log_file_num=1 --reopen=20 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync=1 --sync_fault_injection=False --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=3 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_ribbon_filter=1 --user_timestamp_size=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8604
Reviewed By: pdillinger
Differential Revision: D30047295
Pulled By: bjlemaire
fbshipit-source-id: b9e379bfa3d6b9bd2b275725fb0bca4bd81a3dbe
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
https://github.com/facebook/rocksdb/issues/8440 ,
https://github.com/facebook/rocksdb/issues/8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.
Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the
unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.
The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605
Test Plan: Ran `make check` and stress tests locally.
Reviewed By: pdillinger
Differential Revision: D30051035
Pulled By: ltamasi
fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
Summary:
Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.
After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).
Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582
Reviewed By: pdillinger, autopear
Differential Revision: D29881122
Pulled By: mrambacher
fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
Summary:
This PR tries to remove some unnecessary checks as well as unreachable code blocks to
improve readability. An obvious non-public API method naming typo is also corrected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565
Test Plan: make check
Reviewed By: lth
Differential Revision: D29963984
Pulled By: riversand963
fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a
Summary:
Calling the GetImpl function could leave reference to a local
callback function in a field of a parameter struct. As this is
performance-critical code, I'm not going to attempt to sanitize this
code too much, but make the existing hack a bit cleaner by reverting
what it overwrites in the input struct.
Added SaveAndRestore utility class to make that easier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590
Test Plan:
added unit test for SaveAndRestore; existing tests for
GetImpl
Reviewed By: riversand963
Differential Revision: D29947983
Pulled By: pdillinger
fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb
Summary:
If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done.
Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8591
Test Plan: Run all existing tests
Reviewed By: pdillinger
Differential Revision: D29969412
fbshipit-source-id: d5f42b5271608a35b9b07687ce18157d7447b0de
Summary:
* Basic handling of SST file with just range tombstones rather than
failing assertion about smallest_seqno <= largest_seqno
* Adds --verbose option so that there exists a way to see the INFO
output from Repairer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8544
Test Plan: unit test added, manual testing for --verbose
Reviewed By: ajkr
Differential Revision: D29954805
Pulled By: pdillinger
fbshipit-source-id: 696af25805fc36cc178b04ba6045922a22625fd9
Summary:
Internal task T96186510.
Created new inline member functions in `CompactionIterator`,
`DefinitelyInSnapshot`, `DefinitelyNotInSnapshot`, and
`InEarliestSnapshot` to replace the macros at the top of
`compaction_iterator.cc`.
Placed the definitions in `compaction_iterator.h` in accordance with
Google's style guide for inline functions. Separated the declarations
and definitions, and only placed the `inline` keyword on the
definitions, in line with ISO CPP recommendations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8592
Test Plan: Ran `make check`. Successful build and all tests appeared to pass.
Reviewed By: riversand963
Differential Revision: D29966782
Pulled By: jimmycFB
fbshipit-source-id: 3584290bbbabf862e9ab58852281f46d37f58be6
Summary:
FileOptions has an implicit conversion from EnvOptions and some
internal APIs take `const FileOptions&` and save the reference, which is
counter to Google C++ guidelines,
> Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements.
This is at least a problem for repair.cc, which passes an EnvOptions to
TableCache(), which would save a reference to the temporary copy as
FileOptions. This was unfortunately only caught as a side effect of
changes in https://github.com/facebook/rocksdb/issues/8544.
This change fixes the repair.cc case and updates the involved internal
APIs that save a reference to use `const FileOptions*` instead.
Unfortunately, I don't know how to get any of our sanitizers to reliably
report bugs like this, so I can't rule out more existing in our
codebase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571
Test Plan:
Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on
AWS EC2)
Reviewed By: ajkr
Differential Revision: D29943890
Pulled By: pdillinger
fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9
Summary:
This appears to be little used code so not a major bug, but is
blocking https://github.com/facebook/rocksdb/issues/8544
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8589
Test Plan:
Added regression test to the end of
DBRangeDelTest::TableEvictedDuringScan. Without this fix, ASAN reports
memory leak.
Reviewed By: ajkr
Differential Revision: D29943623
Pulled By: pdillinger
fbshipit-source-id: f7115fa6d4440aef83888ff609aa03d09216463b
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
Reviewed By: zhichao-cao
Differential Revision: D29901488
Pulled By: mrambacher
fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
Summary:
Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE.
`ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput).
`ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios.
This PR also includes support for `experimental_mempurge_policy` in `db_bench`.
Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583
Reviewed By: pdillinger
Differential Revision: D29888050
Pulled By: bjlemaire
fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1
Summary:
event log info may be truncated, the default buffer size is 512, this PR changes buffer size to 8192.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8563
Reviewed By: ajkr
Differential Revision: D29838229
Pulled By: jay-zhuang
fbshipit-source-id: 00c5dea3caff0641a209f02c972e92d65b505f50
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555
Reviewed By: ajkr
Differential Revision: D29758399
Pulled By: jay-zhuang
fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
Summary:
The main challenge to make the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge.
By not writing VersionEdits to the `MANIFEST` file after a succesful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are not obscolete/redundant. (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558
Test Plan: Already included in `db_flush_test.cc`.
Reviewed By: anand1976
Differential Revision: D29764351
Pulled By: bjlemaire
fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437
Summary:
If the primary's CURRENT file is missing or inaccessible, the secondary should not hang
trying repeatedly to switch to the next MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D27840627
Pulled By: riversand963
fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1
Summary:
Rare TSAN and valgrind failures are caused by unnecessary
reading of a field on the TaskLimiterToken::limiter_ for an assertion
after the token has been released and the limiter destroyed. To simplify
we can simply destroy the token before triggering DB shutdown
(potentially destroying the limiter). This makes the ReleaseOnce logic
unnecessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8567
Test Plan: watch for more failures in CI
Reviewed By: ajkr
Differential Revision: D29811795
Pulled By: pdillinger
fbshipit-source-id: 135549ebb98fe4f176d1542ed85d5bd6350a40b3
Summary:
Try avoid expensive updating options operation if
`SetDBOptions()` does not change any option value.
Skip updating is not guaranteed, for example, changing `bytes_per_sync`
to `0` may still trigger updating, as the value could be sanitized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518
Test Plan: added unittest
Reviewed By: riversand963
Differential Revision: D29672639
Pulled By: jay-zhuang
fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4
Summary:
If we want to check whether a Status s is NoSpace() or not, we should check the subcode instread of using s==Status::NoSpace(). Fix some of the incorrect check in the ErrorHandler.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8504
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D29601764
Pulled By: zhichao-cao
fbshipit-source-id: cdab56a827891c23746bba9cbb53f169fe35f086
Summary:
Currently, the code shows that we delete memtables immedately after it is trimmed from history. Although it should never happen as the super version still holds the memtable, which is only switched after it, it feels a good practice not to do it, but use clean it up in the standard way: put it to WriteContext and clean it after DB mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8530
Test Plan: Run all existing tests.
Reviewed By: ajkr
Differential Revision: D29703410
fbshipit-source-id: 21d8068ac6377de4b6fa7a89697195742659fde4
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.
This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)
This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538
Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.
Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.
Before (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0)
rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0)
rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323
Notice the 5s P100 write time.
After (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0)
rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918
P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0)
rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832
No substantial difference.
Reviewed By: siying
Differential Revision: D29738847
Pulled By: pdillinger
fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
Summary:
Made the EncryptionProvider and BlockCipher classes inherit from Customizable. Added/fixed the CreateFromString method to these classes to create instances from builtin or registered classes. Added tests to verify that instances can be registered and retrieved as appropriate.
Added the ability to configure the builtin (CTR, ROT13) classes from configurable properties. Added the appropriate tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8354
Reviewed By: zhichao-cao
Differential Revision: D29558949
Pulled By: mrambacher
fbshipit-source-id: c20286b32d179777e060f51a58943e9b0cf81d04
Summary:
In this PR, `mempurge` is made compatible with the Write Ahead Log: in case of recovery, the DB is now capable of recovering the data that was "mempurged" and kept in the `imm()` list of immutable memtables.
The twist was to add a uint64_t to the `memtable` struct to store the number of the earliest log file containing entries from the `memtable`. When a `Flush` operation is replaced with a `MemPurge`, the `VersionEdit` (which usually contains the new min log file number to pick up for recovery and the level 0 file path of the newly created SST file) is no longer appended to the manifest log, and every time the `deleteWal` method is called, a check is made on the list of immutable memtables.
This PR also includes a unit test that verifies that no data is lost upon Reopening of the database when the mempurge feature is activated. This extensive unit test includes two column families, with valid data contained in the imm() at time of "crash"/reopening (recovery).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8528
Reviewed By: pdillinger
Differential Revision: D29701097
Pulled By: bjlemaire
fbshipit-source-id: 072a900fb6ccc1edcf5eef6caf88f3060238edf9
Summary:
The removed function in this PR, just only have declared and dose not have any reference used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8508
Reviewed By: mrambacher
Differential Revision: D29649033
Pulled By: jay-zhuang
fbshipit-source-id: df98143b73d6c184a2a60c9f7ea2548a065ee35d
Summary:
This PR is for https://github.com/facebook/rocksdb/issues/8453
We need to update `s = biter.status();` when `biter.status().IsIncomplete()` is true. By doing this, can fix the problem in issue.
Besides, we still need to update `db_statistics` in `get_context.ReportCounters()` before return back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8485
Reviewed By: jay-zhuang
Differential Revision: D29604835
Pulled By: ajkr
fbshipit-source-id: c7f2f1cd058223ce1b507ec05d57cf264b9c9710
Summary:
The MemPurge output status can either be an Abort if the mempurge is aborted due to the new_mem memtable reaching more than the target capacity (currently 60%), or for other reasons. As a result, in the log, we want to differentiate between an abort status, which in this PR only leads to a ROCKS_LOG_INFO, and any other status, which in this PR leads to a ROCKS_LOG_WARN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8514
Reviewed By: pdillinger
Differential Revision: D29662446
Pulled By: bjlemaire
fbshipit-source-id: c9bec8e238ebc7ecb14fbbddf580e6887e281c16
Summary:
When db is open as secondary, there are basically 2 step process:
1) Collect column families from wal log
2) Apply changes to Memtable
In case primary db is TransactionDB instance, wal log will contain some additional data, like noop, etc. ColumnFamilyCollector doesn't implement methods to handle these, so it fails to open a wal log written by TransactionDB. (Everything works fine with standard DB::Open).
Memtable recovery process knows how to handle such wal logs, so only missing piece seems to be ColumnFamilyCollector.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8456
Reviewed By: ajkr
Differential Revision: D29455945
Pulled By: mrambacher
fbshipit-source-id: 5b29560fcbc008e17e95d0dc4b07558f3d63e26f
Summary:
In ```DBImpl::WriteImpl()```, we call ```PreprocessWrite()``` which, among other things, checks the BG error and returns it set. This return status is later on passed to ```WriteStatusCheck()```, which calls ```SetBGError()```. This results in a spurious call, and info logs, on every user write request. We should avoid passing the ```PreprocessWrite()``` return status to ```WriteStatusCheck()```, as the former would have called ```SetBGError()``` already if it encountered any new errors, such as error when creating a new WAL file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8511
Test Plan: Run existing tests
Reviewed By: zhichao-cao
Differential Revision: D29639917
Pulled By: anand1976
fbshipit-source-id: 19234163969e1645dbeb273712aaf5cd9ea2b182
Summary:
In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`.
The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505
Reviewed By: pdillinger
Differential Revision: D29619283
Pulled By: bjlemaire
fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95
Summary:
Various tests had disabled valgrind due to it slowing down and timing
out (as is the case right now) the CI runs. Where a test was disabled with no comment,
I assumed slowness was the cause. For these tests that were slow under
valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them
behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475
Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly
Reviewed By: jay-zhuang
Differential Revision: D29504843
Pulled By: ajkr
fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7
Summary:
Previously, the following command:
```USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j$(nproc) analyze```
was raising an error/warning the new_mem could potentially be a `nullptr`. This error appeared due to code changes from https://github.com/facebook/rocksdb/issues/8454, including an if-statement containing "`... && new_mem != nullptr && ...`", which made the analyzer believe that past this `if`-statement, a `new_mem==nullptr` was a possible scenario.
This code patch simply introduces `assert`s and removes this condition in the `if`-statement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8492
Reviewed By: jay-zhuang
Differential Revision: D29571275
Pulled By: bjlemaire
fbshipit-source-id: 75d72246b70ebbbae7dea11ccb5778686d8bcbea
Summary:
We ended up using a different approach for tracking the amount of
garbage in blob files (see e.g. https://github.com/facebook/rocksdb/pull/8450),
so the ability to apply only a range of table property collectors is
now unnecessary. The patch reverts this part of
https://github.com/facebook/rocksdb/pull/8298 while keeping the cleanup done
in that PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8465
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D29399921
Pulled By: ltamasi
fbshipit-source-id: af64816c357d0829b9d7ba8ca1477038138f6f0a
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
Summary:
Change the job_id for remote compaction interface, which will include
both internal compaction job_id, also a sub_compaction_job_id. It is not
a backward compatible change. The user needs to update interface during
upgrade. (We will avoid backward incompatible change after the feature is
not experimental.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8364
Reviewed By: ajkr
Differential Revision: D28917301
Pulled By: jay-zhuang
fbshipit-source-id: 6d72a21f652bb517ad6954d0387b496797fc4e11
Summary:
Added BlobMetaData to ColumnFamilyMetaData and LiveBlobMetaData and DB API GetLiveBlobMetaData to retrieve it.
First pass at struct. More tests and maybe fields to come...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8273
Reviewed By: ltamasi
Differential Revision: D29102400
Pulled By: mrambacher
fbshipit-source-id: 8a2383a4446328be6b91dced9841fdd3dfc80b73
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead.
In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412
Test Plan: make check, add new testing cases.
Reviewed By: anand1976
Differential Revision: D29151545
Pulled By: zhichao-cao
fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
Summary:
Provide support for Merge operation with base values during
Compaction in IntegratedBlobDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8445
Test Plan: Add new unit test
Reviewed By: ltamasi
Differential Revision: D29343949
Pulled By: akankshamahajan15
fbshipit-source-id: 844f6f02f93388a11e6e08bda7bb3a2a28e47c70
Summary:
The patch builds on `BlobGarbageMeter` and `BlobCountingIterator`
(introduced in https://github.com/facebook/rocksdb/issues/8426 and
https://github.com/facebook/rocksdb/issues/8443 respectively)
and ties it all together. It measures the amount of garbage
generated by a compaction and logs the corresponding `BlobFileGarbage`
records as part of the compaction job's `VersionEdit`. Note: in order
to have accurate results, `kRemoveAndSkipUntil` for compaction filters
is implemented using iteration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8450
Test Plan: Ran `make check` and the crash test script.
Reviewed By: jay-zhuang
Differential Revision: D29338207
Pulled By: ltamasi
fbshipit-source-id: 4381c432ac215139439f6d6fb801a6c0e4d8c128
Summary:
`VersionSet::VerifyCompactionFileConsistency` was superseded by the LSM tree
consistency checks introduced in https://github.com/facebook/rocksdb/pull/6901,
which are more comprehensive, more efficient, and are performed unconditionally
even in release builds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8449
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D29337441
Pulled By: ltamasi
fbshipit-source-id: a05324f88e3400e27e6a00406c878a6276e0c9cc
Summary:
Follow-up to https://github.com/facebook/rocksdb/issues/8426 .
The patch adds a new kind of `InternalIterator` that wraps another one and
passes each key-value encountered to `BlobGarbageMeter` as inflow.
This iterator will be used as an input iterator for compactions when the input
SSTs reference blob files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8443
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D29311987
Pulled By: ltamasi
fbshipit-source-id: b4493b4c0c0c2e3c2ecc33c8969a5ef02de5d9d8
Summary:
This reverts commit 25be1ed66a.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8438
Test Plan: Run the impacted mysql test 40 times
Reviewed By: ajkr
Differential Revision: D29286247
Pulled By: jay-zhuang
fbshipit-source-id: d3bd056971a19a8b012d5d0295fa045c012b3c04
Summary:
Currently, blob file checksums are incorrectly dumped as raw bytes
in the `ldb manifest_dump` output (i.e. they are not printed as hex).
The patch fixes this and also updates some test cases to reflect that
the checksum value field in `BlobFileAddition` and `SharedBlobFileMetaData`
contains the raw checksum and not a hex string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8437
Test Plan:
`make check`
Tested using `ldb manifest_dump`
Reviewed By: akankshamahajan15
Differential Revision: D29284170
Pulled By: ltamasi
fbshipit-source-id: d11cfb3435b14cd73c8a3d3eb14fa0f9fa1d2228
Summary:
`DeleteFilesInRange()` marks deleting files to `being_compacted`
before deleting, which may cause ongoing compactions report corruption
exception or ASSERT for debug build.
Adding the missing `ComputeCompactionScore()` when `being_compacted` is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8434
Test Plan: Unittest
Reviewed By: ajkr
Differential Revision: D29276127
Pulled By: jay-zhuang
fbshipit-source-id: f5b223e3c1fc6d821e100e3f3442bc70c1d50cf7
Summary:
This is part of an alternative approach to https://github.com/facebook/rocksdb/issues/8316.
Unlike that approach, this one relies on key-values getting processed one by one
during compaction, and does not involve persistence.
Specifically, the patch adds a class `BlobGarbageMeter` that can track the number
and total size of blobs in a (sub)compaction's input and output on a per-blob file
basis. This information can then be used to compute the amount of additional
garbage generated by the compaction for any given blob file by subtracting the
"outflow" from the "inflow."
Note: this patch only adds `BlobGarbageMeter` and associated unit tests. I plan to
hook up this class to the input and output of `CompactionIterator` in a subsequent PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8426
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D29242250
Pulled By: ltamasi
fbshipit-source-id: 597e50ad556540e413a50e804ba15bc044d809bb
Summary:
Tracing the MultiGet information including timestamp, keys, and CF_IDs to the trace file for analyzing and replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8421
Test Plan: make check, add test to trace_analyzer_test
Reviewed By: anand1976
Differential Revision: D29221195
Pulled By: zhichao-cao
fbshipit-source-id: 30c677d6c39ab31ef4bbdf7e0d1fa1fd79f295ff
Summary:
**Summary**:
2 new statistics counters are added to RocksDB: `MEMTABLE_PAYLOAD_BYTES_AT_FLUSH` and `MEMTABLE_GARBAGE_BYTES_AT_FLUSH`. The former tracks how many raw bytes of useful data are present on the memtable at flush time, whereas the latter is tracks how many of these raw bytes are considered garbage, meaning that they ended up not being imported on the SSTables resulting from the flush operations.
**Unit test**: run `make db_flush_test -j$(nproc); ./db_flush_test` to run the unit test.
This executable includes 3 tests, that test support and correct stat calculations for workloads with inserts, deletes, and DeleteRanges. The parameters are set such that the workloads are performed on a single memtable, and a single SSTable is created as a result of the flush operation. The flush operation is manually called in the test file. The tests verify that the values of these 2 statistics counters introduced in this PR can be exactly predicted, showing that we have a full understanding of the underlying operations.
**Performance testing**:
`./db_bench -statistics -benchmarks=fillrandom -num=10000000` repeated 10 times.
Timing done using "date" function in a bash script.
_Results_:
Original Rocksdb fork: mean 66.6 sec, std 1.18 sec.
This feature branch: mean 67.4 sec, std 1.35 sec.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8411
Reviewed By: akankshamahajan15
Differential Revision: D29150629
Pulled By: bjlemaire
fbshipit-source-id: 7b3c2e86d50c6aa34fa50fd134282eacb543a5b1
Summary:
This PR prepopulates warm/hot data blocks which are already in memory
into block cache at the time of flush. On a flush, the data block that is
in memory (in memtables) get flushed to the device. If using Direct IO,
additional IO is incurred to read this data back into memory again, which
is avoided by enabling newly added option.
Right now, this is enabled only for flush for data blocks. We plan to
expand this option to cover compactions in the future and for other types
of blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242
Test Plan: Add new unit test
Reviewed By: anand1976
Differential Revision: D28521703
Pulled By: akankshamahajan15
fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
Summary:
RocksDB logs a warning if WAL truncation on DB open fails. Its possible that on some file systems, truncation is not required and they would return ```Status::NotSupported()``` for ```ReopenWritableFile```. Don't log a warning in such cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8414
Reviewed By: akankshamahajan15
Differential Revision: D29181738
Pulled By: anand1976
fbshipit-source-id: 6e01e9117e1e4c1d67daa4dcee7fa59d06e057a7
Summary:
This is the next part of the ImmutableOptions cleanup. After changing the use of ImmutableCFOptions to ImmutableOptions, there were places in the code that had did something like "ImmutableOptions* immutable_cf_options", where "cf" referred to the "old" type.
This change simply renames the variables to match the current type. No new functionality is introduced.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8409
Reviewed By: pdillinger
Differential Revision: D29166248
Pulled By: mrambacher
fbshipit-source-id: 96de97f8e743f5c5160f02246e3ed8269556dc6f
Summary:
This reverts commit 9167ece586.
It was found to reliably trip a compaction picking conflict assertion in a MyRocks unit test. We don't understand why yet so reverting in the meantime.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8410
Test Plan: `make check -j48`
Reviewed By: jay-zhuang
Differential Revision: D29150300
Pulled By: ajkr
fbshipit-source-id: 2de8664f355d6da015e84e5fec2e3f90f49741c8
Summary:
- Added CreateFromString method to Env and FilesSystem to replace LoadEnv/Load. This method/signature is a precursor to making these classes extend Customizable.
- Added CreateFromSystem to Env. This method standardizes creating an Env from the environment variables. Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI. Now the code is more command/standardized.
- Added CreateFromFlags to Env. These method allows Env to be create from string options (such as GFLAGS options) in a more standard way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174
Reviewed By: zhichao-cao
Differential Revision: D28999603
Pulled By: mrambacher
fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb
Summary:
(1)Make CompactionService derived from Customizable by defining two extra functions that are needed, as described in customizable.h comment section
(2)Revise the MyTestCompactionService class in compaction_service_test.cc to satisfy the class inheritance requirement
(3)Specify namespace of ToString() in compaction_service_test.cc to avoid function collision with CompactionService's ancestor classes
Test did:
make -j24 compaction_service_test
./compaction_service_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8395
Reviewed By: jay-zhuang
Differential Revision: D29076068
Pulled By: hx235
fbshipit-source-id: c130100fa466939b3137e917f5fdc4b2ae8e37d4
Summary:
If the block Cache is full with strict_capacity_limit=false,
then our CacheEntryStatsCollector could be immediately evicted on
release, so iterating through column families with shared block cache
could trigger re-scan for each CF. This change fixes that problem by
pinning the CacheEntryStatsCollector from InternalStats so that it's not
evicted.
I had originally thought that this object could participate in LRU like
everything else, but even though a re-load+re-scan only touches memory,
it can be orders of magnitude more expensive than other cache misses.
One service in Facebook has scans that take ~20s over 100GB block cache
that is mostly 4KB entries. (The up-side of this bug and https://github.com/facebook/rocksdb/issues/8369 is that
we had a natural experiment on the effect on some service metrics even
with block cache scans running continuously in the background--a kind
of worst case scenario. Metrics like latency were not affected enough
to trigger warnings.)
Other smaller fixes:
20s is already a sizable portion of 600s stats dump period, or 180s
default max age to force re-scan, so added logic to ensure that (for
each block cache) we don't spend more than 0.2% of our background thread
time scanning it. Nevertheless, "foreground" requests for cache entry
stats (calls to `db->GetMapProperty(DB::Properties::kBlockCacheEntryStats)`)
are permitted to consume more CPU.
Renamed field to cache_entry_stats_ to match code style.
This change is intended for patching in 6.21 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8385
Test Plan:
unit test expanded to cover new logic (detect regression),
some manual testing with db_bench
Reviewed By: ajkr
Differential Revision: D29042759
Pulled By: pdillinger
fbshipit-source-id: 236faa902397f50038c618f50fbc8cf3f277308c
Summary:
Recalculate the total size after generate new sst files.
New generated files might have different size as the previous time which
could cause the test failed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8396
Test Plan:
```
gtest-parallel ./db_compaction_test
--gtest_filter=DBCompactionTest.ManualCompactionMax -r 1000 -w 100
```
Reviewed By: akankshamahajan15
Differential Revision: D29083299
Pulled By: jay-zhuang
fbshipit-source-id: 49d4bd619cefc0f9a1f452f8759ff4c2ba1b6fdb
Summary:
The subcompaction boundary picking logic does not currently guarantee
that all user keys that differ only by timestamp get processed by the same
subcompaction. This can cause issues with the `CompactionIterator` state
machine: for instance, one subcompaction that processes a subset of such KVs
might drop a tombstone based on the KVs it sees, while in reality the
tombstone might not have been eligible to be optimized out.
(See also https://github.com/facebook/rocksdb/issues/6645, which adjusted the way compaction inputs are picked for the
same reason.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8393
Test Plan: Ran `make check` and the crash test script with timestamps enabled.
Reviewed By: jay-zhuang
Differential Revision: D29071635
Pulled By: ltamasi
fbshipit-source-id: f6c72442122b4e581871e096fabe3876a9e8a5a6
Summary:
DBImpl::DumpStats is supposed to do this:
Dump DB stats to LOG
For each CF, dump CFStatsNoFileHistogram to LOG
For each CF, dump CFFileHistogram to LOG
Instead, due to a longstanding bug from 2017 (https://github.com/facebook/rocksdb/issues/2126), it would dump
CFStats, which includes both CFStatsNoFileHistogram and CFFileHistogram,
in both loops, resulting in near-duplicate output.
This fixes the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8380
Test Plan: Manual inspection of LOG after db_bench
Reviewed By: jay-zhuang
Differential Revision: D29017535
Pulled By: pdillinger
fbshipit-source-id: 3010604c4a629a80347f129cd746ce9b0d0cbda6
Summary:
In the current logic, any IO Error with retryable flag == true will be handled by the special logic and in most cases, StartRecoverFromRetryableBGIOError will be called to do the auto resume. If the NoSpace error with retryable flag is set during WAL write, it is mapped as a hard error, which will trigger the auto recovery. During the recover process, if write continues and append to the WAL, the write process sees that bg_error is set to HardError and it calls WriteStatusCheck(), which calls SetBGError() with Status (not IOStatus). This will redirect to the regular SetBGError interface, in which recovery_error_ will be set to the corresponding error. With the recovery_error_ set, the auto resume thread created in StartRecoverFromRetryableBGIOError will keep failing as long as user keeps trying to write.
To fix this issue. All the NoSpace error (no matter retryable flag is set or not) will be redirect to the regular SetBGError, and RecoverFromNoSpace() will do the recovery job which calls SstFileManager::StartErrorRecovery().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8376
Test Plan: make check and added the new testing case
Reviewed By: anand1976
Differential Revision: D29071828
Pulled By: zhichao-cao
fbshipit-source-id: 7171d7e14cc4620fdab49b7eff7a2fe9a89942c2
Summary:
This PR add support for Merge operation in Integrated BlobDB with base values(i.e DB::Put). Merged values can be retrieved through DB::Get, DB::MultiGet, DB::GetMergeOperands and Iterator operation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8292
Test Plan: Add new unit tests
Reviewed By: ltamasi
Differential Revision: D28415896
Pulled By: akankshamahajan15
fbshipit-source-id: e9b3478bef51d2f214fb88c31ed3c8d2f4a531ff
Summary:
Changed fprintf function to fputc in ApplyVersionEdit, and replaced null characters with whitespaces.
Added unit test in ldb_test.py - verifies that manifest_dump --verbose output is correct when keys and values containing null characters are inserted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8378
Reviewed By: pdillinger
Differential Revision: D29034584
Pulled By: bjlemaire
fbshipit-source-id: 50833687a8a5f726e247c38457eadc3e6dbab862
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360
Test Plan: add the test to lru_cache_test
Reviewed By: pdillinger
Differential Revision: D29006215
Pulled By: zhichao-cao
fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
Summary:
Logically, subcompactions process a key range [start, end); however, the way
this is currently implemented is that the `CompactionIterator` for any given
subcompaction keeps processing key-values until it actually outputs a key that
is out of range, which is then discarded. Instead of doing this, the patch
introduces a new type of internal iterator called `ClippingIterator` which wraps
another internal iterator and "clips" its range of key-values so that any KVs
returned are strictly in the [start, end) interval. This does eliminate a (minor)
inefficiency by stopping processing in subcompactions exactly at the limit;
however, the main motivation is related to BlobDB: namely, we need this to be
able to measure the amount of garbage generated by a subcompaction
precisely and prevent off-by-one errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8327
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D28761541
Pulled By: ltamasi
fbshipit-source-id: ee0e7229f04edabbc7bed5adb51771fbdc287f69
Summary:
In final polishing of https://github.com/facebook/rocksdb/issues/8297 (after most manual testing), I
broke my own caching layer by sanitizing an input parameter with
std::min(0, x) instead of std::max(0, x). I resisted unit testing the
timing part of the result caching because historically, these test
are either flaky or difficult to write, and this was not a correctness
issue. This bug is essentially unnoticeable with a small number
of column families but can explode background work with a
large number of column families.
This change fixes the logical error, removes some unnecessary related
optimization, and adds mock time/sleeps to the unit test to ensure we
can cache hit within the age limit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8369
Test Plan: added time testing logic to existing unit test
Reviewed By: ajkr
Differential Revision: D28950892
Pulled By: pdillinger
fbshipit-source-id: e79cd4ff3eec68fd0119d994f1ed468c38026c3b
Summary:
Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure.
Tested via two tests added to db_tests2.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351
Reviewed By: ajkr
Differential Revision: D28808894
Pulled By: ddevec
fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd
Summary:
This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase.
This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop.
The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are:
- files could have `largest_seqno > 0` forever making it impossible to tell they cannot contain any covering keys
- it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use case actually doesn't use Delete but instead calls Put with empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357
Test Plan: - make check
Reviewed By: jay-zhuang
Differential Revision: D28855340
Pulled By: ajkr
fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798
Summary:
I noticed ```openat``` system call with ```O_WRONLY``` flag and ```sync_file_range``` and ```truncate``` on WAL file when using ```rocksdb::DB::OpenForReadOnly``` by way of ```db_bench --readonly=true --benchmarks=readseq --use_existing_db=1 --num=1 ...```
Noticed in ```strace``` after seeing the last modification time of the WAL file change after each run (with ```--readonly=true```).
I think introduced by 7d7f14480e from https://github.com/facebook/rocksdb/pull/8122
I added a test to catch the WAL file being truncated and the modification time on it changing.
I am not sure if a mock filesystem with mock clock could be used to avoid having to sleep 1.1s.
The test could also check the set of files is the same and that the sizes are also unchanged.
Before:
```
[ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged
db/db_basic_test.cc:182: Failure
Expected equality of these values:
file_mtime_after_readonly_reopen
Which is: 1621611136
file_mtime_before_readonly_reopen
Which is: 1621611135
file is: 000010.log
[ FAILED ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms)
```
After:
```
[ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged
[ OK ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8313
Reviewed By: pdillinger
Differential Revision: D28656925
Pulled By: jay-zhuang
fbshipit-source-id: ea9e215cb53e7c830e76bc5fc75c45e21f12a1d6
Summary:
https://github.com/facebook/rocksdb/pull/8288 introduces a bug: SequenceIterWrapper should do next for seek key using internal key comparator rather than user comparator. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8328
Test Plan: Pass all existing tests
Reviewed By: ltamasi
Differential Revision: D28647263
fbshipit-source-id: 4081d684fd8a86d248c485ef8a1563c7af136447
Summary:
Error:
```
db/db_compaction_test.cc:5211:47: warning: The left operand of '*' is a garbage value
uint64_t total = (l1_avg_size + l2_avg_size * 10) * 10;
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8325
Test Plan: `$ make analyze`
Reviewed By: pdillinger
Differential Revision: D28620916
Pulled By: jay-zhuang
fbshipit-source-id: f6d58ab84eefbcc905cda45afb9522b0c6d230f8
Summary:
With Ribbon filter work and possible variance in actual bits
per key (or prefix; general term "entry") to achieve certain FP rates,
I've received a request to be able to track actual bits per key in
generated filters. This change adds a num_filter_entries table
property, which can be combined with filter_size to get bits per key
(entry).
This can vary from num_entries in at least these ways:
* Different versions of same key are only counted once in filters.
* With prefix filters, several user keys map to the same filter entry.
* A single filter can include both prefixes and user keys.
Note that FilterBlockBuilder::NumAdded() didn't do anything useful
except distinguish empty from non-empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323
Test Plan: basic unit test included, others updated
Reviewed By: jay-zhuang
Differential Revision: D28596210
Pulled By: pdillinger
fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376
Summary:
Fix a bug that for manual compaction, `max_compaction_bytes` is only
limit the SST files from input level, but not overlapped files on output
level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8269
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D28231044
Pulled By: jay-zhuang
fbshipit-source-id: 9d7d03004f30cc4b1b9819830141436907554b7c
Summary:
When a memtable is flushed, it will validate number of entries it reads, and compare the number with how many entries inserted into memtable. This serves as one sanity c\
heck against memory corruption. This change will also allow more counters to be added in the future for better validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288
Test Plan: Pass all existing tests
Reviewed By: ajkr
Differential Revision: D28369194
fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
Summary:
The test want to make sure these's no compaction during `AddFile`
(between `DBImpl::AddFile:MutexLock` and `DBImpl::AddFile:MutexUnlock`)
but the mutex could be unlocked by `EnterUnbatched()`.
Move the lock start point after bumping the ingest file number.
Also fix the dead lock when ASSERT fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8307
Reviewed By: ajkr
Differential Revision: D28479849
Pulled By: jay-zhuang
fbshipit-source-id: b3c50f66aa5d5f59c5c27f815bfea189c4cd06cb