Summary:
When table_options.prepopulate_block_cache is set to
BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and
table_options.partition_filters is also set true, then there is
segmentation failure when top level filter is fetched because its
entered with wrong type in cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263
Test Plan:
Updated unit tests;
Ran db_stress: make crash_test -j32
Reviewed By: pdillinger
Differential Revision: D32936566
Pulled By: akankshamahajan15
fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426
Summary:
If ignore_unsupported_options=true, then it is possible for MemTableRepFactory::CreateFromString to succeed without setting a result (result=nullptr). This would cause the original value to be overwritten with null and an error would be raised later when PrepareOptions is invoked.
Added unit test for this condition. Will add (in another PR unless required by reviewers) comparable tests for all of the other Customizable classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9273
Reviewed By: ltamasi
Differential Revision: D32990365
Pulled By: mrambacher
fbshipit-source-id: b150724c3f5ae7346357b3866244fd93466875c7
Summary:
1. Fix GetOptionsPtr for Wrapped (Inner() != nullptr) Customizable objects. This allows the inner options to be returned via this method.
2. Allow the option type map to be nullptr. This allows objects to be registered as options (for GetOptionsPtr) but not be used by the configuration methods.
Added tests as appropriate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9213
Reviewed By: zhichao-cao
Differential Revision: D32718882
Pulled By: mrambacher
fbshipit-source-id: 563203d1f006a2629060feb31c5dff9a233e1e83
Summary:
Saw error like this:
`Backup failed -- IO error: No such file or directory: While opening a
file for sequentially reading:
/dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or
directory`
Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.)
relies on no file deletions happening while its operating, which
means not only disabling (more) deletions, but ensuring any pending
deletions are completed. Two fixes related to this:
* There was a gap in several places between decrementing
pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where
the db mutex would be released and GetSortedWalFiles (and others) could
get false information that no deletions are pending.
* The fix to https://github.com/facebook/rocksdb/issues/8591 (disabling deletions in GetSortedWalFiles) seems
incomplete because it doesn't prevent pending deletions from occuring
during the operation (if deletions not already disabled, the case that
was to be fixed by the change).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9208
Test Plan:
existing tests (it's hard to write a test for interleavings
that are now excluded - this is what stress test is for)
Reviewed By: ajkr
Differential Revision: D32630675
Pulled By: pdillinger
fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178
Summary:
More follow-up to https://github.com/facebook/rocksdb/issues/9193 + https://github.com/facebook/rocksdb/issues/9188
* Even though we need to print ETA updates to avoid hitting the 10min
timeout, we need to avoid printing an update if there's no actual
progress, so that hung tests will timeout after 10 min rather than 5
hours.
* When there is a hung test, it's really annoying to track down which
test is hung, so if no progress is observed for 1 minute, we run ps once
to show what is running.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9201
Test Plan: manual and CircleCI
Reviewed By: jay-zhuang
Differential Revision: D32612028
Pulled By: pdillinger
fbshipit-source-id: 00f8ea70fc5fec9ede28ff74287d90fc73854aad
Summary:
Address some issues with https://github.com/facebook/rocksdb/issues/9188
* Internal CI doesn't render \r as anything, so use \n for "not
connected to terminal" case
* CircleCI apparently uses a pseudo-tty for output and although
rerdirect stdout (because of EAGAIN bug) we don't redirect stderr, so it
is detected as a terminal-connected case. Fix by redirecting stderr
also.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9193
Test Plan: manual, CI
Reviewed By: akankshamahajan15
Differential Revision: D32581128
Pulled By: pdillinger
fbshipit-source-id: 5ae7c3209128d8dbd4153c5b9fdb2b810e6deb2e
Summary:
Disable the QPS verification in test temporally, which causes the test failure due to different system delays.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9190
Test Plan: make check
Reviewed By: siying
Differential Revision: D32576289
Pulled By: zhichao-cao
fbshipit-source-id: 1df972e77dd82eed5af3462e5db5e141aadf8fae
Summary:
Original unit test fail to test the case of multi-cf mode switching to new manifest. The assertion
failure will trigger when the primary instance reopens and secondary continues to tail the
newly-created MANIFEST. Fix the assertion failure and update existing unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9143
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D32574233
Pulled By: riversand963
fbshipit-source-id: 857ddbe994019091276458abebcf8e2b65340468
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
Test Plan: Ran `make check` and the stress/crash test.
Reviewed By: riversand963
Differential Revision: D32565512
Pulled By: ltamasi
fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
Summary:
A bug in https://github.com/facebook/rocksdb/issues/9163 can cause checksum verification to fail if
parsing a properties block fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9189
Test Plan:
check_format_compatible.sh (never quite works locally but
this particular case seems fixed using variants of SHORT_TEST=1).
And added new unit test case.
Reviewed By: ajkr
Differential Revision: D32574626
Pulled By: pdillinger
fbshipit-source-id: 6fa5c8595737b71a3c3d011a52daf6d6c08715d7
Summary:
`ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes` are
deprecated, please try using user defined timestamp feature instead.
The feature is used to support differential snapshots, but not well
maintained (https://github.com/facebook/rocksdb/issues/6837, https://github.com/facebook/rocksdb/issues/8472) and the interface is not user friendly which
returns an internal key from the iterator. The user defined timestamp
feature is a more flexible feature to support similar usecase, please
switch to that if you have such usecase.
The deprecated feature will be removed in a future release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9091
Test Plan:
check LOG
Fix https://github.com/facebook/rocksdb/issues/9090
Reviewed By: ajkr
Differential Revision: D32071750
Pulled By: jay-zhuang
fbshipit-source-id: b882c4668dd1bf26ce03c4c192f1bba584bf6104
Summary:
Generating megabytes of successful test output has caused
issues / inconveniences for CI and internal sandcastle runs. This
changes their configuration to only print output from failed tests.
(Successful test output is still available in files under t/.)
This likewise changes default behavior of parallel `make check` as
a quick team poll showed interest in that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9188
Test Plan:
Seed some test failures and observe
* `make -j24 check` (new behavior)
* `PRINT_PARALLEL_OUTPUTS=1 make -j24 check` (old CI behavior)
* `QUIET_PARALLEL_TESTS=1 make -j24 check` (old manual run behavior)
Reviewed By: siying
Differential Revision: D32567392
Pulled By: pdillinger
fbshipit-source-id: 8d8fb64aebd16bca103b11e3bd1f13c488a69611
Summary:
1. Added a target for building a bundle jar for Sonatype Nexus - sometimes if the OSS Maven Central is misbehaving, it is quicker to upload a bundle to be processed for release.
2. Simplify the publish code by using a for-loop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9186
Reviewed By: jay-zhuang
Differential Revision: D32564469
Pulled By: ajkr
fbshipit-source-id: aaceac27e9143fb65b61dad2a46df346586672cd
Summary:
Track each SST's timestamp information as user properties https://github.com/facebook/rocksdb/issues/8959
Rockdb has supported user-defined timestamp feature. Application can specify a timestamp
when writing each k-v pair. When data flush from memory to disk file called SST files.
Each SST files consist of multiple data blocks and several metadata blocks. Among the metadata
blocks, there is one called Properties block that tracks some pre-defined properties of this SST file.
This PR is for collecting the properties of min and max timestamps of all keys in the file. With those
properties the SST file is more convenient to tell whether the keys in the SST have timestamps or not.
The changes involved are as follows:
1) Add a class TimestampTablePropertiesCollector to collect min/max timestamp when add keys to table,
The way TimestampTablePropertiesCollector use to compare timestamp of key should defined by
user by implementing the Comparator::CompareTimestamp function in the user defined comparator.
2) Add corresponding unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9093
Reviewed By: ltamasi
Differential Revision: D32406927
Pulled By: riversand963
fbshipit-source-id: 25922971b7e67bacf4d53a1fb67c4c5ddaa61573
Summary:
DBTest2.RateLimitedCompactionReads sometime shows following failure:
what(): db/db_test2.cc:3976: Failure
Expected equality of these values:
i + 1
Which is: 4
NumTableFilesAtLevel(0)
Which is: 0
The assertion itself doesn't appear to be correct. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9185
Test Plan: Removing an assertion shouldn't break anything.
Reviewed By: ajkr
Differential Revision: D32549530
fbshipit-source-id: 9993372d8af89161f903337a13f3e316e690a6b8
Summary:
After RocksDB 6.19 and before this PR, RocksDB FlushJob may pick more memtables to flush beyond synced WALs.
This can be problematic if there are multiple column families, since it can prematurely advance the flushed column
family's log_number. Should subsequent attempts fail to sync the latest WALs and the database goes
through a recovery, it may detect corrupted WAL number below the flushed column family's log number
and complain about column family inconsistency.
To fix, we record the maximum memtable ID of the column family being flushed. Then we call SyncClosedLogs()
so that all closed WALs at the time when memtable ID is recorded will be synced.
I also disabled a unit test temporarily due to reasons described in https://github.com/facebook/rocksdb/issues/9151
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9142
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D32299956
Pulled By: riversand963
fbshipit-source-id: 0da75888177d91905cf8c9d00605b73afb5970a7
Summary:
- Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect.
- Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179
Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency.
Reviewed By: riversand963
Differential Revision: D32488048
Pulled By: ajkr
fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310
Summary:
Add a new API in listener.h that notifies about IOErrors on
Read/Write/Append/Flush etc. The API reports about IOStatus, filename, Operation
name, offset and length.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9177
Test Plan: Added new unit tests
Reviewed By: anand1976
Differential Revision: D32470627
Pulled By: akankshamahajan15
fbshipit-source-id: 189a717033590ae227b3beae8b1e7e185e4cdc12
Summary:
* Checksums are now checked on meta blocks unless specifically
suppressed or not applicable (e.g. plain table). (Was other way around.)
This means a number of cases that were not checking checksums now are,
including direct read TableProperties in Version::GetTableProperties
(fixed in meta_blocks ReadTableProperties), reading any block from
PersistentCache (fixed in BlockFetcher), read TableProperties in
SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open,
maybe more.
* For that to work, I moved the global_seqno+TableProperties checksum
logic to the shared table/ code, because that is used by many utilies
such as SstFileDumper.
* Also for that to work, we have to know when we're dealing with a block
that has a checksum (trailer), so added that capability to Footer based
on magic number, and from there BlockFetcher.
* Knowledge of trailer presence has also fixed a problem where other
table formats were reading blocks including bytes for a non-existant
trailer--and awkwardly kind-of not using them, e.g. no shared code
checking checksums. (BlockFetcher compression type was populated
incorrectly.) Now we only read what is needed.
* Minimized code duplication and differing/incompatible/awkward
abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block
without parsing block handle)
* Moved some meta block handling code from table_properties*.*
* Moved some code specific to block-based table from shared table/ code
to BlockBasedTable class. The checksum stuff means we can't completely
separate it, but things that don't need to be in shared table/ code
should not be.
* Use unique_ptr rather than raw ptr in more places. (Note: you can
std::move from unique_ptr to shared_ptr.)
Without enhancements to GetPropertiesOfAllTablesTest (see below),
net reduction of roughly 100 lines of code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163
Test Plan:
existing tests and
* Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that
checksums are now checked on direct read of table properties by TableCache
(new test would fail before this change)
* Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test
putting table properties under old meta name
* Also generally enhanced that same test to actually test what it was
supposed to be testing already, by kicking things out of table cache when
we don't want them there.
Reviewed By: ajkr, mrambacher
Differential Revision: D32514757
Pulled By: pdillinger
fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9
Summary:
Fix the analyzer test failure caused by inaccurate timing wait. The wait time at different system might be different or cause the delay, now we do not accurately count the lines. Only in a very rare extreme case, test will ignore the part exceed the timing of 1 second.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9181
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D32511319
Pulled By: zhichao-cao
fbshipit-source-id: e694c8cb465c750cfa5a43dab3eff6707b9a11c8
Summary:
Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge.
**Context:**
Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`.
The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option.
**Summary:**
- Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction:
- hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance),
- banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`),
- final filter (i.e, `mutable_buf`'s size).
- Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table.
- Added option fo run `filter_bench` with this memory reservation feature
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073
Test Plan:
- Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory`
- To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around **3000 - 6000 ms** and others take around **1500 - 2000 ms**, in total adding **20000 - 25000 ms** to the test suit running locally
- Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter
- Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over **20** run as below:
- FastLocalBloom
- baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 29.56295** (DEBUG_LEVEL=1), **29.98153** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`:
- **Build avg ns/key: 30.99046** (DEBUG_LEVEL=1), **30.48867** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 31.146975** (DEBUG_LEVEL=1), **30.08165** (DEBUG_LEVEL=0)
- Ribbon128
- baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 129.17585** (DEBUG_LEVEL=1), **130.5225** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `:
- **Build avg ns/key: 131.61645** (DEBUG_LEVEL=1), **132.98075** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 52.032965** (DEBUG_LEVEL=1), **52.597825** (DEBUG_LEVEL=0)
- And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console.
Reviewed By: pdillinger
Differential Revision: D31991348
Pulled By: hx235
fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e
Summary:
Using deps for running blackbox and whitebox allows them to be
parallelized, which doesn't seem to be working well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9180
Test Plan: make -j24 crash_test
Reviewed By: siying
Differential Revision: D32500851
Pulled By: pdillinger
fbshipit-source-id: 364288c8d023b93e7ca2724ea40edae2f4eb0407
Summary:
`pthread_setname_np()` fails on attempts to assign oversized names like
"rocksdb:bottom10", which resulted in some thread name updates being
lost. We do not need the ID suffix so I removed it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9165
Test Plan:
```
$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -max_background_flushes=123 -max_background_compactions=456 -num_bottom_pri_threads=789 -duration=60
```
While above is running:
```
$ ps -o 'comm' -Lp `pidof db_bench` | grep '^rocksdb:' | sort | uniq -c
789 rocksdb:bottom
123 rocksdb:high
456 rocksdb:low
```
Reviewed By: pdillinger
Differential Revision: D32415077
Pulled By: ajkr
fbshipit-source-id: a0e013101e26a78bc5eca73509293ef4bf22254f
Summary:
Add the 3 read bytes counter to the Statistic, which will be used by storage tiering and get the information for files with different temperature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9123
Test Plan: added new testing cases.
Reviewed By: siying
Differential Revision: D32154745
Pulled By: zhichao-cao
fbshipit-source-id: b7905d6dae469a72428742364ec07b634b6f15da
Summary:
Move the 'macosx-version-min' arg to the front of PLATFORM_SHARED_LDFLAGS so that it doesn't get concatenated with the library name. Fixes https://github.com/facebook/rocksdb/issues/9146
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9149
Reviewed By: mrambacher
Differential Revision: D32396101
Pulled By: pdillinger
fbshipit-source-id: aefcf53384e64d399049f158779acc3a4e54a8fe
Summary:
We have three layers of block cache that often use the same key
but map to different physical data:
* BlockBasedTableOptions::block_cache
* BlockBasedTableOptions::block_cache_compressed
* BlockBasedTableOptions::persistent_cache
If any two of these happen to share an underlying implementation and key
space (insertion into one shows up in another), then memory safety is
broken. The simplest case is block_cache == block_cache_compressed.
(Credit mrambacher for asking about this case in a review.)
With this change, we explicitly check for overlap and preemptively and
safely fail with a Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9172
Test Plan: test added. Crashes without new check
Reviewed By: anand1976
Differential Revision: D32465659
Pulled By: pdillinger
fbshipit-source-id: 3876b45b6dce6167e5a7a642725ddc86b96f8e40
Summary:
When defining a template class, the constructor should be specified
simply using the class name; it does not take template arguments.a
Apparently older versions of gcc and clang did not complain about this
syntax, but gcc 11.x and recent versions of clang both complain about
this file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9173
Test Plan:
When building with platform010 I got compile errors in this file both
in `mode/dev` (clang) and in `mode/opt-gcc`. This diff fixes the
compile failures.
Reviewed By: ajkr
Differential Revision: D32455881
Pulled By: simpkins
fbshipit-source-id: 0682910d9e2cdade94ce1e77973d47ac04d9f7e2
Summary:
* Parallel `make check` would pass if a test binary failed to list gtest
tests. This is now likely to report as a failure.
* Crazy perl was generating some extra incorrect test names causing
extra files and binary invocations. Fixed with cleaner awk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9160
Test Plan:
For first part, add an 'assert(false);' to start of hash_test main and
see 'make check' pass before, and fail after.
For second part, inspect t/ directory before vs. after. Number of
executed tests is same:
$ cat log* | grep 'PASSED.*test' | awk '{ tot += $4; } END { print tot; }'
10469
Reviewed By: ajkr
Differential Revision: D32372006
Pulled By: pdillinger
fbshipit-source-id: 185b3db2b67e3f9198eb75322e4d0493e4fc1beb
Summary:
**Context:**
Some existing internal calls of `GenericRateLimiter::Request()` in backupable_db.cc and newly added internal calls in https://github.com/facebook/rocksdb/pull/8722/ do not make sure `bytes <= GetSingleBurstBytes()` as required by rate_limiter https://github.com/facebook/rocksdb/blob/master/include/rocksdb/rate_limiter.h#L47.
**Impacts of this bug include:**
(1) In debug build, when `GenericRateLimiter::Request()` requests bytes greater than `GenericRateLimiter:: kMinRefillBytesPerPeriod = 100` byte, process will crash due to assertion failure. See https://github.com/facebook/rocksdb/pull/9063#discussion_r737034133 and for possible scenario
(2) In production build, although there will not be the above crash due to disabled assertion, the bug can lead to a request of small bytes being blocked for a long time by a request of same priority with insanely large bytes from a different thread. See updated https://github.com/facebook/rocksdb/wiki/Rate-Limiter ("Notice that although....the maximum bytes that can be granted in a single request have to be bounded...") for more info.
There is an on-going effort to move rate-limiting to file wrapper level so rate limiting in `BackupEngine` and this PR might be made obsolete in the future.
**Summary:**
- Implemented loop-calling `GenericRateLimiter::Request()` with `bytes <= GetSingleBurstBytes()` as a static private helper function `BackupEngineImpl::LoopRateLimitRequestHelper`
-- Considering make this a util function in `RateLimiter` later or do something with `RateLimiter::RequestToken()`
- Replaced buggy internal callers with this helper function wherever requested byte is not pre-limited by `GetSingleBurstBytes()`
- Removed the minimum refill bytes per period enforced by `GenericRateLimiter` since it is useless and prevents testing `GenericRateLimiter` for extreme case with small refill bytes per period.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9063
Test Plan:
- Added a new test that failed the assertion before this change and now passes
- It exposed bugs in [the write during creation in `CopyOrCreateFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2034-L2043)), [the read of table properties in `GetFileDbIdentities()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2372-L2378)), [some read of metadata in `BackupMeta::LoadFromFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2726))
- Passing Existing tests
Reviewed By: ajkr
Differential Revision: D31824535
Pulled By: hx235
fbshipit-source-id: d2b3dea7a64e2a4b1e6a59fca322f0800a4fcbcc
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162
Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
conflict checking. If user-defined timestamp is enabled, it should perform
conflict checking based on timestamps too.
Update TransactionUtil::CheckKey-related methods to verify the timestamp of the
latest version of a key is smaller than the read timestamp. Note that
CheckKeysForConflict() is not updated since it's used only by optimistic
transaction, and we do not plan to update it in this upcoming batch of diffs.
Existing GetLatestSequenceForKey() returns the sequence of the latest
version of a specific user key. Since we support user-defined timestamp, we
need to update this method to also return the timestamp (if enabled) of the
latest version of the key. This will be needed for snapshot validation.
Reviewed By: ltamasi
Differential Revision: D31567960
fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
Summary:
This makes it easier to debug with tools like `ps`. The change only
applies to builds with glibc 2.30+ and _GNU_SOURCE extensions enabled.
We could adopt it in more cases by using the syscall but this is enough
for our build.
Replaces https://github.com/facebook/rocksdb/issues/2973.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9164
Test Plan:
- ran some benchmarks and correlated logged thread IDs with those shown by `ps -L`.
- verified no noticeable regression in throughput for log heavy (more than 700k log lines and over 5k / second) scenario.
Benchmark command:
```
$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom -compression_type=none -max_bytes_for_level_multiplier=2 -write_buffer_size=262144 -num_levels=7 -max_bytes_for_level_base=2097152 -target_file_size_base=524288 -level_compaction_dynamic_level_bytes=true -max_background_jobs=12 -num=20000000
```
Results before: 15.9MB/s, 15.8MB/s, 16.0MB/s
Results after: 16.3MB/s, 16.3MB/s, 15.8MB/s
- Rely on CI to test the fallback behavior
Reviewed By: riversand963
Differential Revision: D32399660
Pulled By: ajkr
fbshipit-source-id: c24d44fdf7782faa616ef0a0964eaca3539d9c24
Summary:
I was unable to figure out the behavior by reading the old doc so attempted to
write it differently.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9154
Reviewed By: riversand963
Differential Revision: D32338843
Pulled By: ajkr
fbshipit-source-id: e1e67720cd92572b195583e5ea2c592180d4fefd
Summary:
Implement the Name() method in FileSystemWrapper, since https://github.com/facebook/rocksdb/issues/8649 removed it and it can cause compilation failures. We can deprecate it in RocksDB 7.0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9156
Reviewed By: riversand963
Differential Revision: D32363977
Pulled By: anand1976
fbshipit-source-id: 1e5a2fec2ab0649255720d89abf5bac26bb64ded
Summary:
RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB.
This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block.
1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer), readahead_size will decrease by 8KB.
2. It maintains readahead_size for L1 - Ln levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056
Test Plan:
Added new unit tests
Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled.
Reviewed By: anand1976
Differential Revision: D31773640
Pulled By: akankshamahajan15
fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
Summary:
Track per-SST user-defined timestamp information in MANIFEST https://github.com/facebook/rocksdb/issues/8957
Rockdb has supported user-defined timestamp feature. Application can specify a timestamp
when writing each k-v pair. When data flush from memory to disk file called SST files, file
creation activity will commit to MANIFEST. This commit is for tracking timestamp info in the
MANIFEST for each file. The changes involved are as follows:
1) Track max/min timestamp in FileMetaData, and fix invoved codes.
2) Add NewFileCustomTag::kMinTimestamp and NewFileCustomTag::kMinTimestamp in
NewFileCustomTag ( in the kNewFile4 part ), and support invoved codes such as
VersionEdit Encode and Decode etc.
3) Add unit test code for VersionEdit EncodeDecodeNewFile4, and fix invoved test codes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9092
Reviewed By: ajkr, akankshamahajan15
Differential Revision: D32252323
Pulled By: riversand963
fbshipit-source-id: d2642898d6e3ad1fef0eb866b98045408bd4e162