Summary:
After RocksDB 6.19 and before this PR, RocksDB FlushJob may pick more memtables to flush beyond synced WALs.
This can be problematic if there are multiple column families, since it can prematurely advance the flushed column
family's log_number. Should subsequent attempts fail to sync the latest WALs and the database goes
through a recovery, it may detect corrupted WAL number below the flushed column family's log number
and complain about column family inconsistency.
To fix, we record the maximum memtable ID of the column family being flushed. Then we call SyncClosedLogs()
so that all closed WALs at the time when memtable ID is recorded will be synced.
I also disabled a unit test temporarily due to reasons described in https://github.com/facebook/rocksdb/issues/9151
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9142
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D32299956
Pulled By: riversand963
fbshipit-source-id: 0da75888177d91905cf8c9d00605b73afb5970a7
Summary:
- Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect.
- Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179
Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency.
Reviewed By: riversand963
Differential Revision: D32488048
Pulled By: ajkr
fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310
Summary:
Add a new API in listener.h that notifies about IOErrors on
Read/Write/Append/Flush etc. The API reports about IOStatus, filename, Operation
name, offset and length.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9177
Test Plan: Added new unit tests
Reviewed By: anand1976
Differential Revision: D32470627
Pulled By: akankshamahajan15
fbshipit-source-id: 189a717033590ae227b3beae8b1e7e185e4cdc12
Summary:
* Checksums are now checked on meta blocks unless specifically
suppressed or not applicable (e.g. plain table). (Was other way around.)
This means a number of cases that were not checking checksums now are,
including direct read TableProperties in Version::GetTableProperties
(fixed in meta_blocks ReadTableProperties), reading any block from
PersistentCache (fixed in BlockFetcher), read TableProperties in
SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open,
maybe more.
* For that to work, I moved the global_seqno+TableProperties checksum
logic to the shared table/ code, because that is used by many utilies
such as SstFileDumper.
* Also for that to work, we have to know when we're dealing with a block
that has a checksum (trailer), so added that capability to Footer based
on magic number, and from there BlockFetcher.
* Knowledge of trailer presence has also fixed a problem where other
table formats were reading blocks including bytes for a non-existant
trailer--and awkwardly kind-of not using them, e.g. no shared code
checking checksums. (BlockFetcher compression type was populated
incorrectly.) Now we only read what is needed.
* Minimized code duplication and differing/incompatible/awkward
abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block
without parsing block handle)
* Moved some meta block handling code from table_properties*.*
* Moved some code specific to block-based table from shared table/ code
to BlockBasedTable class. The checksum stuff means we can't completely
separate it, but things that don't need to be in shared table/ code
should not be.
* Use unique_ptr rather than raw ptr in more places. (Note: you can
std::move from unique_ptr to shared_ptr.)
Without enhancements to GetPropertiesOfAllTablesTest (see below),
net reduction of roughly 100 lines of code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163
Test Plan:
existing tests and
* Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that
checksums are now checked on direct read of table properties by TableCache
(new test would fail before this change)
* Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test
putting table properties under old meta name
* Also generally enhanced that same test to actually test what it was
supposed to be testing already, by kicking things out of table cache when
we don't want them there.
Reviewed By: ajkr, mrambacher
Differential Revision: D32514757
Pulled By: pdillinger
fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9
Summary:
Fix the analyzer test failure caused by inaccurate timing wait. The wait time at different system might be different or cause the delay, now we do not accurately count the lines. Only in a very rare extreme case, test will ignore the part exceed the timing of 1 second.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9181
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D32511319
Pulled By: zhichao-cao
fbshipit-source-id: e694c8cb465c750cfa5a43dab3eff6707b9a11c8
Summary:
Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge.
**Context:**
Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`.
The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option.
**Summary:**
- Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction:
- hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance),
- banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`),
- final filter (i.e, `mutable_buf`'s size).
- Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table.
- Added option fo run `filter_bench` with this memory reservation feature
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073
Test Plan:
- Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory`
- To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around **3000 - 6000 ms** and others take around **1500 - 2000 ms**, in total adding **20000 - 25000 ms** to the test suit running locally
- Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter
- Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over **20** run as below:
- FastLocalBloom
- baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 29.56295** (DEBUG_LEVEL=1), **29.98153** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`:
- **Build avg ns/key: 30.99046** (DEBUG_LEVEL=1), **30.48867** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 31.146975** (DEBUG_LEVEL=1), **30.08165** (DEBUG_LEVEL=0)
- Ribbon128
- baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 129.17585** (DEBUG_LEVEL=1), **130.5225** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `:
- **Build avg ns/key: 131.61645** (DEBUG_LEVEL=1), **132.98075** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 52.032965** (DEBUG_LEVEL=1), **52.597825** (DEBUG_LEVEL=0)
- And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console.
Reviewed By: pdillinger
Differential Revision: D31991348
Pulled By: hx235
fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e
Summary:
Using deps for running blackbox and whitebox allows them to be
parallelized, which doesn't seem to be working well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9180
Test Plan: make -j24 crash_test
Reviewed By: siying
Differential Revision: D32500851
Pulled By: pdillinger
fbshipit-source-id: 364288c8d023b93e7ca2724ea40edae2f4eb0407
Summary:
`pthread_setname_np()` fails on attempts to assign oversized names like
"rocksdb:bottom10", which resulted in some thread name updates being
lost. We do not need the ID suffix so I removed it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9165
Test Plan:
```
$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -max_background_flushes=123 -max_background_compactions=456 -num_bottom_pri_threads=789 -duration=60
```
While above is running:
```
$ ps -o 'comm' -Lp `pidof db_bench` | grep '^rocksdb:' | sort | uniq -c
789 rocksdb:bottom
123 rocksdb:high
456 rocksdb:low
```
Reviewed By: pdillinger
Differential Revision: D32415077
Pulled By: ajkr
fbshipit-source-id: a0e013101e26a78bc5eca73509293ef4bf22254f
Summary:
Add the 3 read bytes counter to the Statistic, which will be used by storage tiering and get the information for files with different temperature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9123
Test Plan: added new testing cases.
Reviewed By: siying
Differential Revision: D32154745
Pulled By: zhichao-cao
fbshipit-source-id: b7905d6dae469a72428742364ec07b634b6f15da
Summary:
Move the 'macosx-version-min' arg to the front of PLATFORM_SHARED_LDFLAGS so that it doesn't get concatenated with the library name. Fixes https://github.com/facebook/rocksdb/issues/9146
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9149
Reviewed By: mrambacher
Differential Revision: D32396101
Pulled By: pdillinger
fbshipit-source-id: aefcf53384e64d399049f158779acc3a4e54a8fe
Summary:
We have three layers of block cache that often use the same key
but map to different physical data:
* BlockBasedTableOptions::block_cache
* BlockBasedTableOptions::block_cache_compressed
* BlockBasedTableOptions::persistent_cache
If any two of these happen to share an underlying implementation and key
space (insertion into one shows up in another), then memory safety is
broken. The simplest case is block_cache == block_cache_compressed.
(Credit mrambacher for asking about this case in a review.)
With this change, we explicitly check for overlap and preemptively and
safely fail with a Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9172
Test Plan: test added. Crashes without new check
Reviewed By: anand1976
Differential Revision: D32465659
Pulled By: pdillinger
fbshipit-source-id: 3876b45b6dce6167e5a7a642725ddc86b96f8e40
Summary:
When defining a template class, the constructor should be specified
simply using the class name; it does not take template arguments.a
Apparently older versions of gcc and clang did not complain about this
syntax, but gcc 11.x and recent versions of clang both complain about
this file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9173
Test Plan:
When building with platform010 I got compile errors in this file both
in `mode/dev` (clang) and in `mode/opt-gcc`. This diff fixes the
compile failures.
Reviewed By: ajkr
Differential Revision: D32455881
Pulled By: simpkins
fbshipit-source-id: 0682910d9e2cdade94ce1e77973d47ac04d9f7e2
Summary:
* Parallel `make check` would pass if a test binary failed to list gtest
tests. This is now likely to report as a failure.
* Crazy perl was generating some extra incorrect test names causing
extra files and binary invocations. Fixed with cleaner awk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9160
Test Plan:
For first part, add an 'assert(false);' to start of hash_test main and
see 'make check' pass before, and fail after.
For second part, inspect t/ directory before vs. after. Number of
executed tests is same:
$ cat log* | grep 'PASSED.*test' | awk '{ tot += $4; } END { print tot; }'
10469
Reviewed By: ajkr
Differential Revision: D32372006
Pulled By: pdillinger
fbshipit-source-id: 185b3db2b67e3f9198eb75322e4d0493e4fc1beb
Summary:
**Context:**
Some existing internal calls of `GenericRateLimiter::Request()` in backupable_db.cc and newly added internal calls in https://github.com/facebook/rocksdb/pull/8722/ do not make sure `bytes <= GetSingleBurstBytes()` as required by rate_limiter https://github.com/facebook/rocksdb/blob/master/include/rocksdb/rate_limiter.h#L47.
**Impacts of this bug include:**
(1) In debug build, when `GenericRateLimiter::Request()` requests bytes greater than `GenericRateLimiter:: kMinRefillBytesPerPeriod = 100` byte, process will crash due to assertion failure. See https://github.com/facebook/rocksdb/pull/9063#discussion_r737034133 and for possible scenario
(2) In production build, although there will not be the above crash due to disabled assertion, the bug can lead to a request of small bytes being blocked for a long time by a request of same priority with insanely large bytes from a different thread. See updated https://github.com/facebook/rocksdb/wiki/Rate-Limiter ("Notice that although....the maximum bytes that can be granted in a single request have to be bounded...") for more info.
There is an on-going effort to move rate-limiting to file wrapper level so rate limiting in `BackupEngine` and this PR might be made obsolete in the future.
**Summary:**
- Implemented loop-calling `GenericRateLimiter::Request()` with `bytes <= GetSingleBurstBytes()` as a static private helper function `BackupEngineImpl::LoopRateLimitRequestHelper`
-- Considering make this a util function in `RateLimiter` later or do something with `RateLimiter::RequestToken()`
- Replaced buggy internal callers with this helper function wherever requested byte is not pre-limited by `GetSingleBurstBytes()`
- Removed the minimum refill bytes per period enforced by `GenericRateLimiter` since it is useless and prevents testing `GenericRateLimiter` for extreme case with small refill bytes per period.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9063
Test Plan:
- Added a new test that failed the assertion before this change and now passes
- It exposed bugs in [the write during creation in `CopyOrCreateFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2034-L2043)), [the read of table properties in `GetFileDbIdentities()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2372-L2378)), [some read of metadata in `BackupMeta::LoadFromFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2726))
- Passing Existing tests
Reviewed By: ajkr
Differential Revision: D31824535
Pulled By: hx235
fbshipit-source-id: d2b3dea7a64e2a4b1e6a59fca322f0800a4fcbcc
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162
Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
conflict checking. If user-defined timestamp is enabled, it should perform
conflict checking based on timestamps too.
Update TransactionUtil::CheckKey-related methods to verify the timestamp of the
latest version of a key is smaller than the read timestamp. Note that
CheckKeysForConflict() is not updated since it's used only by optimistic
transaction, and we do not plan to update it in this upcoming batch of diffs.
Existing GetLatestSequenceForKey() returns the sequence of the latest
version of a specific user key. Since we support user-defined timestamp, we
need to update this method to also return the timestamp (if enabled) of the
latest version of the key. This will be needed for snapshot validation.
Reviewed By: ltamasi
Differential Revision: D31567960
fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
Summary:
This makes it easier to debug with tools like `ps`. The change only
applies to builds with glibc 2.30+ and _GNU_SOURCE extensions enabled.
We could adopt it in more cases by using the syscall but this is enough
for our build.
Replaces https://github.com/facebook/rocksdb/issues/2973.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9164
Test Plan:
- ran some benchmarks and correlated logged thread IDs with those shown by `ps -L`.
- verified no noticeable regression in throughput for log heavy (more than 700k log lines and over 5k / second) scenario.
Benchmark command:
```
$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom -compression_type=none -max_bytes_for_level_multiplier=2 -write_buffer_size=262144 -num_levels=7 -max_bytes_for_level_base=2097152 -target_file_size_base=524288 -level_compaction_dynamic_level_bytes=true -max_background_jobs=12 -num=20000000
```
Results before: 15.9MB/s, 15.8MB/s, 16.0MB/s
Results after: 16.3MB/s, 16.3MB/s, 15.8MB/s
- Rely on CI to test the fallback behavior
Reviewed By: riversand963
Differential Revision: D32399660
Pulled By: ajkr
fbshipit-source-id: c24d44fdf7782faa616ef0a0964eaca3539d9c24
Summary:
I was unable to figure out the behavior by reading the old doc so attempted to
write it differently.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9154
Reviewed By: riversand963
Differential Revision: D32338843
Pulled By: ajkr
fbshipit-source-id: e1e67720cd92572b195583e5ea2c592180d4fefd
Summary:
Implement the Name() method in FileSystemWrapper, since https://github.com/facebook/rocksdb/issues/8649 removed it and it can cause compilation failures. We can deprecate it in RocksDB 7.0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9156
Reviewed By: riversand963
Differential Revision: D32363977
Pulled By: anand1976
fbshipit-source-id: 1e5a2fec2ab0649255720d89abf5bac26bb64ded
Summary:
RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB.
This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block.
1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer), readahead_size will decrease by 8KB.
2. It maintains readahead_size for L1 - Ln levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056
Test Plan:
Added new unit tests
Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled.
Reviewed By: anand1976
Differential Revision: D31773640
Pulled By: akankshamahajan15
fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
Summary:
Track per-SST user-defined timestamp information in MANIFEST https://github.com/facebook/rocksdb/issues/8957
Rockdb has supported user-defined timestamp feature. Application can specify a timestamp
when writing each k-v pair. When data flush from memory to disk file called SST files, file
creation activity will commit to MANIFEST. This commit is for tracking timestamp info in the
MANIFEST for each file. The changes involved are as follows:
1) Track max/min timestamp in FileMetaData, and fix invoved codes.
2) Add NewFileCustomTag::kMinTimestamp and NewFileCustomTag::kMinTimestamp in
NewFileCustomTag ( in the kNewFile4 part ), and support invoved codes such as
VersionEdit Encode and Decode etc.
3) Add unit test code for VersionEdit EncodeDecodeNewFile4, and fix invoved test codes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9092
Reviewed By: ajkr, akankshamahajan15
Differential Revision: D32252323
Pulled By: riversand963
fbshipit-source-id: d2642898d6e3ad1fef0eb866b98045408bd4e162
Summary:
It seems that an incorrect native source file entry was introduced in https://github.com/facebook/rocksdb/pull/8999. For some reason it appears that CI was not run against that PR, and so the problem was not detected.
This PR fixes the problem by removing the invalid entry, allowing RocksJava to build correctly again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9147
Reviewed By: pdillinger
Differential Revision: D32300976
fbshipit-source-id: dbd763b806bacf0fc08f4deaf07c63d0a266c4cf
Summary:
Before this fix compilation with GCC 4.8.5 20150623 (Red Hat 4.8.5-36) would fail with the following error:
```
CC jls/db/db_impl/db_impl.o
In file included from ./env/file_system_tracer.h:8:0,
from ./file/random_access_file_reader.h:15,
from ./file/file_prefetch_buffer.h:15,
from ./table/format.h:13,
from ./table/internal_iterator.h:14,
from ./db/pinned_iterators_manager.h:12,
from ./db/range_tombstone_fragmenter.h:15,
from ./db/memtable.h:22,
from ./db/memtable_list.h:16,
from ./db/column_family.h:17,
from ./db/db_impl/db_impl.h:22,
from db/db_impl/db_impl.cc:9:
./include/rocksdb/file_system.h:108:8: error: unused parameter 'opts'
[-Werror=unused-parameter]
struct FileOptions : EnvOptions {
^
db/db_impl/db_impl.cc: In member function 'virtual rocksdb::Status
rocksdb::DBImpl::SetDBOptions(const
std::unordered_map<std::basic_string<char>, std::basic_string<char>
>&)':
db/db_impl/db_impl.cc:1230:36: note: synthesized method
'rocksdb::FileOptions& rocksdb::FileOptions::operator=(const
rocksdb::FileOptions&)' first required here
file_options_for_compaction_ = FileOptions(new_db_options);
^
CC jls/db/db_impl/db_impl_compaction_flush.o
cc1plus: all warnings being treated as errors
make[1]: *** [jls/db/db_impl/db_impl.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory `/rocksdb-local-build'
make: *** [rocksdbjavastatic] Error 2
Makefile:2222: recipe for target 'rocksdbjavastaticdockerarm64v8' failed
make: *** [rocksdbjavastaticdockerarm64v8] Error 2
```
This was detected on both ppc64le and arm64v8, however it does not seem to appear in the same GCC 4.8 version we use for x64 in CircleCI - https://app.circleci.com/pipelines/github/facebook/rocksdb/9691/workflows/c2a94367-14f3-4039-be95-325c34643d41/jobs/227906
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9144
Reviewed By: riversand963
Differential Revision: D32290770
fbshipit-source-id: c90a54ba2a618e1ff3660fff3f3368ab36c3c527
Summary:
For multiple versions (ts + seq) of the same user key, if they cross the boundary of `full_history_ts_low_`,
we should retain the version that is visible to the `full_history_ts_low_`. Namely, we keep the internal key
with the largest timestamp smaller than `full_history_ts_low`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9116
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D32261514
Pulled By: riversand963
fbshipit-source-id: e10f47c254c04c05261440051e4f50cb7d95474e
Summary:
Note: This PR is the 3rd PR of a bigger PR stack (https://github.com/facebook/rocksdb/issues/9073) and depends on the second PR (https://github.com/facebook/rocksdb/pull/9071). **See changes from this PR only 00447324d0**
Context:
pdillinger brought up a good [point](https://github.com/facebook/rocksdb/pull/9073#discussion_r741478309) about lacking RAII support for per cache reservation in `CacheReservationManager` when reviewing https://github.com/facebook/rocksdb/pull/9073.
To summarize the discussion, the current API `CacheReservationManager::UpdateCacheReservation()` requires callers to explicitly calculate and pass in a correct`new_mem_used` to release a cache reservation (if they don't want to rely on the clean-up during `CacheReservationManager`'s destruction - such as they want to release it earlier).
While this implementation has convenience in some use-case such as `WriteBufferManager`, where [reservation](https://github.com/facebook/rocksdb/blob/main/memtable/write_buffer_manager.cc#L69-L91) and [release](https://github.com/facebook/rocksdb/blob/main/memtable/write_buffer_manager.cc#L109-L129) amounts do not necessarily correspond symmetrically and thus a flexible `new_mem_used` inputing is needed, it can be prone to caller's calculation error as well as cause a mass of codes in releasing cache in other use-case such as filter construction, where reservation and release amounts do correspond symmetrically and many code paths requiring a cache release, as [pointed](https://github.com/facebook/rocksdb/pull/9073#discussion_r741478309) out by pdillinger.
Therefore we decided to provide a new API in `CacheReservationManager` to update reservation with better RAII support for per cache reservation, using a handle to manage the life time of that particular cache reservation.
- Added a new class `CacheReservationHandle`
- Added a new API `CacheReservationManager::MakeCacheReservation()` that outputs a `CacheReservationHandle` for managing the reservation
- Updated class comments to clarify two different cache reservation methods
Tests:
- Passing new tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9130
Reviewed By: pdillinger
Differential Revision: D32199446
Pulled By: hx235
fbshipit-source-id: 1cba7c636e5ecfb55b0c1e0c2d218cc9b5b30b4e
Summary:
Note: This PR is the 2nd PR of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073).
Context:
`CacheReservationManager::UpdateCacheReservation(std::size_t new_memory_used)` accepts an accumulated total memory used (e.g, used 10MB so far) instead of usage change (e.g, increase by 5 MB, decrease by 5 MB). It has benefits including consolidating API for increase and decrease as described in https://github.com/facebook/rocksdb/pull/8506.
However, not every `CacheReservationManager` user keeps track of this accumulated total memory usage. For example, Bloom/Ribbon Filter construction (e.g, [here](822d729fcd/table/block_based/filter_policy.cc (L587)) in https://github.com/facebook/rocksdb/pull/9073) does not while WriteBufferManager and compression dictionary buffering do.
Considering future users might or might not keep track of this counter and implementing this counter within `CacheReservationManager` is easy due to the passed-in `std::size_t new_memory_used` in calling `CacheReservationManager::UpdateCacheReservation(std::size_t new_memory_used)`, it is proposed to add a new API `CacheReservationManager::GetTotalMemoryUsage()`.
As noted in the API comments, since `CacheReservationManager` is NOT thread-safe, external synchronization is
needed in calling `UpdateCacheReservation()` if you want `GetTotalMemoryUsed()` returns the indeed latest memory used.
- Added and updated private counter `memory_used_` every time `CacheReservationManager::UpdateCacheReservation(std::size_t new_memory_used)` is called regardless if the call returns non-okay status
- Added `CacheReservationManager::GetTotalMemoryUsage()` to return `memory_used_`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9071
Test Plan:
- Passing new tests
- Passing existing tests
Reviewed By: ajkr
Differential Revision: D31887813
Pulled By: hx235
fbshipit-source-id: 9a09f0c8683822673260362894c878b61ee60ceb
Summary:
The individual commits in this PR should be self-explanatory.
All small and _very_ low-priority changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5896
Reviewed By: riversand963
Differential Revision: D18065108
Pulled By: mrambacher
fbshipit-source-id: 236b1a1d9d21f982cc08aa67027108dde5eaf280
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
Reviewed By: riversand963
Differential Revision: D32007222
Pulled By: anand1976
fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
Summary:
Add clarification/extension to comments on max_total_wal_size and the Java wrapper MaxTotalWalSize to better explain the effect of the option on log file sizes.
Closes https://github.com/facebook/rocksdb/issues/5789
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9108
Reviewed By: pdillinger
Differential Revision: D32066640
Pulled By: mrambacher
fbshipit-source-id: 7d5affc87e4119019054af9c884a2ea01d68f5b7
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9062
Real MySQL-style transactions in MyRocks uses SingleDelete, which is
missing in our existint MySQLStyleTransactionTest. Ths diff by lth fills the
gap in test coverage.
Reviewed By: lth
Differential Revision: D31813015
fbshipit-source-id: 196ad761de30ae9ea1f92257058dfc265f211892
Summary:
Otherwise a rebuild is not done if a RocksDB plugin header file is
changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9120
Test Plan:
Build RocksDB with a plugin.
Change a header file of the RocksDB plugin and rebuild.
Signed-off-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed By: riversand963
Differential Revision: D32223303
Pulled By: ajkr
fbshipit-source-id: 76d31b10fe915906edc181c7b6398a09b7d079ee
Summary:
…action
```
db/db_with_timestamp_basic_test.cc:2643: Failure
db_->CompactFiles(compact_opt, handles_[cf], collector->GetFlushedFiles(), static_cast<int>(kNumTimestamps - i))
Invalid argument: A compaction must contain at least one file.
```
Able to be reproduced by run multiple test in parallel.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9136
Test Plan:
```
gtest-parallel ./db_with_timestamp_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampCompressionSettings.PutAndGetWithCompaction/12 -r 100 -w 100
```
Reviewed By: riversand963
Differential Revision: D32197734
Pulled By: jay-zhuang
fbshipit-source-id: aeb0d6e9b37312f577e203ca81bb7a0f14d4e7ce
Summary:
The patch refactors and unifies the logic in `VersionBuilder::SaveBlobFilesTo`
and `VersionBuilder::GetMinOldestBlobFileNumber` by introducing a generic
helper that can "merge" the list of `BlobFileMetaData` in the base version with
the list of `MutableBlobFileMetaData` representing the updated state after
applying a sequence of `VersionEdit`s. This serves as groundwork for subsequent
changes that will enable us to determine whether a blob file is live after applying
a sequence of edits without calling `VersionBuilder::SaveTo`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9122
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D32151472
Pulled By: ltamasi
fbshipit-source-id: 11622b475866de823334b8bc21b0e99d913af97e
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105
The user contract of SingleDelete is that: a SingleDelete can only be issued to
a key that exists and has NOT been updated. For example, application can insert
one key `key`, and uses a SingleDelete to delete it in the future. The `key`
cannot be updated or removed using Delete.
In reality, especially when write-prepared transaction is being used, things
can get tricky. For example, a prepared transaction already writes `key` to the
memtable after a successful Prepare(). Afterwards, should the transaction
rollback, it will insert a Delete into the memtable to cancel out the prior
Put. Consider the following sequence of operations.
```
// operation sequence 1
Begin txn
Put(key)
Prepare()
Flush()
Rollback txn
Flush()
```
There will be two SSTs resulting from above. One of the contains a PUT, while
the second one contains a Delete. It is also known that releasing a snapshot
can lead to an L0 containing only a SD for a particular key. Consider the
following operations following the above block.
```
// operation sequence 2
db->Put(key)
db->SingleDelete(key)
Flush()
```
The operation sequence 2 can result in an L0 with only the SD.
Should there be a snapshot for conflict checking created before operation
sequence 1, then an attempt to compact the db may hit the assertion failure
below, because ikey_.type is Delete (from a rollback).
```
else if (clear_and_output_next_key_) {
assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex);
}
```
To fix the assertion failure, we can skip the SingleDelete if we detect an
earlier Delete in the same snapshot interval.
Reviewed By: ltamasi
Differential Revision: D32056848
fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748
Summary:
It is useful to add options.manual_wal_flush to db_bench
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9132
Test Plan: Run the benchamrk with the option.
Reviewed By: ltamasi
Differential Revision: D32188060
fbshipit-source-id: a70835d3cad0f30095218dfda1daff0a432892e5
Summary:
Note: This PR is the 1st part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073).
Context:
Previously, the payload (i.e, filter data) within `BlockBasedTableBuilder::Rep::FilterBlockBuilder` object is not deallocated until `BlockBasedTableBuilder` is deallocated, despite it is no longer useful after its related `filter_content` being written.
- Transferred the payload (i.e, the filter data) out of `BlockBasedTableBuilder::Rep::FilterBlockBuilder` object
- For PartitionedFilter:
- Unified `filters` and `filter_gc` lists into one `std::deque<FilterEntry> filters` by adding a new field `last_filter_entry_key` and storing the `std::unique_ptr filter_data` with the `Slice filter` in the same entry
- Reset `last_filter_data` in the case where `filters` is empty, which should be as by then we would've finish using all the `Slice filter`
- Deallocated the payload by going out of scope as soon as we're done with using the `filter_content` associated with the payload
- This is an internal interface change at the level of `FilterBlockBuilder::Finish()`, which leads to touching the inherited interface in `BlockBasedFilterBlockBuilder`. But for that, the payload transferring is ignored.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9070
Test Plan: - The main focus is to catch segment fault error during `FilterBlockBuilder::Finish()` and `BlockBasedTableBuilder::Finish()` and interface mismatch. Relying on existing CI tests is enough as `assert(false)` was temporarily added to verify the new logic of transferring ownership indeed run
Reviewed By: pdillinger
Differential Revision: D31884933
Pulled By: hx235
fbshipit-source-id: f73ecfbea13788d4fc058013ace27230110b52f4
Summary:
Context:
Surprisingly, there isn't any sanitization against negative `int64_t bytes` in `GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri, Statistics* stats)`. A negative `bytes` can be passed in and incorrectly increases `available_bytes_` by subtracting the negative `bytes` from `available_bytes_`, such as [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L138) and [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L283), which are incorrect behaviors.
- Sanitized negative request bytes by rounding it up to 0
- Added notes to public and internal API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9112
Test Plan: - Rely on existing tests
Reviewed By: ajkr
Differential Revision: D32085364
Pulled By: hx235
fbshipit-source-id: b1b6066b2dd5ffc7bcbfb07069ca65a33578251b
Summary:
To prepare for adding checksum to footer and "context aware"
checksums. This also brings closely related code much closer together.
Recently added `BlockBasedTableBuilder::ComputeBlockTrailer` for testing
is made obsolete in the refactoring, as testing the checksums can happen
at a lower level of abstraction.
Also now checking for unrecognized checksum type on reading footer,
rather than later on use.
Also removed an obsolete function delcaration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9113
Test Plan:
existing tests worked before refactoring to remove
`ComputeBlockTrailer`. And then refactored+improved tests using it.
Reviewed By: mrambacher
Differential Revision: D32090149
Pulled By: pdillinger
fbshipit-source-id: 2879da683c1498ea85a3b70dace9b6d9f6b47b6e
Summary:
Revert "Temporarily disable s390x+cmake* Travis jobs (https://github.com/facebook/rocksdb/issues/9095)"
This reverts commit f2d11b3fdc.
I have now uploaded the CMake deb for s390x provided by jonathan-albrecht-ibm to our S3 bucket.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9110
Reviewed By: ajkr
Differential Revision: D32082903
Pulled By: riversand963
fbshipit-source-id: b7243d19fc133e665a8654e3b528c4f53d5b11d1