Summary:
A current limitation of backups is that you don't know the
exact database state of when the backup was taken. With this new
feature, you can at least inspect the backup's DB state without
restoring it by opening it as a read-only DB.
Rather than add something like OpenAsReadOnlyDB to the BackupEngine API,
which would inhibit opening stackable DB implementations read-only
(if/when their APIs support it), we instead provide a DB name and Env
that can be used to open as a read-only DB.
Possible follow-up work:
* Add a version of GetBackupInfo for a single backup.
* Let CreateNewBackup return the BackupID of the newly-created backup.
Implementation details:
Refactored ChrootFileSystem to split off new base class RemapFileSystem,
which allows more general remapping of files. We use this base class to
implement BackupEngineImpl::RemapSharedFileSystem.
To minimize API impact, I decided to just add these fields `name_for_open`
and `env_for_open` to those set by GetBackupInfo when
include_file_details=true. Creating the RemapSharedFileSystem adds a bit
to the memory consumption, perhaps unnecessarily in some cases, but this
has been mitigated by (a) only initialize the RemapSharedFileSystem
lazily when GetBackupInfo with include_file_details=true is called, and
(b) using the existing `shared_ptr<FileInfo>` objects to hold most of the
mapping data.
To enhance API safety, RemapSharedFileSystem is wrapped by new
ReadOnlyFileSystem which rejects any attempts to write. This uncovered a
couple of places in which DB::OpenForReadOnly would write to the
filesystem, so I fixed these. Added a release note because this affects
logging.
Additional minor refactoring in backupable_db.cc to support the new
functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142
Test Plan:
new test (run with ASAN and UBSAN), added to stress test and
ran it for a while with amplified backup_one_in
Reviewed By: ajkr
Differential Revision: D27535408
Pulled By: pdillinger
fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148
Summary:
According to https://github.com/facebook/rocksdb/issues/5907, each filter partition "should include the bloom of the prefix of the last
key in the previous partition" so that SeekForPrev() in prefix mode can return correct result.
The prefix of the last key in the previous partition does not necessarily have the same prefix
as the first key in the current partition. Regardless of the first key in current partition, the
prefix of the last key in the previous partition should be added. The existing code, however,
does not follow this. Furthermore, there is another issue: when finishing current filter partition,
`FullFilterBlockBuilder::AddPrefix()` is called for the first key in next filter partition, which effectively
overwrites `last_prefix_str_` prematurely. Consequently, when the filter block builder proceeds
to the next partition, `last_prefix_str_` will be the prefix of its first key, leaving no way of adding
the bloom of the prefix of the last key of the previous partition.
Prefix extractor is FixedLength.2.
```
[ filter part 1 ] [ filter part 2 ]
abc d
```
When SeekForPrev("abcd"), checking the filter partition will land on filter part 2 because "abcd" > "abc"
but smaller than "d".
If the filter in filter part 2 happens to return false for the test for "ab", then SeekForPrev("abcd") will build
incorrect iterator tree in non-total-order mode.
Also fix a unit test which starts to fail following this PR. `InDomain` should not fail due to assertion
error when checking on an arbitrary key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8137
Test Plan:
```
make check
```
Without this fix, the following command will fail pretty soon.
```
./db_stress --acquire_snapshot_one_in=10000 --avoid_flush_during_recovery=0 \
--avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 \
--batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=17 \
--bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 \
--checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \
--compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 \
--compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 \
--compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 \
--continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox \
--db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_blob_files=0 \
--enable_compaction_filter=0 --enable_pipelined_write=1 --file_checksum_impl=big --flush_one_in=1000000 \
--format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \
--get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=2 --ingest_external_file_one_in=0 \
--iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \
--log2_keys_per_lock=10 --long_running_snapshots=1 --mark_for_compaction_one_file_in=0 \
--max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000000 --max_key_len=3 \
--max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 \
--max_write_buffer_size_to_maintain=8388608 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False \
--nooverwritepercent=0 --open_files=500000 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=0 --pause_background_one_in=1000000 \
--periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --read_fault_one_in=0 --read_only=0 \
--readpercent=45 --recycle_log_file_num=0 --reopen=20 --secondary_catch_up_one_in=0 \
--snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 \
--sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False \
--target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 \
--top_level_index_pinning=0 --unpartitioned_pinning=1 --use_blob_db=0 --use_block_based_filter=0 \
--use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 \
--use_multiget=0 --use_ribbon_filter=0 --use_txn=0 --user_timestamp_size=8 --verify_checksum=1 \
--verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 \
--write_dbid_to_manifest=1 --writepercent=35
```
Reviewed By: pdillinger
Differential Revision: D27553054
Pulled By: riversand963
fbshipit-source-id: 60e391e4a2d8d98a9a3172ec5d6176b90ec3de98
Summary:
Add request_id in IODebugContext which will be populated by
underlying FileSystem for IOTracing purposes. Update IOTracer to trace
request_id in the tracing records. Provided API
IODebugContext::SetRequestId which will set the request_id and enable
tracing for request_id. The API hides the implementation and underlying
file system needs to call this API directly.
Update DB::StartIOTrace API and remove redundant Env* from the
argument as its not used and DB already has Env that is passed down to
IOTracer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8045
Test Plan: Update unit test.
Differential Revision: D26899871
Pulled By: akankshamahajan15
fbshipit-source-id: 56adef52ee5af0fb3060b607c3af1ec01635fa2b
Summary:
Return early in case there are zero data blocks when
`BlockBasedTableBuilder::EnterUnbuffered()` is called. This crash can
only be triggered by applying dictionary compression to SST files that
contain only range tombstones. It cannot be triggered by a low buffer
limit alone since we only consider entering unbuffered mode after
buffering a data block causing the limit to be breached, or `Finish()`ing the file. It also cannot
be triggered by a totally empty file because those go through
`Abandon()` rather than `Finish()` so unbuffered mode is never entered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8141
Test Plan: added a unit test that repro'd the "Floating point exception"
Reviewed By: riversand963
Differential Revision: D27495640
Pulled By: ajkr
fbshipit-source-id: a463cfba476919dc5c5c380800a75a86c31ffa23
Summary:
Added `TableProperties::{fast,slow}_compression_estimated_data_size`.
These properties are present in block-based tables when
`ColumnFamilyOptions::sample_for_compression > 0` and the necessary
compression library is supported when the file is generated. They
contain estimates of what `TableProperties::data_size` would be if the
"fast"/"slow" compression library had been used instead. One
limitation is we do not record exactly which "fast" (ZSTD or Zlib)
or "slow" (LZ4 or Snappy) compression library produced the result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139
Test Plan:
- new unit test
- ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression
Reviewed By: riversand963
Differential Revision: D27454338
Pulled By: ajkr
fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4
Summary:
In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_
and bg_flush_scheduled_ to drop to 0. Unschedule is called prior
to cancel any unscheduled flushes/compactions. It is assumed that
anything in the high priority is a flush, and anything in the low
priority pool is a compaction. This assumption, however, is broken when
the high-pri pool is full.
As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_
will remain > 0 and DB can be in hang state.
The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_`
inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB
`mutex_` will make the counts atomic in `Unschedule`.
Related discussion: https://github.com/facebook/rocksdb/issues/7928
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125
Test Plan: Added new test case which hangs without the fix.
Reviewed By: jay-zhuang
Differential Revision: D27390043
Pulled By: ajkr
fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110
Summary:
BackupEngine previously had unclear but strict concurrency
requirements that the API user must follow for safe use. Now we make
that clear, by separating operations into "Read," "Append," and "Write"
operations, and specifying which combinations are safe across threads on
the same BackupEngine object (previously none; now all, using a
read-write lock), and which are safe across different BackupEngine
instances open on the same backup_dir.
The changes to backupable_db.h should be backward compatible. It is
mostly about eliminating copies of what should be the same function and
(unsurprisingly) useful documentation comments were often placed on
only one of the two copies. With the re-organization, we are also
grouping different categories of operations. In the future we might add
BackupEngineReadAppendOnly, but that didn't seem necessary.
To mark API Read operations 'const', I had to mark some implementation
functions 'const' and some fields mutable.
Functional changes:
* Added RWMutex locking around public API functions to implement thread
safety on a single object. To avoid future bugs, this is another
internal class layered on top (removing many "override" in
BackupEngineImpl). It would be possible to allow more concurrency
between operations, rather than mutual exclusion, but IMHO not worth the
work.
* Fixed a race between Open() (Initialize()) and CreateNewBackup() for
different objects on the same backup_dir, where Initialize() could
delete the temporary meta file created during CreateNewBackup().
(This was found by the new test.)
Also cleaned up a couple of "status checked" TODOs, and improved a
checksum mismatch error message to include involved files.
Potential follow-up work:
* CreateNewBackup has an API wart because it doesn't tell you the
BackupID it just created, which makes it of limited use in a multithreaded
setting.
* We could also consider a Refresh() function to catch up to
changes made from another BackupEngine object to the same dir.
* Use a lock file to prevent multiple writer BackupEngines, but this
won't work on remote filesystems not supporting lock files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8115
Test Plan:
new mini-stress test in backup unit tests, run with gcc,
clang, ASC, TSAN, and UBSAN, 100 iterations each.
Reviewed By: ajkr
Differential Revision: D27347589
Pulled By: pdillinger
fbshipit-source-id: 28d82ed2ac672e44085a739ddb19d297dad14b15
Summary:
Previously it only applied to block-based tables generated by flush. This restriction
was undocumented and blocked a new use case. Now compression sampling
applies to all block-based tables we generate when it is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105
Test Plan: new unit test
Reviewed By: riversand963
Differential Revision: D27317275
Pulled By: ajkr
fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0
Summary:
`strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from 0deef031cb/folly/String.cpp (L457)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087
Reviewed By: mrambacher
Differential Revision: D27267151
Pulled By: jay-zhuang
fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071
Test Plan: make check, added new unit test.
Reviewed By: anand1976
Differential Revision: D27177043
Pulled By: zhichao-cao
fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
Summary:
In previous codebase, if WAL is used, all the retryable IO Error will be treated as hard error. So write is stalled. In this PR, the retryable IO error from WAL sync is separated from SST file flush io error. If WAL Sync is ok and retryable IO Error only happens during SST flush, the error is mapped to soft error. So user can continue insert to Memtable and append to WAL.
Resolve the bug that if WAL sync fails, the memtable status does not roll back due to calling PickMemtable early than calling and checking SyncClosedLog.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049
Test Plan: added new unit test, make check
Reviewed By: anand1976
Differential Revision: D26965529
Pulled By: zhichao-cao
fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410
Summary:
WriteController had a number of issues:
* It could introduce a delay of 1ms even if the write rate never exceeded the
configured delayed_write_rate.
* The DB-wide delayed_write_rate could be exceeded in a number of ways
with multiple column families:
* Wiping all pending delay "debts" when another column family joins
the delay with GetDelayToken().
* Resetting last_refill_time_ to (now + sleep amount) means each
column family can write with delayed_write_rate for large writes.
* Updating bytes_left_ for a partial refill without updating
last_refill_time_ would essentially give out random bonuses,
especially to medium-sized writes.
Now the code is much simpler, with these issues fixed. See comments in
the new code and new (replacement) tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064
Test Plan: new tests, better than old tests
Reviewed By: mrambacher
Differential Revision: D27064936
Pulled By: pdillinger
fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f
Summary:
Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume sucess; Histogram for auto resume retry count in each recovery call.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050
Test Plan: make check and add test to error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D26990565
Pulled By: zhichao-cao
fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
Summary:
Extend support to track blob files in SST File manager.
This PR notifies SstFileManager whenever a new blob file is created,
via OnAddFile and an obsolete blob file deleted via OnDeleteFile
and delete file via ScheduleFileDeletion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037
Test Plan: Add new unit tests
Reviewed By: ltamasi
Differential Revision: D26891237
Pulled By: akankshamahajan15
fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
Summary:
The new options are:
* compact0 - compact L0 into L1 using one thread
* compact1 - compact L1 into L2 using one thread
* flush - flush memtable
* waitforcompaction - wait for compaction to finish
These are useful for reproducible benchmarks to help get the LSM tree shape
into a deterministic state. I wrote about this at:
http://smalldatum.blogspot.com/2021/02/read-only-benchmarks-with-lsm-are.html
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8027
Reviewed By: riversand963
Differential Revision: D27053861
Pulled By: ajkr
fbshipit-source-id: 1646f35584a3db03740fbeb47d91c3f00fb35d6e
Summary:
This is for cases that do not meet the Facebook criteria for
SKIP (see new comments). Also made ROCKSDB_GTEST_{SKIP,BYPASS} print the
message because gtest doesn't ever seem to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8048
Test Plan: manual inspection of ./ribbon_test output, CI
Reviewed By: mrambacher
Differential Revision: D26953688
Pulled By: pdillinger
fbshipit-source-id: c914eaffe7d419db6ab90a193d474531e23582e5
Summary:
This API can be used for things like determining how much space
can be freed up by deleting a particular backup, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042
Test Plan:
validation of the API added to many existing backup unit
tests
Reviewed By: mrambacher
Differential Revision: D26936577
Pulled By: pdillinger
fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a
Summary:
New comment for share_files_with_checksum:
// Only used if share_table_files is set to true. Setting to false is
// DEPRECATED and potentially dangerous because in that case BackupEngine
// can lose data if backing up databases with distinct or divergent
// history, for example if restoring from a backup other than the latest,
// writing to the DB, and creating another backup. Setting to true (default)
// prevents these issues by ensuring that different table files (SSTs) with
// the same number are treated as distinct. See
// share_files_with_checksum_naming and ShareFilesNaming.
I have also removed interim option kFlagMatchInterimNaming, which is no
longer needed and was never needed for correct+compatible operation
(just performance).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020
Test Plan:
tests updated. Backward+forward compatibility verified with
SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup
options, and I manually verified shared_checksum in
/tmp/rocksdb_format_compatible_peterd/bak/current/ after run.
Reviewed By: ajkr
Differential Revision: D26786331
Pulled By: pdillinger
fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130
Summary:
Haven't seen any production issues with new Bloom filter and
it's now > 1 year old (added in 6.6.0).
Updated check_format_compatible.sh and HISTORY.md
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017
Test Plan: tests updated (or prior bugs fixed)
Reviewed By: ajkr
Differential Revision: D26762197
Pulled By: pdillinger
fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2
Summary:
I recently discovered the confusing, undocumented semantics of
Read() functions in the FileSystem and Env APIs. I have added
clarification to the best of my reverse-engineered understanding, and
made a note in HISTORY.md for implementors to check their
implementations, as a subtly non-adherent implementation could lead to
RocksDB quietly ignoring some portion of a file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8029
Test Plan: no code changes
Reviewed By: anand1976
Differential Revision: D26831698
Pulled By: pdillinger
fbshipit-source-id: 208f97ff6037bc13bb2ef360b987c2640c79bd03
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022
Test Plan: Ran `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D26801199
Pulled By: ltamasi
fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
Summary:
This PR adds SetBufferSize() to the WriteBufferManager object. This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows.
The buffer_size_ member variable is now atomic to match design of other changeable size_t members within WriteBufferManager.
This change is useful as is. However, this change is also essential if someone decides they wanted to enable db_write_buffer_size modifications through the DB::SetOptions() API, i.e. no waste taking this as is.
Any format / spacing changes are due to clang-format as required by check-in automation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961
Reviewed By: ajkr
Differential Revision: D26639075
Pulled By: akankshamahajan15
fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033
Summary:
The patch breaks down the "bytes written" (as well as the "number of output files")
compaction statistics into two, so the values are logged separately for table files
and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
This will also come in handy for fixing the write amplification statistics, which currently
do not consider the amount of data read from blob files during compaction. (This will
be fixed by an upcoming patch.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013
Test Plan: Ran `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D26742156
Pulled By: ltamasi
fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
Summary:
The checkpointing logic supports passing file level checksums
to the copy_file_cb callback function which is used by the backup code
for detecting corruption during file copies.
However, this is currently implemented only for table files.
This PR extends the checksum retrieval to blob files as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003
Test Plan: Add new test units
Reviewed By: ltamasi
Differential Revision: D26680701
Pulled By: akankshamahajan15
fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.
The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D26509280
Pulled By: riversand963
fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
Summary:
In the adapter class `WritableFileStringStreamAdapter`, which wraps WritableFile to be used for std::ostream, previouly only `std::endl` is considered a special case because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` will call `sputc()` and if we further check the internal implementation of `sputc()`, we will see it is
```
int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) { // put a character
return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch));
```
As we explicitly disabled buffering, _Pnavail() is always 0. Thus every write, not captured by xsputn, becomes an overflow.
When I run tests on Windows, I found not only `std::endl` will drop into this case, writing an unsigned long long will also call `os.put()` then followed by `sputc()` and eventually call `overflow()`. Therefore, instead of only checking `std::endl`, we should try to append other characters as well unless the appending operation fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991
Reviewed By: jay-zhuang
Differential Revision: D26615692
Pulled By: ajkr
fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file. The readahead starts at 8KB and doubles on every
additional read upto BlockBasedTable::kMaxAutoReadAheadSize which is
256*1024.
This PR adds a new option BlockBasedTableOptions::max_auto_readahead_size which
replaces BlockBasedTable::kMaxAutoReadAheadSize and the new option can be
configured.
If max_auto_readahead_size is set 0 then no implicit auto prefetching will
be done. If max_auto_readahead_size provided is less than
8KB (which is initial readahead size used by rocksdb in case of
auto-readahead), readahead size will remain same as max_auto_readahead_size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951
Test Plan: Add new unit test case.
Reviewed By: anand1976
Differential Revision: D26568085
Pulled By: akankshamahajan15
fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.
Related changes include:
- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
Reviewed By: pdillinger
Differential Revision: D26467994
Pulled By: ajkr
fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
Summary:
Added a "only_mutable_options" flag to the ConfigOptions. When set, the Configurable methods will only look at/update options that are marked as kMutable.
Fixed DB::SetOptions to allow for the update of any mutable TableFactory options. Fixes https://github.com/facebook/rocksdb/issues/7385.
Added tests for the new flag. Updated HISTORY.md
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936
Reviewed By: akankshamahajan15
Differential Revision: D26389646
Pulled By: mrambacher
fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
Summary:
The trace file record and payload encode is fixed, which requires complex backward compatibility resolving. This PR introduce a new trace file format, which makes it easier to add new entries to the payload and does not have backward compatible issues. V 0.1 is still supported in this PR. Added the tracing for lower_bound and upper_bound for iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977
Test Plan: make check. tested with old trace file in replay and analyzing.
Reviewed By: anand1976
Differential Revision: D26529948
Pulled By: zhichao-cao
fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76
Summary:
TransactionDB uses read callback to filter out un-committed data before
a snapshot. But `MultiGet()` API doesn't use that at all, which causes
returning unwanted data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963
Test Plan: Added unittest to reproduce
Reviewed By: anand1976
Differential Revision: D26455851
Pulled By: jay-zhuang
fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523
Test Plan: add new unit test, pass make check/ make asan_check
Reviewed By: pdillinger
Differential Revision: D24313271
Pulled By: zhichao-cao
fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
Summary:
Adds support for prefetching data in Ribbon queries,
which especially optimizes batched Ribbon queries for MultiGet
(~222ns/key to ~97ns/key) but also single key queries on cold memory
(~333ns to ~226ns) because many queries span more than one cache line.
This required some refactoring of the query algorithm, and there
does not appear to be a noticeable regression in "hot memory" query
times (perhaps from 48ns to 50ns).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889
Test Plan:
existing unit tests, plus performance validation with
filter_bench:
Each data point is the best of two runs. I saturated the machine
CPUs with other filter_bench runs in the background.
Before:
$ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 125.86
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
Prelim FP rate %: 0.951827
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 48.0111
Batched, prepared net ns/op: 222.384
Batched, unprepared net ns/op: 343.908
Skewed 50% in 1% net ns/op: 252.916
Skewed 80% in 20% net ns/op: 320.579
Random filter net ns/op: 332.957
After:
$ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 128.117
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
Prelim FP rate %: 0.951827
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 49.8812
Batched, prepared net ns/op: 97.1514
Batched, unprepared net ns/op: 222.025
Skewed 50% in 1% net ns/op: 197.48
Skewed 80% in 20% net ns/op: 212.457
Random filter net ns/op: 226.464
Bloom comparison, for reference:
$ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 35.3042
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
Prelim FP rate %: 0.965327
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 9.09931
Batched, prepared net ns/op: 34.21
Batched, unprepared net ns/op: 88.8564
Skewed 50% in 1% net ns/op: 139.75
Skewed 80% in 20% net ns/op: 181.264
Random filter net ns/op: 173.88
Reviewed By: jay-zhuang
Differential Revision: D26378710
Pulled By: pdillinger
fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69
Summary:
Added support for detecting plugins linked in the "plugin/" directory and building them from our Makefile in a standardized way. See "plugin/README.md" for details. An example of a plugin that can be built in this way can be found in https://github.com/ajkr/dedupfs.
There will be more to do in terms of making this process more convenient and adding support for CMake.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7918
Test Plan: my own plugin (https://github.com/ajkr/dedupfs) and also heard this patch worked with ZenFS.
Reviewed By: pdillinger
Differential Revision: D26189969
Pulled By: ajkr
fbshipit-source-id: 6624d4357d0ffbaedb42f0d12a3fcb737c78f758
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.
Fixes https://github.com/facebook/rocksdb/issues/7913.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929
Test Plan: unit tests covering the cases it's permitted/rejected
Reviewed By: ltamasi
Differential Revision: D26240254
Pulled By: ajkr
fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584
Test Plan: Run all existing tests.
Reviewed By: pdillinger
Differential Revision: D20626739
fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).
The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.
When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748
Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.
Reviewed By: pdillinger
Differential Revision: D25754492
Pulled By: ajkr
fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035
Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
- If the GIT branch is "clean" (no changes), the date of the last git commit
- If the branch is not clean, the current date
- Added APIs to access the "build information", rather than accessing the strings directly.
The build_version.cc file is now regenerated whenever the library objects are rebuilt.
Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866
Reviewed By: pdillinger
Differential Revision: D26086565
Pulled By: mrambacher
fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
Summary:
When retryable IO error occurs during compaction, it is mapped to soft error and set the BG error. However, auto resume is not called to clean the soft error since compaction will reschedule by itself. In this change, When retryable IO error occurs during compaction, BG error is not set. User will be informed the error via EventHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899
Test Plan: tested with error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D26094097
Pulled By: zhichao-cao
fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
This provides a workaround for two race conditions that will be fixed in
a more sophisticated way later. This PR:
(1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe.
(2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888
Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881
Reviewed By: jay-zhuang
Differential Revision: D25990891
Pulled By: ajkr
fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38
Summary:
I find that the `track_and_verify_wals_in_manifest` option was only removed from 6.15 branch's HISTORY, but still appears under 6.15 in master branch's HISTORY. It should be moved to 6.16 since that's when the feature should be available.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7874
Reviewed By: jay-zhuang
Differential Revision: D25935971
Pulled By: cheng-chang
fbshipit-source-id: fe8bf1ec111597f9207e109aa3be65f8f919f1fd
Summary:
When the --try_load_options is used in conjunction with the
--column_family option, ldb incorrectly sets the ColumnFamilyOptions for
that column family to defaults. This PR fixes that by retaining from the
OPTIONS file and applying command line overrides.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7847
Test Plan: Add a unit test in ldb_cmd_test
Reviewed By: ajkr
Differential Revision: D25874720
Pulled By: anand1976
fbshipit-source-id: 04bcf23b55e5a30b5b6a59b0e5cb4faef3da7429
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.
Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819
Reviewed By: ajkr
Differential Revision: D25837394
Pulled By: jay-zhuang
fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
Summary:
Add new API WriteBufferManager::dummy_entries_in_cache_usage() which reports the dummy entries size stored in cache to account for DataBlocks in WriteBufferManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7837
Test Plan: Updated test ./write_buffer_manager_test
Reviewed By: ajkr
Differential Revision: D25794312
Pulled By: akankshamahajan15
fbshipit-source-id: 197f5e8701e3dc57a7df72dab1735624f90daf4b
Summary:
In RocksDB, when IO error happens, the flags of IOStatus can be set. If the IOStatus is set as "File Scope IO Error", it indicate that the error is constrained in the file level. Since RocksDB does not continues write data to a file when any IO Error happens, File Scope IO Error can be treated the same as Retryable IO Error. Adding the logic to ErrorHandler::SetBGError to include the file scope IO Error in its error handling logic, which is the same as retryable IO Error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7840
Test Plan: added new unit tests in error_handler_fs_test. make check
Reviewed By: anand1976
Differential Revision: D25820481
Pulled By: zhichao-cao
fbshipit-source-id: 69cabd3d010073e064d6142ce1cabf341b8a6806
Summary:
Previously we only had a debug assertion to check the right generator was being used for verification. However a user hit a problem in production where their factory was creating the wrong generator for some files, leading to checksum mismatches. It would have been easier to debug if we verified in optimized builds that the generator with the proper name is used. This PR adds such verification.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7824
Reviewed By: zhichao-cao
Differential Revision: D25740254
Pulled By: ajkr
fbshipit-source-id: a6231521747605021bad3231484b5d4f99f4044f
Summary:
So that we can more easily get aggregate live table data such
as total filter, index, and data sizes.
Also adds ldb support for getting properties
Also fixed some missing/inaccurate related comments in db.h
For example:
$ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
rocksdb.aggregated-table-properties.data_size: 102871
rocksdb.aggregated-table-properties.filter_size: 0
rocksdb.aggregated-table-properties.index_partitions: 0
rocksdb.aggregated-table-properties.index_size: 2232
rocksdb.aggregated-table-properties.num_data_blocks: 100
rocksdb.aggregated-table-properties.num_deletions: 0
rocksdb.aggregated-table-properties.num_entries: 15000
rocksdb.aggregated-table-properties.num_merge_operands: 0
rocksdb.aggregated-table-properties.num_range_deletions: 0
rocksdb.aggregated-table-properties.raw_key_size: 288890
rocksdb.aggregated-table-properties.raw_value_size: 198890
rocksdb.aggregated-table-properties.top_level_index_size: 0
$ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
rocksdb.aggregated-table-properties-at-level1.data_size: 80909
rocksdb.aggregated-table-properties-at-level1.filter_size: 0
rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
rocksdb.aggregated-table-properties-at-level1.index_size: 1787
rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
$
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779
Test Plan: Added a test to ldb_test.py
Reviewed By: jay-zhuang
Differential Revision: D25653103
Pulled By: pdillinger
fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
Summary:
Primarily this change refactors the optimize_filters_for_memory
code for Bloom filters, based on malloc_usable_size, to also work for
Ribbon filters.
This change also replaces the somewhat slow but general
BuiltinFilterBitsBuilder::ApproximateNumEntries with
implementation-specific versions for Ribbon (new) and Legacy Bloom
(based on a recently deleted version). The reason is to emphasize
speed in ApproximateNumEntries rather than 100% accuracy.
Justification: ApproximateNumEntries (formerly CalculateNumEntry) is
only used by RocksDB for range-partitioned filters, called each time we
start to construct one. (In theory, it should be possible to reuse the
estimate, but the abstractions provided by FilterPolicy don't really
make that workable.) But this is only used as a heuristic estimate for
hitting a desired partitioned filter size because of alignment to data
blocks, which have various numbers of unique keys or prefixes. The two
factors lead us to prioritize reasonable speed over 100% accuracy.
optimize_filters_for_memory adds extra complication, because precisely
calculating num_entries for some allowed number of bytes depends on state
with optimize_filters_for_memory enabled. And the allocator-agnostic
implementation of optimize_filters_for_memory, using malloc_usable_size,
means we would have to actually allocate memory, many times, just to
precisely determine how many entries (keys) could be added and stay below
some size budget, for the current state. (In a draft, I got this
working, and then realized the balance of speed vs. accuracy was all
wrong.)
So related to that, I have made CalculateSpace, an internal-only API
only used for testing, non-authoritative also if
optimize_filters_for_memory is enabled. This simplifies some code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7774
Test Plan:
unit test updated, and for FilterSize test, range of tested
values is greatly expanded (still super fast)
Also tested `db_bench -benchmarks=fillrandom,stats -bloom_bits=10 -num=1000000 -partition_index_and_filters -format_version=5 [-optimize_filters_for_memory] [-use_ribbon_filter]` with temporary debug output of generated filter sizes.
Bloom+optimize_filters_for_memory:
1 Filter size: 197 (224 in memory)
134 Filter size: 3525 (3584 in memory)
107 Filter size: 4037 (4096 in memory)
Total on disk: 904,506
Total in memory: 918,752
Ribbon+optimize_filters_for_memory:
1 Filter size: 3061 (3072 in memory)
110 Filter size: 3573 (3584 in memory)
58 Filter size: 4085 (4096 in memory)
Total on disk: 633,021 (-30.0%)
Total in memory: 634,880 (-30.9%)
Bloom (no offm):
1 Filter size: 261 (320 in memory)
1 Filter size: 3333 (3584 in memory)
240 Filter size: 3717 (4096 in memory)
Total on disk: 895,674 (-1% on disk vs. +offm; known tolerable overhead of offm)
Total in memory: 986,944 (+7.4% vs. +offm)
Ribbon (no offm):
1 Filter size: 2949 (3072 in memory)
1 Filter size: 3381 (3584 in memory)
167 Filter size: 3701 (4096 in memory)
Total on disk: 624,397 (-30.3% vs. Bloom)
Total in memory: 690,688 (-30.0% vs. Bloom)
Note that optimize_filters_for_memory is even more effective for Ribbon filter than for cache-local Bloom, because it can close the unused memory gap even tighter than Bloom filter, because of 16 byte increments for Ribbon vs. 64 byte increments for Bloom.
Reviewed By: jay-zhuang
Differential Revision: D25592970
Pulled By: pdillinger
fbshipit-source-id: 606fdaa025bb790d7e9c21601e8ea86e10541912
Summary:
When ConcurrentTaskLimiter is enabled and there are too many outstanding compactions, BackgroundCompaction returns Status::Busy(), which shouldn't be treat as compaction failure.
This caused performance issue when outstanding compactions reached the limit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7739
Reviewed By: cheng-chang
Differential Revision: D25508319
Pulled By: ltamasi
fbshipit-source-id: 3b181b16ada0ca3393cfa3a7412985764e79c719
Summary:
Deprecate CalculateNumEntry and replace with
ApproximateNumEntries (better name) using size_t instead of int and
uint32_t, to minimize confusing casts and bad overflow behavior
(possible though probably not realistic). Bloom sizes are now explicitly
capped at max size supported by implementations: just under 4GiB for
fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This
hardening could help to set up for fuzzing.
Also, since RocksDB only uses this information as an approximation
for trying to hit certain sizes for partitioned filters, it's more important
that the function be reasonably fast than for it to be completely
accurate. It's hard enough to be 100% accurate for Ribbon (currently
reversing CalculateSpace) that adding optimize_filters_for_memory
into the mix is just not worth trying to be 100% accurate for num
entries for bytes.
Also:
- Cleaned up filter_policy.h to remove MSVC warning handling and
potentially unsafe use of exception for "not implemented"
- Correct the number of entries limit beyond which current Ribbon
implementation falls back on Bloom instead.
- Consistently use "num_entries" rather than "num_entry"
- Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially
obsolete from general implementation
BuiltinFilterBitsBuilder::CalculateNumEntries.
- Fix filter_bench to skip some tests that don't make sense when only
one or a small number of filters has been generated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726
Test Plan:
expanded existing unit tests for CalculateSpace /
ApproximateNumEntries. Also manually used filter_bench to verify Legacy and
fv=5 Bloom size caps work (much too expensive for unit test). Note that
the actual bits per key is below requested due to space cap.
$ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate
...
Total size (MB): 511.992
Bits/key stored: 16.777
...
$ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000
...
Total size (MB): 4096
Bits/key stored: 17.1799
...
$
Reviewed By: jay-zhuang
Differential Revision: D25239800
Pulled By: pdillinger
fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4
Summary:
Ensure that when direct IO is enabled and a compressed block cache is
configured, MultiGet inserts compressed data blocks into the compressed
block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7756
Test Plan: Add unit test to db_basic_test
Reviewed By: cheng-chang
Differential Revision: D25416240
Pulled By: anand1976
fbshipit-source-id: 75d57526370c9c0a45ff72651f3278dbd8a9086f
Summary:
When 2 phase commit is enabled, if there are prepared data in a WAL, the WAL should be kept, the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information is not written to MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570
Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds.
Reviewed By: riversand963
Differential Revision: D24394222
Pulled By: cheng-chang
fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D25015421
Pulled By: jay-zhuang
fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
Summary:
WAL may be truncated to an incomplete record due to crash while writing
the last record or corruption. In the former case, no hole will be
produced since no ACK'd data was lost. In the latter case, a hole could
be produced without this PR since we proceeded to recover the next WAL
as if nothing happened. This PR changes the record reading code to
always report a corruption for incomplete records in
`kPointInTimeRecovery` mode, and the upper layer will only ignore them
if the next WAL has consecutive seqnum (i.e., we are guaranteed no
hole).
While this solves the hole problem for the case of incomplete
records, the possibility is still there if the WAL is corrupted by
truncation to an exact record boundary. This PR also regresses how much data
can be recovered when writes are mixed with/without
`WriteOptions::disableWAL`, as then we can not distinguish between a
seqnum gap caused by corruption and a seqnum gap caused by a `disableWAL` write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7701
Test Plan:
Interestingly there already was a test for this case
(`DBWALTestWithParams.kPointInTimeRecovery`); it just had a typo bug in
the verification that prevented it from noticing holes in recovery.
Reviewed By: anand1976
Differential Revision: D25111765
Pulled By: ajkr
fbshipit-source-id: 5e330b13b1ee2b5be096cea9d0ff6075843e57b6
Summary:
Instead of using `EncodeFixed32` which always serialize a integer to
little endian, we should use the local machine's endianness when
populating a native data structure during options parsing.
Without this fix, `read_amp_bytes_per_bit` may be populated incorrectly
on big-endian machines.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7680
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D24999166
Pulled By: riversand963
fbshipit-source-id: dc603cff6e17f8fa32479ce6df93b93082e6b0c4
Summary:
An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667
Reviewed By: riversand963
Differential Revision: D24933360
Pulled By: ajkr
fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)
Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.
### Benchmarking
```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 18.7508
Random filter net ns/op: 258.246
Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 58.4523
Random filter net ns/op: 363.717
Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```
168.166 / 238.488 = 0.705 -> 29.5% space reduction
130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction)
### Working around a hashing "flaw"
bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate. The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant. Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))
As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)
TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.
### Other related changes
* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658
Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.
Reviewed By: jay-zhuang
Differential Revision: D24899349
Pulled By: pdillinger
fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
Summary:
Previously, even when `bottommost_compression_opts`'s `enabled` flag was set, it only took effect when
`bottommost_compression` was also set to something other than `kDisableCompressionOption`.
This wasn't documented and, if we kept the old behavior, it'd make
things complicated like the migration instructions in https://github.com/facebook/rocksdb/issues/7619. We can
simplify the API by making `bottommost_compression_opts` always take
effect when its `enabled` flag is set.
Fixes https://github.com/facebook/rocksdb/issues/7631.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7633
Reviewed By: ltamasi
Differential Revision: D24710358
Pulled By: ajkr
fbshipit-source-id: bbbdf9c1b53c63a4239d902cc3f5a11da1874647
Summary:
The Customizable class is an extension of the Configurable class and allows instances to be created by a name/ID. Classes that extend customizable can define their Type (e.g. "TableFactory", "Cache") and a method to instantiate them (TableFactory::CreateFromString). Customizable objects can be registered with the ObjectRegistry and created dynamically.
Future PRs will make more types of objects extend Customizable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6590
Reviewed By: cheng-chang
Differential Revision: D24841553
Pulled By: zhichao-cao
fbshipit-source-id: d0c2132bd932e971cbfe2c908ca2e5db30c5e155
Summary:
In `BuildTable()`, we call `builder->Finish()` before evaluating `builder->NeedCompact()`.
However, we call `builder->NeedCompact()` before `builder->Finish()` in compaction job. This can be wrong because the table properties collectors may rely on the success of `Finish()` to provide correct result for `NeedCompact()`.
Test plan (on devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7627
Reviewed By: ajkr
Differential Revision: D24728741
Pulled By: riversand963
fbshipit-source-id: 5a0dce244e14eb1106c4f87021e6bebca82b486e
Summary:
Existing API `VerifyChecksum()` allows application to verify sst files' block checksums.
Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new
API to verify sst files' file checksums.
```
// Compute table file checksums if applicable and compare with MANIFEST.
// Returns OK if no file has mismatching whole-file checksum.
Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/);
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D24436783
Pulled By: riversand963
fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3
Summary:
The filter query key should not contain timestamp. The timestamp is
stripped for Get(), but not MultiGet().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589
Reviewed By: riversand963
Differential Revision: D24494661
Pulled By: jay-zhuang
fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317
Summary:
In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it
for bottom level under the assumption that the vast majority of data is
stored there. At that time, there was no
such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just
hardcoded disabling dictionary compression in flush and compactions to
non-bottommost level. Now, we have users who generate all their files
through flush and are considering using dictionary compression.
To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to
additionally include flushed files and files generated by compaction to
a non-bottommost level. Users can still get the old behavior by moving
their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts`
and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619
Reviewed By: ltamasi
Differential Revision: D24665610
Pulled By: ajkr
fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c
Summary:
The recently reverted behavior changes were released to at least one
place internally, so we should mention the reverts in release notes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7617
Reviewed By: akankshamahajan15
Differential Revision: D24654343
Pulled By: ajkr
fbshipit-source-id: eb64b2797d8508cd95a2dc2698122c1be29ce817
Summary:
Currently, the following interleaving of events can lead to SuperVersion containing both immutable memtables as well as the resulting L0. This can cause Get to return incorrect result if there are merge operands. This may also affect other operations such as single deletes.
```
time main_thr bg_flush_thr bg_compact_thr compact_thr set_opts_thr
0 | WriteManifest:0
1 | issue compact
2 | wait
3 | Merge(counter)
4 | issue flush
5 | wait
6 | WriteManifest:1
7 | wake up
8 | write manifest
9 | wake up
10 | Get(counter)
11 | remove imm
V
```
The reason behind is that: one bg flush thread's installing new `Version` can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from current super version only after `LogAndApply` returns. After the leader MANIFEST writer signals (releasing mutex) this bg flush thread, it is possible that another thread sees this cf with both memtables (whose data have been flushed to the newest L0) and the L0 before this bg flush thread removes the memtables.
To address this issue, each bg flush thread can pass a callback function to `LogAndApply`. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex.
Test plan (devserver)
```
$make merge_test
$./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6069
Reviewed By: cheng-chang
Differential Revision: D18790894
Pulled By: riversand963
fbshipit-source-id: e41bd600c0448b4f4b2deb3f7677f95e3076b4ed
Summary:
Remove function calling in assert statement as assert is a no
op in opt build and that function might not be called. This causes hang
in closing RocksDB when refit level is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7581
Test Plan: make check -j64
Reviewed By: riversand963
Differential Revision: D24466420
Pulled By: akankshamahajan15
fbshipit-source-id: 97db4ec5a95ae693c3290e176a3c12a9b1ad2f6d
Summary:
This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479
Test Plan: Add new unit tests
Reviewed By: pdillinger
Differential Revision: D24340671
Pulled By: anand1976
fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2
Summary:
These notes existed on the release branches where they were backported, but were never added on master branch. Added them now and mentioned what minor release the fix originally appeared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7545
Reviewed By: riversand963
Differential Revision: D24281759
Pulled By: ajkr
fbshipit-source-id: 7422e984b667793d6260dd32a7492afcb2ff1c4b
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520
Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`
Reviewed By: pdillinger
Differential Revision: D24200034
Pulled By: ajkr
fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
1. Number of index and filter blocks read from file as part of MultiGet
request per level.
2. Number of data blocks read from file per level.
3. Number of SST files loaded from file system per level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366
Reviewed By: anand1976
Differential Revision: D24127040
Pulled By: akankshamahajan15
fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
Summary:
If `BottommostLevelCompaction.kForce*` is set, compaction should avoid
trivial move and always compact the sst to the target size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7368
Reviewed By: ajkr
Differential Revision: D23629525
Pulled By: jay-zhuang
fbshipit-source-id: 79f23c79ecb31587e0593b28cce43131107bbcd0
Summary:
This exposes to the listener interface whether a compaction was
full or not. Also cleaned up API comment for CompactionJobInfo::stats,
which is not of a nullable type. And since CompactionJob is always
created with non-null CompactionJobStats, removed conditionals on it
being nullptr and instead assert non-null.
TODO later: update C and Java interfaces
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451
Test Plan: updated existing unit tests to check new field, make check
Reviewed By: ltamasi
Differential Revision: D23977796
Pulled By: pdillinger
fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d
Summary:
Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated.
Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467
Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled.
Reviewed By: riversand963
Differential Revision: D24010683
fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c
Summary:
This has been running in production on some key workloads, so
we believe it to be safe and extremely low cost. Nevertheless, I've
added code to ensure that "force_consistency_checks" is mentioned in
any corruption reports so that people know how to disable in case of
false positive corruption reports.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7446
Test Plan:
make check, CI, temporary debug print new message with
./version_builder_test
Reviewed By: ajkr
Differential Revision: D23972101
Pulled By: pdillinger
fbshipit-source-id: 9623e400f3752577c0ecf977e6d0915562cf9968
Summary:
Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data.
By default value is set false to prevent users data to be exposed in the messages.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420
Test Plan:
1. make check -j64
2. Add a new test case
Reviewed By: ajkr
Differential Revision: D23835028
Pulled By: akankshamahajan15
fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1
Summary:
Two relatively simple functional changes to incremental backup
behavior, integrated with a minor refactoring to reduce code redundancy and
improve error/log message. There are nuances to the impact of these changes,
but I believe they are fundamentally good and generally safe. Those functional
changes:
* Incremental backups no longer read DB table files that are already saved to a
shared part of the backup directory, unless `share_files_with_checksum` is used
with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file
checksums are needed to determine file naming.
* Justification: incremental backups should not need to read the whole DB,
especially without rate limiting. (Although other BackupEngine reads are not
rate limited either, other non-trivial reads are generally limited by a
corresponding write, as in copying files.) Also, the fact that this is not
already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110.
* When considering whether a table file is already backed up in a shared part
of backup directory, BackupEngine would already query the sizes of source (DB)
and pre-existing destination (backup) files. BackupEngine now uses these file
sizes to detect corruption, as at least one of (a) old backup, (b) backup in
progress, or (c) current DB is corrupt if there's a size mismatch.
* Justification: a random related fix that also helps to cover a small hole
in corruption checking uncovered by the other functional change:
* For `share_table_files` without "checksum" (not recommended), the other
change regresses in detecting fundamentally unsafe use of this option
combination: when you might generate different versions of same SST file
number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this
regression is greatly mitigated by the new file size checking. Nevertheless,
almost no reason to use `share_files_with_checksum=false` should remain, and
comments are updated appropriately.
Also, this change renames internal function `CalculateChecksum` to
`ReadFileAndComputeChecksum` to make the performance impact of this function
clear in code reviews.
It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it
cannot be true for a DB with unique file names (like DBImpl). Nevertheless,
I've tried to keep its functionality intact when `true` to minimize risk for
now, despite having no unit tests for which it is true.
Select impact details (much more in unit tests): For
`share_files_with_checksum`, I am confident there is no regression (vs.
pre-6.12) in detecting DB or backup corruption at backup creation time, mostly
because the old design did not leverage this extra checksum computation for
detecting inconsistencies at backup creation time. (With computed checksums in
names, a recently corrupted file just looked like a different file vs. what was
already backed up.)
Even in the hypothetical case of DB session id collision (~100 bits entropy
collision), file size in name and/or our file size check add an extra layer of
protection against false success in creating an accurate new backup. (Unit test
included.)
`DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking
are still able to catch corruptions that `CreateNewBackup` does not. Note that
when custom file checksum support is added to BackupEngine, that will
essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`.
We could add options for `CreateNewBackup` to cover some of what would be
caught by `VerifyBackup` with checksum checking.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413
Test Plan:
Two new unit tests included, both of which fail without these
changes. Although we don't test the I/O improvement directly, we test it
indirectly in DB corruption detection power that was inadvertently unlocked
with new backup file naming PLUS computing current content checksums (now
removed). (I don't think that case of DB corruption detection justifies reading
the whole DB on incremental backup.)
Reviewed By: zhichao-cao
Differential Revision: D23818480
Pulled By: pdillinger
fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac
Summary:
This change reverts BackupEngine to 6.12 state to accommodate a
higher-priority fix that does not easily merge with this custom checksum
support. We intend to reinstate this support soon, by merging a revert
of this change.
For backupable_db_test, I've removed the tests depending on this
feature.
I've also removed relevant HISTORY.md entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411
Test Plan: unit tests
Reviewed By: ajkr
Differential Revision: D23793835
Pulled By: pdillinger
fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310
Test Plan: adding new unit test, pass make check.
Reviewed By: anand1976
Differential Revision: D23710892
Pulled By: zhichao-cao
fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.
This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.
We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).
We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.
Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`
Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.
Actual file name patterns and priorities, as regexes:
kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
[0-9]+_[0-9]+_[0-9]+[.]sst
kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
[0-9]+_[0-9a-fA-F-]+[.]sst
kUseDbSessionId AND NOT kFlagIncludeFileSize ->
[0-9]+_s[0-9A-Z]{20}[.]sst
kUseDbSessionId AND kFlagIncludeFileSize (default) ->
[0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst
We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.
This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400
Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.
Reviewed By: ajkr
Differential Revision: D23759587
Pulled By: pdillinger
fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
Summary:
Make "unreleased" section for HISTORY.md with things misplaced
into 6.12 and 6.13
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7401
Test Plan: see how it goes, and `git diff origin/6.13.fb HISTORY.md`
Reviewed By: jay-zhuang
Differential Revision: D23759740
Pulled By: pdillinger
fbshipit-source-id: fc441916c7ff2bbb8d5384137653b340d4c47674
Summary:
Cleaned up the public API to use the EncryptedEnv. This change will allow providers to be developed and added to the system easier in the future. It will also allow better integration in the future with the OPTIONS file.
- The internal classes were moved out of the public API into an internal "env_encryption_ctr.h" header. Short-cut constructors were added to provide the original API functionality.
- The APIs to the constructors were changed to take shared_ptr, rather than raw pointers or references to allow better memory management and alternative implementations.
- CreateFromString methods were added to allow future expansion to other provider and cipher implementations through a standard API.
Additionally, there was a code duplication in the NewXXXFile methods. This common code was moved under a templatized function.
A first-pass at structuring the code was made to potentially allow multiple EncryptionProviders in a single EncryptedEnv. The idea was that different providers may use different cipher keys or different versions/algorithms. The EncryptedEnv should have some means of picking different providers based on information. The groundwork was started for this (the use of the provider_ member variable was localized) but the work has not been completed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7279
Reviewed By: jay-zhuang
Differential Revision: D23709440
Pulled By: zhichao-cao
fbshipit-source-id: 0e845fff0e03a52603eb9672b4ade32d063ff2f2
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
During bottommost compaction, RocksDB cannot simply drop a tombstone if
this tombstone is not in the earliest snapshot. The current behavior is: RocksDB
skips other internal keys (of the same user key) in the same snapshot range. In
the meantime, RocksDB should check for the `shutting_down` flag. Otherwise, it
is possible for a bottommost compaction that has already started running to take
a long time to finish, even if the application has tried to cancel all background jobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7356
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D23663241
Pulled By: riversand963
fbshipit-source-id: 25f8e9b51bc3bfa3353cdf87557800f9d90ee0b5
Summary:
https://github.com/facebook/rocksdb/issues/3341 guaranteed that upon return of `GetSortedWalFiles` after
`DisableFileDeletions`, all pending purges of previously obsolete WAL
files will have finished. However, the addition of
avoid_unnecessary_blocking_io in https://github.com/facebook/rocksdb/issues/5043 opened a hole in the code making
that assurance, which can lead to files to be copied for checkpoint or
backup going missing before being copied, with that option enabled.
This change patches the hole.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7369
Test Plan:
apparent fix to backups in crash test observed. Will work
on a unit test for another commit
Reviewed By: ajkr
Differential Revision: D23620258
Pulled By: pdillinger
fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614
Summary:
This is adapted from https://github.com/facebook/rocksdb/issues/6678 but takes a different approach, avoiding opening a read-write DB and avoiding the `DeleteFile()` API.
First, this PR refactors how options variables are initialized in `ldb` so it can be reused in a subcommand that doesn't open a DB:
- Separated remaining option initialization logic out of `OpenDB()`. The new `PrepareOptions()` function initializes the full options state.
- Fixed an old TODO about applying the subcommand CF option overrides to the proper `ColumnFamilyOptions` object.
Second, this PR adds the `ldb unsafe_remove_sst_file` subcommand. It uses the `VersionSet`-level APIs to remove the file with the specified number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7335
Test Plan: played with interactive python and this file removal command. Verified openability/correct results in case of multiple column families, multiple levels, etc.
Reviewed By: pdillinger
Differential Revision: D23454575
Pulled By: ajkr
fbshipit-source-id: 039b7a8cbfc42fd123dcb25821eef51d61148afe
Summary:
Since we can't land https://github.com/facebook/rocksdb/issues/7336 until the next major release, added a strong warning against the `DeleteFile()` API in the meantime.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7337
Reviewed By: pdillinger
Differential Revision: D23459728
Pulled By: ajkr
fbshipit-source-id: 326cb9b18190386080c35c761a8736d8a877dafb
Summary:
In block-based table builder, the cut-over from buffered to unbuffered
mode involves sampling the buffered blocks and generating a dictionary.
There was a bug where `SstFileWriter` passed zero as the `target_file_size`
causing the cutover to happen immediately, so there were no samples
available for generating the dictionary.
This PR changes the meaning of `target_file_size == 0` to mean buffer
the whole file before cutting over. It also adds dictionary compression
support to `sst_dump --command=recompress` for easy evaluation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7323
Reviewed By: cheng-chang
Differential Revision: D23412158
Pulled By: ajkr
fbshipit-source-id: 3b232050e70ef3c2ee85a4b5f6fadb139c569873
Summary:
- Closes https://github.com/facebook/rocksdb/issues/6490
- Currently MERGEs are converted to PUTs at bottom or compaction has reached the beginning of the key, this can wrongly cover a PUT future base case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166
Test Plan:
- Automated: `make all check`
- Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then run ingesting external files to make sure the base case is probably compacted with existing Merges.
Reviewed By: cheng-chang
Differential Revision: D23325425
Pulled By: ajkr
fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68
Summary:
L0 score is based on size target and number of files. The size target
used is `max_bytes_for_level_base`. However, the base level's size can
dynamically expand in write burst mode. In fact, it can expand so much
that L0->Lbase becomes the highest fanout in target sizes. This doesn't
make sense from an efficiency perspective, so this PR bounds the
L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based
on file count remains unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325
Test Plan:
contrived benchmark that exhibits the problem:
```
$ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000
```
Results:
- "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark
- "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction
Branch | Burst W-Amp | Total W-Amp | fillrandom (MB/s)
-- | -- | -- | --
master | 20.2 | 21.5 | 4.7
dynamic-l0-score | 12.6 | 14.1 | 7.2
Reviewed By: siying
Differential Revision: D23412935
Pulled By: ajkr
fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e
Summary:
Add a unit test case to check memory usage when
max_write_buffer_size_to_maintain is set if flushed immutable memtables are
trimmed timely or not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7311
Test Plan: Compared the results with before bug fix.
Reviewed By: ltamasi
Differential Revision: D23321702
Pulled By: akankshamahajan15
fbshipit-source-id: da04ee21137d641a07fd499a9e2749eb036fcb1e
Summary:
A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312
Reviewed By: anand1976
Differential Revision: D23329847
Pulled By: jay-zhuang
fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.
This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.
Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309
Test Plan: updated unit test, before and after
Reviewed By: ajkr
Differential Revision: D23311994
Pulled By: pdillinger
fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
Summary:
Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297
Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it.
Reviewed By: anand1976
Differential Revision: D23257950
fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
Summary:
When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108
Test Plan: make check, listener_test.
Reviewed By: ajkr
Differential Revision: D22470240
Pulled By: zhichao-cao
fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
Summary:
After releasing a snapshot, it checks whether it is suitable to trigger bottom compactions.
When disabling auto compactions, it may still schedule compaction when releasing a snapshot. Whereas no compaction job will be actually handled, so the state of LSM is not changed and compaction will be triggered again and again every time releasing a snapshot.
Too frequent compactions lead to high CPU usage and high db_mutex lock contention which affects foreground write duration finally.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7267
Test Plan:
- make check
- manual test
Reviewed By: akankshamahajan15
Differential Revision: D23252880
Pulled By: ajkr
fbshipit-source-id: 4431e071a35d9912a2a3592875db27bae521434b
Summary:
When a memtable is trimmed in MemTableListVersion, the memtable
is only added to delete list if it is
the last reference. However it is not the last reference as it is held
by the super version. But the super version would not be switched if the
delete list is empty. So the memtable is never destroyed and memory
usage increases beyond write_buffer_size +
max_write_buffer_size_to_maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296
Test Plan:
1. ./db_bench -benchmarks=randomtransaction
-optimistic_transaction_db=1 -statistics -stats_interval_seconds=1
-duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000
--transaction_set_snapshot
Reviewed By: ltamasi
Differential Revision: D23267395
Pulled By: akankshamahajan15
fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650
Summary:
On a read-write DB configured with
DBOptions::file_checksum_gen_factory, BackupEngine::CreateNewBackup can
fail intermittently, with non-OK status. This is due to a race between
GetLiveFiles and GetLiveFilesChecksumInfo in creating backups.
For patching 6.12 release (as this commit is intended for, except this is a
forward-merged version), we can simply treat files for which we falsely failed
to get checksum info as legacy files lacking checksum info.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7294
Test Plan: unit test reproducer included
Reviewed By: ajkr
Differential Revision: D23253489
Pulled By: pdillinger
fbshipit-source-id: 9e4945dad120b776ad3e753be10b962f61f28e14
Summary:
The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271
Reviewed By: riversand963
Differential Revision: D23169923
Pulled By: ajkr
fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50
Summary:
Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D22914704
Pulled By: riversand963
fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39
Summary:
There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261
Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds.
Reviewed By: ajkr
Differential Revision: D23142269
fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223
Reviewed By: riversand963
Differential Revision: D23056737
Pulled By: jay-zhuang
fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
Summary:
Manual compaction with `CompactRangeOptions::change_levels` set could
refit to a level targeted by another manual compaction. If
force_consistency_checks were disabled, it could be possible for
overlapping files to be written at that target level.
This PR prevents the possibility by calling `DisableManualCompaction()`
prior to `ReFitLevel()`. It also improves the manual compaction disabling
mechanism to wait for pending manual compactions to complete before
returning, and support disabling from multiple threads.
Fixes https://github.com/facebook/rocksdb/issues/6432.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250
Test Plan:
crash test command that repro'd the bug reliably:
```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
```
Reviewed By: ltamasi
Differential Revision: D23090800
Pulled By: ajkr
fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.
Tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085
Test Plan: Passed make check
Reviewed By: pdillinger
Differential Revision: D22390756
Pulled By: gg814
fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
Summary:
https://github.com/facebook/rocksdb/pull/5289 introduces a performance regression that caused an upper bound check within every BlockBasedTableIterator::Next(). This is unnecessary if we've checked the boundary key for current block and it is within upper bound.
Fix the bug. Also rename the boolean to a enum so that the code is slightly better readable. The original regression was probably to fix a bug that the block upper bound check status is not reset after a new block is created. Fix it bug so that the regression can be avoided without hitting the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7209
Test Plan: Run all existing tests. Will run atomic black box crash test for a while.
Reviewed By: anand1976
Differential Revision: D22859246
fbshipit-source-id: cbdad1f5e656c55fd8b71726d5a4f6cb53ff9140
Summary:
SST Partitioner interface that allows to split SST files during compactions.
It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957
Reviewed By: ajkr
Differential Revision: D22461239
fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file. If the values do not match, the SST file is rejected.
Code put in place for the check for both flush and compaction jobs. Corresponding test added to corruption_test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134
Reviewed By: cheng-chang
Differential Revision: D22646149
fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
Summary:
Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem.
`steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153
Test Plan: make check.
Reviewed By: riversand963
Differential Revision: D22654136
Pulled By: roghnin
fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85
Summary:
BackupEngine requires computing table checksums twice when backing up table files to the `shared_checksum` directory.
The repeated computation can be avoided by utilizing the db session id stored as a part of the table properties.
Filenames of table files in the `shared_checksum` directory depend on the following conditions:
1. the naming scheme is `kOptionalChecksumAndDbSessionId`,
2. `db_session_id` is not empty,
3. checksum is available in the DB manifest.
If 1,2,3 are satisfied, then the filenames will be of the form `<file_number>_<checksum>_<db_session_id>.sst`.
If 1,2 are satisfied, then the filenames will be of the form `<file_number>_<db_session_id>.sst`.
In all other cases, the filenames are of the form `<file_number>_<checksum>_<size>.sst`.
Additionally, if `kOptionalChecksumAndDbSessionId` is used (and not falling back to `kChecksumAndFileSize`), the `<checksum>` appeared in the filenames is hexadecimally encoded, instead of being plain `uint32_t` value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7110
Test Plan: backupable_db_test and manual tests.
Reviewed By: ajkr
Differential Revision: D22508992
Pulled By: gg814
fbshipit-source-id: 5669f0ea9ad5a097f69f6d87aca4abba15032389
Summary:
Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145
Reviewed By: riversand963
Differential Revision: D22593031
Pulled By: ajkr
fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
Reviewed By: anand1976
Differential Revision: D21916789
Pulled By: zhichao-cao
fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
Summary:
Currently, RocksDB lets compaction to go through even in case of
corrupted keys, the number of which is reported in CompactionJobStats.
However, RocksDB does not check this value. We should let compaction run
in a stricter mode.
Temporarily disable two tests that allow corrupted keys in compaction.
With this PR, the two tests will assert(false) and terminate. Still need
to investigate what is the recommended google-test way of doing it.
Death test (EXPECT_DEATH) in gtest has warnings now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D22530722
Pulled By: riversand963
fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6
Summary:
This is a followup to https://github.com/facebook/rocksdb/issues/6646. In that PR, for simplicity I just appended a comparison against the 0th restart key in case `BinarySeek()`'s binary search landed at index 0. As a result there were `2/(N+1) + log_2(N)` key comparisons. This PR does it differently. Now we expand the binary search range by one so it also covers the case where target is at or before the restart key at index 0. As a result, it involves `log_2(N+1)` key comparisons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7068
Test Plan:
ran readrandom with mostly default settings and counted key comparisons
using `PerfContext`.
before: `user_key_comparison_count = 28881965`
after: `user_key_comparison_count = 27823245`
setup command:
```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```
benchmark command:
```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```
Reviewed By: anand1976
Differential Revision: D22357032
Pulled By: ajkr
fbshipit-source-id: 8b01e9c1c2a4e9d02fc9dfe16c1cc0327f8bdf24
Summary:
When format_version is high enough to support user-key and
there are index entries for same user key that spans multiple data
blocks then it changes from user-key mode to internal-key mode. But the
flush policy is not reset to point to Block Builder of internal-keys.
After this switch, no entries are added to user key index partition
result, thus it never triggers flushing the block.
Fix: 1. After adding the entry in sub_builder_index_, if there is a switch
from user-key to internal-key, then flush policy is updated to point to
Block Builder of internal-keys index partition.
2. Set sub_builder_index_->seperator_is_key_plus_seq_ = true if
seperator_is_key_plus_seq_ is set to true so that subsequent partitions
can also use internal key mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7096
Test Plan: make check -j64
Reviewed By: ajkr
Differential Revision: D22416598
Pulled By: akankshamahajan15
fbshipit-source-id: 01fc2dc07ea1b32f8fb803995ebe6e9a3fbe67ac
Summary:
Currently, `EventListener` in listner.h only have callback functions for file read and write. One may favor extended callback functions for more file I/O operations like flush, sync and close. This PR tries to add those interface and have them called when appropriate throughout the code base.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7055
Test Plan:
Write an experimental listener with those new callback functions with log output in them; run experiments and check logs to see those functions are actually called.
Default test suits `make check` should also be included.
Reviewed By: riversand963
Differential Revision: D22380624
Pulled By: roghnin
fbshipit-source-id: 4121491d45c2c2aae8c255e7998090559a241c6a
Summary:
When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory.
After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process.
No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function.
In addition, If `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files.
Corresponding tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015
Test Plan: Passed make check
Reviewed By: pdillinger
Differential Revision: D22165732
Pulled By: gg814
fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec