Summary:
Added the Blob option settings from the AdvancedColmnFamilyOptions to the C API.
There are no tests for getting/setting options in the C API currently, hence no specific test plans. Should there be a some?
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8148
Reviewed By: ltamasi
Differential Revision: D27568495
Pulled By: mrambacher
fbshipit-source-id: 3a52b784467ea2c4bc58be5f75c5d41f0a5c55d6
Summary:
Extend the DB::GetLiveFilesChecksumInfo API to blob files.
This API is also used by the file_checksum_dump ldb command to dump checksum
of SST files which now also dumps blob files checksum.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8179
Test Plan: Add new unit test
Reviewed By: zhichao-cao
Differential Revision: D27714965
Pulled By: akankshamahajan15
fbshipit-source-id: d8b7343ea845a64c83800336d88cced7152a8c92
Summary:
Before this PR, `get_iostats_context()` will silently return a nullptr if no thread_local support is detected.
This can be the result of build_detect_platform's failure to compile the simple code snippet on certain platforms, as
reported in https://github.com/facebook/mysql-5.6/issues/904.
To be safe, we should fail the compilation if user does not opt out IOStatsContext and
ROCKSDB_SUPPORT_THREAD_LOCAL is not defined.
If RocksDB relies on c++11, can we just always use thread_local? It turns out there might be
performance concerns (https://github.com/facebook/rocksdb/issues/5774),
which is beyond the scope of this PR. We can revisit this later. Here, we stick to the original impl.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8117
Reviewed By: ajkr
Differential Revision: D27356847
Pulled By: riversand963
fbshipit-source-id: f7d5776842277598d8341b955febb601946801ae
Summary:
* CreateNewBackup(WithMetadata) returning the BackupID of new backup
through optional new output param. This is especially useful with the
new mutithreading support, so that you can transactionally determine the
ID of a backup you create.
* GetBackupInfo / GetLatestBackupInfo for individual backups, so that
you don't have to comb through a vector of backups if you don't want to.
Updated HISTORY.md (including re: BlobDB support as new feature)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8170
Test Plan:
Added test logic to existing tests, to minimize increase in
cost of running tests
Reviewed By: zhichao-cao
Differential Revision: D27680410
Pulled By: pdillinger
fbshipit-source-id: 1fc45b73d81aae293ccd4a43d9583d7fd915d3eb
Summary:
Current flush reason attribution is misleading or incorrect (depending on what the original intention was):
- Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager`
- Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull`
This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150
Reviewed By: zhichao-cao
Differential Revision: D27569645
Pulled By: ot
fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc
Summary:
Add support for blob files for backup/restore like table files.
Since DB session ID is currently not supported for blob files (there is no place to store it in
the header), so for blob files uses the
kLegacyCrc32cAndFileSize naming scheme even if
share_files_with_checksum_naming is set to kUseDbSessionId.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8129
Test Plan: Add new test units
Reviewed By: ltamasi
Differential Revision: D27408510
Pulled By: akankshamahajan15
fbshipit-source-id: b27434d189a639ef3e6ad165c61a143a2daaf06e
Summary:
A current limitation of backups is that you don't know the
exact database state of when the backup was taken. With this new
feature, you can at least inspect the backup's DB state without
restoring it by opening it as a read-only DB.
Rather than add something like OpenAsReadOnlyDB to the BackupEngine API,
which would inhibit opening stackable DB implementations read-only
(if/when their APIs support it), we instead provide a DB name and Env
that can be used to open as a read-only DB.
Possible follow-up work:
* Add a version of GetBackupInfo for a single backup.
* Let CreateNewBackup return the BackupID of the newly-created backup.
Implementation details:
Refactored ChrootFileSystem to split off new base class RemapFileSystem,
which allows more general remapping of files. We use this base class to
implement BackupEngineImpl::RemapSharedFileSystem.
To minimize API impact, I decided to just add these fields `name_for_open`
and `env_for_open` to those set by GetBackupInfo when
include_file_details=true. Creating the RemapSharedFileSystem adds a bit
to the memory consumption, perhaps unnecessarily in some cases, but this
has been mitigated by (a) only initialize the RemapSharedFileSystem
lazily when GetBackupInfo with include_file_details=true is called, and
(b) using the existing `shared_ptr<FileInfo>` objects to hold most of the
mapping data.
To enhance API safety, RemapSharedFileSystem is wrapped by new
ReadOnlyFileSystem which rejects any attempts to write. This uncovered a
couple of places in which DB::OpenForReadOnly would write to the
filesystem, so I fixed these. Added a release note because this affects
logging.
Additional minor refactoring in backupable_db.cc to support the new
functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142
Test Plan:
new test (run with ASAN and UBSAN), added to stress test and
ran it for a while with amplified backup_one_in
Reviewed By: ajkr
Differential Revision: D27535408
Pulled By: pdillinger
fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148
Summary:
Add request_id in IODebugContext which will be populated by
underlying FileSystem for IOTracing purposes. Update IOTracer to trace
request_id in the tracing records. Provided API
IODebugContext::SetRequestId which will set the request_id and enable
tracing for request_id. The API hides the implementation and underlying
file system needs to call this API directly.
Update DB::StartIOTrace API and remove redundant Env* from the
argument as its not used and DB already has Env that is passed down to
IOTracer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8045
Test Plan: Update unit test.
Differential Revision: D26899871
Pulled By: akankshamahajan15
fbshipit-source-id: 56adef52ee5af0fb3060b607c3af1ec01635fa2b
Summary:
Added `TableProperties::{fast,slow}_compression_estimated_data_size`.
These properties are present in block-based tables when
`ColumnFamilyOptions::sample_for_compression > 0` and the necessary
compression library is supported when the file is generated. They
contain estimates of what `TableProperties::data_size` would be if the
"fast"/"slow" compression library had been used instead. One
limitation is we do not record exactly which "fast" (ZSTD or Zlib)
or "slow" (LZ4 or Snappy) compression library produced the result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139
Test Plan:
- new unit test
- ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression
Reviewed By: riversand963
Differential Revision: D27454338
Pulled By: ajkr
fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4
Summary:
BackupEngine previously had unclear but strict concurrency
requirements that the API user must follow for safe use. Now we make
that clear, by separating operations into "Read," "Append," and "Write"
operations, and specifying which combinations are safe across threads on
the same BackupEngine object (previously none; now all, using a
read-write lock), and which are safe across different BackupEngine
instances open on the same backup_dir.
The changes to backupable_db.h should be backward compatible. It is
mostly about eliminating copies of what should be the same function and
(unsurprisingly) useful documentation comments were often placed on
only one of the two copies. With the re-organization, we are also
grouping different categories of operations. In the future we might add
BackupEngineReadAppendOnly, but that didn't seem necessary.
To mark API Read operations 'const', I had to mark some implementation
functions 'const' and some fields mutable.
Functional changes:
* Added RWMutex locking around public API functions to implement thread
safety on a single object. To avoid future bugs, this is another
internal class layered on top (removing many "override" in
BackupEngineImpl). It would be possible to allow more concurrency
between operations, rather than mutual exclusion, but IMHO not worth the
work.
* Fixed a race between Open() (Initialize()) and CreateNewBackup() for
different objects on the same backup_dir, where Initialize() could
delete the temporary meta file created during CreateNewBackup().
(This was found by the new test.)
Also cleaned up a couple of "status checked" TODOs, and improved a
checksum mismatch error message to include involved files.
Potential follow-up work:
* CreateNewBackup has an API wart because it doesn't tell you the
BackupID it just created, which makes it of limited use in a multithreaded
setting.
* We could also consider a Refresh() function to catch up to
changes made from another BackupEngine object to the same dir.
* Use a lock file to prevent multiple writer BackupEngines, but this
won't work on remote filesystems not supporting lock files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8115
Test Plan:
new mini-stress test in backup unit tests, run with gcc,
clang, ASC, TSAN, and UBSAN, 100 iterations each.
Reviewed By: ajkr
Differential Revision: D27347589
Pulled By: pdillinger
fbshipit-source-id: 28d82ed2ac672e44085a739ddb19d297dad14b15
Summary:
Ran a spell check over the comments in the include/rocksdb directory and fixed any mis-spellings.
There are still some variable names that are spelled incorrectly (like SizeApproximationOptions::include_memtabtles, SstFileMetaData::oldest_ancester_time) that were not fixed, as those would break compilation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8120
Reviewed By: zhichao-cao
Differential Revision: D27366034
Pulled By: mrambacher
fbshipit-source-id: 6a3f3674890bb6acc751e9c5887a8fbb6adca5df
Summary:
Previously it only applied to block-based tables generated by flush. This restriction
was undocumented and blocked a new use case. Now compression sampling
applies to all block-based tables we generate when it is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105
Test Plan: new unit test
Reviewed By: riversand963
Differential Revision: D27317275
Pulled By: ajkr
fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071
Test Plan: make check, added new unit test.
Reviewed By: anand1976
Differential Revision: D27177043
Pulled By: zhichao-cao
fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
Summary:
Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume sucess; Histogram for auto resume retry count in each recovery call.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050
Test Plan: make check and add test to error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D26990565
Pulled By: zhichao-cao
fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
Summary:
Extend support to track blob files in SST File manager.
This PR notifies SstFileManager whenever a new blob file is created,
via OnAddFile and an obsolete blob file deleted via OnDeleteFile
and delete file via ScheduleFileDeletion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037
Test Plan: Add new unit tests
Reviewed By: ltamasi
Differential Revision: D26891237
Pulled By: akankshamahajan15
fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
Summary:
These classes were wraps of Env that provided only extensions to the FileSystem functionality. Changed the classes to be FileSystems and the wraps to be of the CompositeEnvWrapper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7968
Reviewed By: anand1976
Differential Revision: D26900253
Pulled By: mrambacher
fbshipit-source-id: 94001d8024a3c54a1c11adadca2bac66c3af2a77
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes.
For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved.
Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
(Note that the times for 6.18 are prior to revert of the SystemClock).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
Reviewed By: pdillinger
Differential Revision: D27014563
Pulled By: mrambacher
fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
Summary:
This API can be used for things like determining how much space
can be freed up by deleting a particular backup, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042
Test Plan:
validation of the API added to many existing backup unit
tests
Reviewed By: mrambacher
Differential Revision: D26936577
Pulled By: pdillinger
fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a
Summary:
This PR adds support for custom file systems to ldb and sst_dump by adding command line options for specifying --fs_uri and --backup_fs uri (for ldb backup/restore commands). fs_uri is already supported in db_bench and db_stress, and there is already support in ldb and db stress for specifying customized envs.
The PR also fixes what looks like a bug in the ldb backup/restore commands. As it is right now, backups can only be made from and to the same environment/file system which does not seem to be the intended behavior. This PR makes it possible to do/restore backups between different envs/file systems.
Example:
`./ldb backup --fs_uri=zenfs://dev:nvme2n1 --backup_fs_uri=posix:// --backup_dir=/tmp/my_rocksdb_backup --db=rocksdbtest/dbbench
`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8010
Reviewed By: jay-zhuang
Differential Revision: D26904654
Pulled By: ajkr
fbshipit-source-id: 9b695ed8b944fcc6b27c4daaa9f52e87ee2c1fb4
Summary:
Removed confusing, awkward, and undocumented internal API
ReadOneLine and replaced with very simple LineFileReader.
In refactoring backupable_db.cc, this has the side benefit of
removing the arbitrary cap on the size of backup metadata files.
Also added Status::MustCheck to make it easy to mark a Status as
"must check." Using this, I can ensure that after
LineFileReader::ReadLine returns false the caller checks GetStatus().
Also removed some excessive conditional compilation in status.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8026
Test Plan: added unit test, and running tests with ASSERT_STATUS_CHECKED
Reviewed By: mrambacher
Differential Revision: D26831687
Pulled By: pdillinger
fbshipit-source-id: ef749c265a7a26bb13cd44f6f0f97db2955f6f0f
Summary:
New comment for share_files_with_checksum:
// Only used if share_table_files is set to true. Setting to false is
// DEPRECATED and potentially dangerous because in that case BackupEngine
// can lose data if backing up databases with distinct or divergent
// history, for example if restoring from a backup other than the latest,
// writing to the DB, and creating another backup. Setting to true (default)
// prevents these issues by ensuring that different table files (SSTs) with
// the same number are treated as distinct. See
// share_files_with_checksum_naming and ShareFilesNaming.
I have also removed interim option kFlagMatchInterimNaming, which is no
longer needed and was never needed for correct+compatible operation
(just performance).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020
Test Plan:
tests updated. Backward+forward compatibility verified with
SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup
options, and I manually verified shared_checksum in
/tmp/rocksdb_format_compatible_peterd/bak/current/ after run.
Reviewed By: ajkr
Differential Revision: D26786331
Pulled By: pdillinger
fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130
Summary:
Haven't seen any production issues with new Bloom filter and
it's now > 1 year old (added in 6.6.0).
Updated check_format_compatible.sh and HISTORY.md
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017
Test Plan: tests updated (or prior bugs fixed)
Reviewed By: ajkr
Differential Revision: D26762197
Pulled By: pdillinger
fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2
Summary:
I recently discovered the confusing, undocumented semantics of
Read() functions in the FileSystem and Env APIs. I have added
clarification to the best of my reverse-engineered understanding, and
made a note in HISTORY.md for implementors to check their
implementations, as a subtly non-adherent implementation could lead to
RocksDB quietly ignoring some portion of a file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8029
Test Plan: no code changes
Reviewed By: anand1976
Differential Revision: D26831698
Pulled By: pdillinger
fbshipit-source-id: 208f97ff6037bc13bb2ef360b987c2640c79bd03
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022
Test Plan: Ran `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D26801199
Pulled By: ltamasi
fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
Summary:
This PR adds SetBufferSize() to the WriteBufferManager object. This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows.
The buffer_size_ member variable is now atomic to match design of other changeable size_t members within WriteBufferManager.
This change is useful as is. However, this change is also essential if someone decides they wanted to enable db_write_buffer_size modifications through the DB::SetOptions() API, i.e. no waste taking this as is.
Any format / spacing changes are due to clang-format as required by check-in automation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961
Reviewed By: ajkr
Differential Revision: D26639075
Pulled By: akankshamahajan15
fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.
The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D26509280
Pulled By: riversand963
fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file. The readahead starts at 8KB and doubles on every
additional read upto BlockBasedTable::kMaxAutoReadAheadSize which is
256*1024.
This PR adds a new option BlockBasedTableOptions::max_auto_readahead_size which
replaces BlockBasedTable::kMaxAutoReadAheadSize and the new option can be
configured.
If max_auto_readahead_size is set 0 then no implicit auto prefetching will
be done. If max_auto_readahead_size provided is less than
8KB (which is initial readahead size used by rocksdb in case of
auto-readahead), readahead size will remain same as max_auto_readahead_size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951
Test Plan: Add new unit test case.
Reviewed By: anand1976
Differential Revision: D26568085
Pulled By: akankshamahajan15
fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.
Related changes include:
- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
Reviewed By: pdillinger
Differential Revision: D26467994
Pulled By: ajkr
fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
Summary:
Added a "only_mutable_options" flag to the ConfigOptions. When set, the Configurable methods will only look at/update options that are marked as kMutable.
Fixed DB::SetOptions to allow for the update of any mutable TableFactory options. Fixes https://github.com/facebook/rocksdb/issues/7385.
Added tests for the new flag. Updated HISTORY.md
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936
Reviewed By: akankshamahajan15
Differential Revision: D26389646
Pulled By: mrambacher
fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
Summary:
Hi, I noticed a bug in rocksdb C API, where a function is not exported and created a fix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7967
Reviewed By: jay-zhuang
Differential Revision: D26505722
Pulled By: ajkr
fbshipit-source-id: 05d676dbd59ec87fe32322cda9e39e405b07178d
Summary:
The patch adds checkpoint support to BlobDB. Blob files are hard linked or
copied, depending on whether the checkpoint directory is on the same filesystem
or not, similarly to table files.
TODO: Add support for blob files to `ExportColumnFamily` and to the checksum
verification logic used by backup/restore.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7959
Test Plan: Ran `make check` and the crash test for a while.
Reviewed By: riversand963
Differential Revision: D26434768
Pulled By: ltamasi
fbshipit-source-id: 994be55a8dc08133028250760fca440d2c7c4dc5
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523
Test Plan: add new unit test, pass make check/ make asan_check
Reviewed By: pdillinger
Differential Revision: D24313271
Pulled By: zhichao-cao
fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.
Fixes https://github.com/facebook/rocksdb/issues/7913.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929
Test Plan: unit tests covering the cases it's permitted/rejected
Reviewed By: ltamasi
Differential Revision: D26240254
Pulled By: ajkr
fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584
Test Plan: Run all existing tests.
Reviewed By: pdillinger
Differential Revision: D20626739
fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).
The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.
When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748
Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.
Reviewed By: pdillinger
Differential Revision: D25754492
Pulled By: ajkr
fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035
Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
- If the GIT branch is "clean" (no changes), the date of the last git commit
- If the branch is not clean, the current date
- Added APIs to access the "build information", rather than accessing the strings directly.
The build_version.cc file is now regenerated whenever the library objects are rebuilt.
Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866
Reviewed By: pdillinger
Differential Revision: D26086565
Pulled By: mrambacher
fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
When the --try_load_options is used in conjunction with the
--column_family option, ldb incorrectly sets the ColumnFamilyOptions for
that column family to defaults. This PR fixes that by retaining from the
OPTIONS file and applying command line overrides.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7847
Test Plan: Add a unit test in ldb_cmd_test
Reviewed By: ajkr
Differential Revision: D25874720
Pulled By: anand1976
fbshipit-source-id: 04bcf23b55e5a30b5b6a59b0e5cb4faef3da7429
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.
Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819
Reviewed By: ajkr
Differential Revision: D25837394
Pulled By: jay-zhuang
fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
Summary:
Add new API WriteBufferManager::dummy_entries_in_cache_usage() which reports the dummy entries size stored in cache to account for DataBlocks in WriteBufferManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7837
Test Plan: Updated test ./write_buffer_manager_test
Reviewed By: ajkr
Differential Revision: D25794312
Pulled By: akankshamahajan15
fbshipit-source-id: 197f5e8701e3dc57a7df72dab1735624f90daf4b
Summary:
This PR does the following:
-> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.
With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system
- Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself.
With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction.
Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703
Reviewed By: zhichao-cao
Differential Revision: D25762190
Pulled By: anand1976
fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
Summary:
1. Made `WriteBatchWithIndexInternal` into a class that stores the `DB*` or `DBOptions*`.
2. Changed the `GetFromBatch` method to be non-static and use an instance of the class. Added `MergeKey` methods to perform the merge itself and return any status.
This change unifies the multiple calls to the `MergeHelper` under a single wrapped API.
Closes https://github.com/facebook/rocksdb/issues/6683
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6851
Reviewed By: ajkr
Differential Revision: D21706574
Pulled By: pdillinger
fbshipit-source-id: 6860bd64d62669aaa591846e914eed3b674e68b1
Summary:
Range Locking - an implementation based on the locktree library
- Add a RangeTreeLockManager and RangeTreeLockTracker which implement
range locking using the locktree library.
- Point locks are handled as locks on single-point ranges.
- Add a unit test: range_locking_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7506
Reviewed By: akankshamahajan15
Differential Revision: D25320703
Pulled By: cheng-chang
fbshipit-source-id: f86347384b42ba2b0257d67eca0f45f806b69da7
Summary:
So that we can more easily get aggregate live table data such
as total filter, index, and data sizes.
Also adds ldb support for getting properties
Also fixed some missing/inaccurate related comments in db.h
For example:
$ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
rocksdb.aggregated-table-properties.data_size: 102871
rocksdb.aggregated-table-properties.filter_size: 0
rocksdb.aggregated-table-properties.index_partitions: 0
rocksdb.aggregated-table-properties.index_size: 2232
rocksdb.aggregated-table-properties.num_data_blocks: 100
rocksdb.aggregated-table-properties.num_deletions: 0
rocksdb.aggregated-table-properties.num_entries: 15000
rocksdb.aggregated-table-properties.num_merge_operands: 0
rocksdb.aggregated-table-properties.num_range_deletions: 0
rocksdb.aggregated-table-properties.raw_key_size: 288890
rocksdb.aggregated-table-properties.raw_value_size: 198890
rocksdb.aggregated-table-properties.top_level_index_size: 0
$ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
rocksdb.aggregated-table-properties-at-level1.data_size: 80909
rocksdb.aggregated-table-properties-at-level1.filter_size: 0
rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
rocksdb.aggregated-table-properties-at-level1.index_size: 1787
rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
$
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779
Test Plan: Added a test to ldb_test.py
Reviewed By: jay-zhuang
Differential Revision: D25653103
Pulled By: pdillinger
fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
Summary:
The behavior of options.ttl has been updated long ago but we didn't update the code comments.
Also update the periodic compaction's comment.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7775
Test Plan: See it can still build through CI.
Reviewed By: ajkr
Differential Revision: D25592015
fbshipit-source-id: b1db18b6787e7048ce6aedcbc3bb44493c9fc49b
Summary:
Deprecate CalculateNumEntry and replace with
ApproximateNumEntries (better name) using size_t instead of int and
uint32_t, to minimize confusing casts and bad overflow behavior
(possible though probably not realistic). Bloom sizes are now explicitly
capped at max size supported by implementations: just under 4GiB for
fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This
hardening could help to set up for fuzzing.
Also, since RocksDB only uses this information as an approximation
for trying to hit certain sizes for partitioned filters, it's more important
that the function be reasonably fast than for it to be completely
accurate. It's hard enough to be 100% accurate for Ribbon (currently
reversing CalculateSpace) that adding optimize_filters_for_memory
into the mix is just not worth trying to be 100% accurate for num
entries for bytes.
Also:
- Cleaned up filter_policy.h to remove MSVC warning handling and
potentially unsafe use of exception for "not implemented"
- Correct the number of entries limit beyond which current Ribbon
implementation falls back on Bloom instead.
- Consistently use "num_entries" rather than "num_entry"
- Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially
obsolete from general implementation
BuiltinFilterBitsBuilder::CalculateNumEntries.
- Fix filter_bench to skip some tests that don't make sense when only
one or a small number of filters has been generated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726
Test Plan:
expanded existing unit tests for CalculateSpace /
ApproximateNumEntries. Also manually used filter_bench to verify Legacy and
fv=5 Bloom size caps work (much too expensive for unit test). Note that
the actual bits per key is below requested due to space cap.
$ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate
...
Total size (MB): 511.992
Bits/key stored: 16.777
...
$ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000
...
Total size (MB): 4096
Bits/key stored: 17.1799
...
$
Reviewed By: jay-zhuang
Differential Revision: D25239800
Pulled By: pdillinger
fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4
Summary:
This PR has two commits:
1. Modify the code to allow different Lock Managers (of any kind) to be used. It is implied that a LockManager uses its own custom LockTracker.
2. Add definitions for Range Locking (class Endpoint and GetRangeLock() function.
cheng-chang, is this what you've had in mind (should the PR have both item 1 and item 2?)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7443
Reviewed By: zhichao-cao
Differential Revision: D24123172
Pulled By: cheng-chang
fbshipit-source-id: c6548ad6d4cc3c25f68d13b29147bc6fdf357185
Summary:
In the current code base, all the manifest writes with IO error will be set with reason: BackgroundErrorReason::kManifestWrite, which will be mapped to the kHardError if the IO Error is retryable. However, if the system does not use the WAL, all the retryable IO error should be mapped to kSoftError. Create this PR to handle is special case by adding kManifestWriteNoWAL to BackgroundErrorReason.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7693
Test Plan: make check, add new testing cases to error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D25066204
Pulled By: zhichao-cao
fbshipit-source-id: d59553896c2eac3fb37c05238544d2b265379462
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D25015421
Pulled By: jay-zhuang
fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
Summary:
Fix initialization order of DBOptions and kHostnameForDbHostId by making the initialization of the latter static rather than dynamic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7702
Reviewed By: ajkr
Differential Revision: D25111633
Pulled By: anand1976
fbshipit-source-id: 7afad834a66e40bcd8694a43b40d378695212224
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)
Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.
### Benchmarking
```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 18.7508
Random filter net ns/op: 258.246
Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 58.4523
Random filter net ns/op: 363.717
Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```
168.166 / 238.488 = 0.705 -> 29.5% space reduction
130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction)
### Working around a hashing "flaw"
bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate. The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant. Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))
As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)
TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.
### Other related changes
* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658
Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.
Reviewed By: jay-zhuang
Differential Revision: D24899349
Pulled By: pdillinger
fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
Summary:
This patch simply adds a couple of options that will enable users to
configure garbage collection when using the integrated BlobDB
implementation. The actual GC logic will be added in a separate step.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7661
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24906544
Pulled By: ltamasi
fbshipit-source-id: ee0e056a712a4b4475cd90de8b27d969bd61b7e1
Summary:
The Customizable class is an extension of the Configurable class and allows instances to be created by a name/ID. Classes that extend customizable can define their Type (e.g. "TableFactory", "Cache") and a method to instantiate them (TableFactory::CreateFromString). Customizable objects can be registered with the ObjectRegistry and created dynamically.
Future PRs will make more types of objects extend Customizable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6590
Reviewed By: cheng-chang
Differential Revision: D24841553
Pulled By: zhichao-cao
fbshipit-source-id: d0c2132bd932e971cbfe2c908ca2e5db30c5e155
Summary:
There is an undocumented behavior about a certain combination of options and operations.
- inplace_update_support = true, and
- call `SeekForPrev()`, `SeekToLast()`, and/or `Prev()` on unflushed data.
We should stop the backward iteration and report an error of `Status::NotSupported`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7618
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D24769619
Pulled By: riversand963
fbshipit-source-id: 81d199fa55ed4739ab10e719cc345a992238ccbb
Summary:
Existing API `VerifyChecksum()` allows application to verify sst files' block checksums.
Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new
API to verify sst files' file checksums.
```
// Compute table file checksums if applicable and compare with MANIFEST.
// Returns OK if no file has mismatching whole-file checksum.
Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/);
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D24436783
Pulled By: riversand963
fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3
Summary:
A user who extended `Logger` recently pointed out it is unusual to
require they implement the two-argument `Logv()` overload when they've
already implemented the three-argument `Logv()` overload. I agree with
that and think we can fix it by only calling the two-argument overload
from the default implementation of the three-argument overload. Then
when the three-argument overload is overridden, RocksDB would not
rely on the two-argument overload. Only `Logger::LogHeader()` needed
adjustment to achieve this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7605
Reviewed By: riversand963
Differential Revision: D24584749
Pulled By: ajkr
fbshipit-source-id: 9aabe040ac761c4c0dbebc4be046967403ecaf21
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
Summary:
In addition to trace block cache access, we want to support trace queries on MySQL. To achieve that StartTrace and EndTrace need to be added to the stackable_db.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7585
Reviewed By: zhichao-cao
Differential Revision: D24482306
Pulled By: nmjnmjnmj
fbshipit-source-id: de641b4837c64cd33b44b5bebaeae5d1527c8c31
Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D24485420
Pulled By: zhichao-cao
fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
Summary:
Further refinement of the earlier PR. Now the Status is NotFound with a subcode of PathNotFound. Also the existing functions for options parsing/loading are reverted to return InvalidArgument no matter in which way the user-provided arguments are deemed invalid.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7563
Reviewed By: zhichao-cao
Differential Revision: D24422491
Pulled By: ajkr
fbshipit-source-id: ba6b237cd0584d3f925c5ba0d349aeb8c250af67
Summary:
This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479
Test Plan: Add new unit tests
Reviewed By: pdillinger
Differential Revision: D24340671
Pulled By: anand1976
fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2
Summary:
In order to be able to introduce more locking protocols, we need to abstract out the locking subsystem in TransactionDB into a set of interfaces.
PR https://github.com/facebook/rocksdb/pull/7013 introduces interface `LockTracker`. This PR is a follow up to take the first step to abstract out a `LockManager` interface.
Further modifications to the interface may be needed when introducing the first implementation of range lock. But the idea here is to put the range lock implementation based on range tree under the `utilities/transactions/lock/range/range_tree`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7532
Test Plan: point_lock_manager_test
Reviewed By: ajkr
Differential Revision: D24238731
Pulled By: cheng-chang
fbshipit-source-id: 2a9458cd8b3fb008d9529dbc4d3b28c24631f463
Summary:
Update IOTrace operations in stackabledb.h and also trace few
other IO operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7514
Test Plan: make check -j64
Reviewed By: anand1976
Differential Revision: D24151202
Pulled By: akankshamahajan15
fbshipit-source-id: 112cd3d2041f8c6398b7b0ba1a783b8c93224d4a
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520
Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`
Reviewed By: pdillinger
Differential Revision: D24200034
Pulled By: ajkr
fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
Summary:
This option determines whether WALs will be tracked in MANIFEST and verified on recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7275
Test Plan:
db_options_test
options_test
Reviewed By: pdillinger
Differential Revision: D23181418
Pulled By: cheng-chang
fbshipit-source-id: 5dd1cdc166f3dfc1c93c094df4a2f7734e3b4547
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
1. Number of index and filter blocks read from file as part of MultiGet
request per level.
2. Number of data blocks read from file per level.
3. Number of SST files loaded from file system per level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366
Reviewed By: anand1976
Differential Revision: D24127040
Pulled By: akankshamahajan15
fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
Summary:
This exposes to the listener interface whether a compaction was
full or not. Also cleaned up API comment for CompactionJobInfo::stats,
which is not of a nullable type. And since CompactionJob is always
created with non-null CompactionJobStats, removed conditionals on it
being nullptr and instead assert non-null.
TODO later: update C and Java interfaces
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451
Test Plan: updated existing unit tests to check new field, make check
Reviewed By: ltamasi
Differential Revision: D23977796
Pulled By: pdillinger
fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d
Summary:
Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated.
Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467
Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled.
Reviewed By: riversand963
Differential Revision: D24010683
fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c
Summary:
This has been running in production on some key workloads, so
we believe it to be safe and extremely low cost. Nevertheless, I've
added code to ensure that "force_consistency_checks" is mentioned in
any corruption reports so that people know how to disable in case of
false positive corruption reports.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7446
Test Plan:
make check, CI, temporary debug print new message with
./version_builder_test
Reviewed By: ajkr
Differential Revision: D23972101
Pulled By: pdillinger
fbshipit-source-id: 9623e400f3752577c0ecf977e6d0915562cf9968
Summary:
Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data.
By default value is set false to prevent users data to be exposed in the messages.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420
Test Plan:
1. make check -j64
2. Add a new test case
Reviewed By: ajkr
Differential Revision: D23835028
Pulled By: akankshamahajan15
fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1
Summary:
Fix few test cases and add them in ASSERT_STATUS_CHECKED build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7427
Test Plan:
1. ASSERT_STATUS_CHECKED=1 make -j48 check,
2. travis build for ASSERT_STATUS_CHECKED,
3. Without ASSERT_STATUS_CHECKED: make check -j64, CircleCI build and travis build
Reviewed By: pdillinger
Differential Revision: D23909983
Pulled By: akankshamahajan15
fbshipit-source-id: 42d7e4aea972acb9fcddb7ca73fcb82f93272434
Summary:
Add new AppendWithVerify and PositionedAppendWithVerify APIs to Env and FileSystem to bring the data verification information (data checksum information) from upper layer (e.g., WritableFileWriter) to the storage layer. This PR only include the API definition, no functional codes are added to unblock other developers which depend on these APIs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7419
Test Plan: make -j32
Reviewed By: pdillinger
Differential Revision: D23883196
Pulled By: zhichao-cao
fbshipit-source-id: 94676c26bc56144cc32e3661f84f21eccd790411
Summary:
This option is apparently used by some teams within Facebook
(internal ref T75998621)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7431
Test Plan: USE_CLANG=1 make check before (fails) and after
Reviewed By: jay-zhuang
Differential Revision: D23876584
Pulled By: pdillinger
fbshipit-source-id: abb8b67a1f1aac75327944d266e284b2b6727191
Summary:
Two relatively simple functional changes to incremental backup
behavior, integrated with a minor refactoring to reduce code redundancy and
improve error/log message. There are nuances to the impact of these changes,
but I believe they are fundamentally good and generally safe. Those functional
changes:
* Incremental backups no longer read DB table files that are already saved to a
shared part of the backup directory, unless `share_files_with_checksum` is used
with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file
checksums are needed to determine file naming.
* Justification: incremental backups should not need to read the whole DB,
especially without rate limiting. (Although other BackupEngine reads are not
rate limited either, other non-trivial reads are generally limited by a
corresponding write, as in copying files.) Also, the fact that this is not
already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110.
* When considering whether a table file is already backed up in a shared part
of backup directory, BackupEngine would already query the sizes of source (DB)
and pre-existing destination (backup) files. BackupEngine now uses these file
sizes to detect corruption, as at least one of (a) old backup, (b) backup in
progress, or (c) current DB is corrupt if there's a size mismatch.
* Justification: a random related fix that also helps to cover a small hole
in corruption checking uncovered by the other functional change:
* For `share_table_files` without "checksum" (not recommended), the other
change regresses in detecting fundamentally unsafe use of this option
combination: when you might generate different versions of same SST file
number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this
regression is greatly mitigated by the new file size checking. Nevertheless,
almost no reason to use `share_files_with_checksum=false` should remain, and
comments are updated appropriately.
Also, this change renames internal function `CalculateChecksum` to
`ReadFileAndComputeChecksum` to make the performance impact of this function
clear in code reviews.
It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it
cannot be true for a DB with unique file names (like DBImpl). Nevertheless,
I've tried to keep its functionality intact when `true` to minimize risk for
now, despite having no unit tests for which it is true.
Select impact details (much more in unit tests): For
`share_files_with_checksum`, I am confident there is no regression (vs.
pre-6.12) in detecting DB or backup corruption at backup creation time, mostly
because the old design did not leverage this extra checksum computation for
detecting inconsistencies at backup creation time. (With computed checksums in
names, a recently corrupted file just looked like a different file vs. what was
already backed up.)
Even in the hypothetical case of DB session id collision (~100 bits entropy
collision), file size in name and/or our file size check add an extra layer of
protection against false success in creating an accurate new backup. (Unit test
included.)
`DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking
are still able to catch corruptions that `CreateNewBackup` does not. Note that
when custom file checksum support is added to BackupEngine, that will
essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`.
We could add options for `CreateNewBackup` to cover some of what would be
caught by `VerifyBackup` with checksum checking.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413
Test Plan:
Two new unit tests included, both of which fail without these
changes. Although we don't test the I/O improvement directly, we test it
indirectly in DB corruption detection power that was inadvertently unlocked
with new backup file naming PLUS computing current content checksums (now
removed). (I don't think that case of DB corruption detection justifies reading
the whole DB on incremental backup.)
Reviewed By: zhichao-cao
Differential Revision: D23818480
Pulled By: pdillinger
fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac
Summary:
This change reverts BackupEngine to 6.12 state to accommodate a
higher-priority fix that does not easily merge with this custom checksum
support. We intend to reinstate this support soon, by merging a revert
of this change.
For backupable_db_test, I've removed the tests depending on this
feature.
I've also removed relevant HISTORY.md entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411
Test Plan: unit tests
Reviewed By: ajkr
Differential Revision: D23793835
Pulled By: pdillinger
fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310
Test Plan: adding new unit test, pass make check.
Reviewed By: anand1976
Differential Revision: D23710892
Pulled By: zhichao-cao
fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9