Summary:
* Add missing unit test for schema stability of FileChecksumGenCrc32c
(previously was only comparing to itself)
* A lot of clarifying comments
* Add some assertions for preconditions
* Rename WritableFileWriter::CalculateFileChecksum -> UpdateFileChecksum
* Simplify FileChecksumGenCrc32c with shared functions
* Implement EndianSwapValue to replace unused EndianTransform
And incidentally since I had trouble with 'make check-format' GitHub action disagreeing with local run,
* Output full diagnostic information when 'make check-format' fails in CI
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6861
Test Plan: new unit test passes before & after other changes
Reviewed By: zhichao-cao
Differential Revision: D21667115
Pulled By: pdillinger
fbshipit-source-id: 6a99970f87605aa024fa540c78cd519ff322c3e6
Summary:
In level compaction, if the total size (even if compensated after taking account of the deletions) of a level hasn't exceeded the limit, but there are lots of deletion entries in some SST files of the level, these files should also be good candidates for compaction. Otherwise, queries for the deleted keys might be slow because they need to go over all the tombstones.
This PR adds an option `deletion_ratio` to the factory of `CompactOnDeletionCollector` to configure it to trigger compaction when the ratio of tombstones >= `deletion_ratio`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6806
Test Plan:
Added new unit test in `compact_on_deletion_collector_test.cc`.
make compact_on_deletion_collector_test && ./compact_on_deletion_collector_test
Reviewed By: ajkr
Differential Revision: D21511981
Pulled By: cheng-chang
fbshipit-source-id: 65a9d0150e8c9c00337787686475252e4535a3e1
Summary:
When using ldb, users cannot turn on force consistency check in most commands, while they cannot use checksonsistnecy with --try_load_options. The change fixes both by:
1. checkconsistency now calls OpenDB() so that it gets all the options loading and sanitized options logic
2. use options.check_consistency_checks = true by default, and add a --disable_consistency_checks to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802
Test Plan: Add a new unit test. Some manual tests with corrupted DBs.
Reviewed By: pdillinger
Differential Revision: D21388051
fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
Summary:
In current commit protocol of pessimistic transaction, if the transaction is not prepared before commit, the commit protocol implicitly assumes that the user wants to commit without prepare.
This PR adds TransactionOptions::skip_prepare, the default value is `true` because if set to `false`, all existing users who commit without prepare need to update their code to set skip_prepare to true. Although this does not force the user to explicitly express their intention of skip_prepare, it at least lets the user be aware of the assumption of being able to commit without prepare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6775
Test Plan: added a new unit test TransactionTest::CommitWithoutPrepare
Reviewed By: lth
Differential Revision: D21313270
Pulled By: cheng-chang
fbshipit-source-id: 3d95b7c9b2d6cdddc09bdd66c561bc4fae8c3251
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389
Reviewed By: siying
Differential Revision: D21163707
Pulled By: zhichao-cao
fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
Summary:
Log it in the info log to help in troubleshooting. It is logged as follows -
```
2020/04/10-10:51:39.886662 7ffff7fef340 Options.table_properties_collectors: CompactOnDeletionCollector (Sliding window size = 100 Deletion trigger = 90);
```
Tests:
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6686
Reviewed By: ltamasi
Differential Revision: D21002442
Pulled By: anand1976
fbshipit-source-id: 7adf0dbae7f1febcb00ce61fea5097118ede5c6a
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.
This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602
Test Plan: make check
Reviewed By: siying, zhichao-cao
Differential Revision: D20683216
Pulled By: cheng-chang
fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
Summary:
Add sequence_number_ptr to the checkpoint interface to expose the sequence number during taking the checkpoint. The number will be consistent with the seq # in rocksdb log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5528
Test Plan: make check -j64
Reviewed By: Winger1994
Differential Revision: D16080209
fbshipit-source-id: 6dc3c7680287ee97d673c5e61f89aae1f43e33df
Summary:
In most places in the code the variable names are spelled correctly as
COMMITTED but in a couple places not. This fixes them and ensures the
variable is always called COMMITTED everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481
Differential Revision: D20306776
Pulled By: pdillinger
fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
This is a continuation of https://github.com/facebook/rocksdb/pull/5320/files
I open a new mr for these purposes, half a year has past since the old mr is posted so it's almost impossible to fulfill some points below on the old mr, especially 5)
1) add validation modes for optimistic txns
2) modify unittests to test both modes
3) make format
4) refine hash functor
5) push to master
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6240
Differential Revision: D19301296
fbshipit-source-id: 5b5b3cbd39558f43947f7d2dec6cd31a06386edb
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
Currently the default txn write policy in crash tests is WRITE_PREPARED. The patch randomly picks the write policy at the start of the crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6158
Test Plan:
```
make -j32 crash_test_with_txn
```
Differential Revision: D18946307
Pulled By: maysamyabandeh
fbshipit-source-id: f77d7a94f99a08791ef9626da153d284bf521950
Summary:
To reflect changes in PR https://github.com/facebook/rocksdb/issues/6072
This comment also implies that a seemingly valid use-case for
max_valid_backups_to_open is flawed: even if you only want to add a new
backup without trying to delete, you might need to clean up after a
backup creation that never finished. To clean up properly requires
opening all backups to get proper ref counts on shared files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6105
Test Plan: code comment only
Differential Revision: D18736716
Pulled By: pdillinger
fbshipit-source-id: 2447c0000eefe3a4ca606926bfe922a8456b0cb7
Summary:
Production:
* Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory.
* Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories.
* Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023
Test Plan:
* Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.)
* Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false.
* Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected.
Differential Revision: D18453559
Pulled By: pdillinger
fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4
Summary:
Only if there is a crash, power failure, or I/O error in
DeleteBackup, shared or private files from the backup might be left
behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only
by GarbageCollect. This makes the BackupEngine API "leaky by default."
Even if it means a modest performance hit, I think we should make
Delete and Purge do as they say, with ongoing best effort: i.e. future
calls will attempt to finish any incomplete work from earlier calls.
This change does that by having DeleteBackup and PurgeOldBackups do a
GarbageCollect, unless (to minimize performance hit) this BackupEngine
has already done a GarbageCollect and there have been no
deletion-related I/O errors in that GarbageCollect or since then.
Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files.
Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015
Test Plan: Updated unit tests
Differential Revision: D18401333
Pulled By: pdillinger
fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925
Summary:
According to
https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format,
the block read by BlockBasedTable::ReadMetaBlock is actually the meta index
block. Therefore, it is better to rename the function to ReadMetaIndexBlock.
This PR also applies some format change to existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6009
Test Plan: make check
Differential Revision: D18333238
Pulled By: riversand963
fbshipit-source-id: 2c4340a29b3edba53d19c132cbfd04caf6242aed
Summary:
For MDEV-19670: MyRocks: key lookups into deleted data are very slow
BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_
walk above the iterate_upper_bound if base_iterator_ is not valid
anymore.
== Rationale ==
The most straightforward way would be to make the delta_iterator
(which is a rocksdb::WBWIIterator) to support iterator bounds. But
checking for bounds has an extra CPU overhead.
So we put the check into BaseDeltaIterator, and only make it when
base_iterator_ is not valid.
(note: We could take it even further, and move the check a few lines
down, and only check iterator bounds ourselves if base_iterator_ is
not valid AND delta_iterator_ hit a tombstone).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403
Differential Revision: D15863092
Pulled By: maysamyabandeh
fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b
Summary:
Adding a new API to db.h that allows users to get file_creation_time of the oldest file in the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5948
Test Plan: Added unit test.
Differential Revision: D18056151
Pulled By: vjnadimpalli
fbshipit-source-id: 448ec9d34cb6772e1e5a62db399ace00dcbfbb5d
Summary:
This PR allows for the creation of custom env when using sst_dump. If
the user does not set options.env or set options.env to nullptr, then sst_dump
will automatically try to create a custom env depending on the path to the sst
file or db directory. In order to use this feature, the user must call
ObjectRegistry::Register() beforehand.
Test Plan (on devserver):
```
$make all && make check
```
All tests must pass to ensure this change does not break anything.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5845
Differential Revision: D17678038
Pulled By: riversand963
fbshipit-source-id: 58ecb4b3f75246d52b07c4c924a63ee61c1ee626
Summary:
as title.
Test Plan (on devserver):
```
$make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5855
Differential Revision: D17615125
Pulled By: riversand963
fbshipit-source-id: bd6ed8cf59eafff41f0d1fc044f39e8f3573172a
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830
Test Plan: Run all existing tests.
Differential Revision: D17488031
fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
Summary:
Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971
Test Plan: add tests at more spots.
Differential Revision: D17369043
fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c
Summary:
Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009
Differential Revision: D17288733
fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
Summary:
Adding a light weight API to get last live WAL file name and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.
Specifically within MySQL's backup/restore mechanism, this call can be made with a write lock on the mysql db to get a transactionally consistent snapshot of the current WAL file position along with other non-rocksdb log/data files.
Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765
Differential Revision: D17172717
Pulled By: affandar
fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
Differential Revision: D14394062
Pulled By: miasantreble
fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
Summary:
Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713
Test Plan: Add a new unit test.
Differential Revision: D16860874
fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
Summary:
Add a command in ldb so that users can print out tombstones in SST files.
In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't return from the program, but return the status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615
Test Plan: Add a new unit test
Differential Revision: D16550326
fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
Summary:
The changes transaction_test to set `txn_db_options.default_write_batch_flush_threshold = 1` in order to give better test coverage for WriteUnprepared.
As part of the change, some tests had to be updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5658
Differential Revision: D16740468
Pulled By: lth
fbshipit-source-id: 3821eec20baf13917c8c1fab444332f75a509de9
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
const ReadOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, PinnableSlice* merge_operands,
GetMergeOperandsOptions* get_merge_operands_options,
int* number_of_operands)
Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);
Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
Differential Revision: D16657366
Pulled By: vjnadimpalli
fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
Summary:
Add savepoint support when the current transaction has flushed unprepared batches.
Rolling back to savepoint is similar to rolling back a transaction. It requires the set of keys that have changed since the savepoint, re-reading the keys at the snapshot at that savepoint, and the restoring the old keys by writing out another unprepared batch.
For this strategy to work though, we must be capable of reading keys at a savepoint. This does not work if keys were written out using the same sequence number before and after a savepoint. Therefore, when we flush out unprepared batches, we must split the batch by savepoint if any savepoints exist.
eg. If we have the following:
```
Put(A)
Put(B)
Put(C)
SetSavePoint()
Put(D)
Put(E)
SetSavePoint()
Put(F)
```
Then we will write out 3 separate unprepared batches:
```
Put(A) 1
Put(B) 1
Put(C) 1
Put(D) 2
Put(E) 2
Put(F) 3
```
This is so that when we rollback to eg. the first savepoint, we can just read keys at snapshot_seq = 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5627
Differential Revision: D16584130
Pulled By: lth
fbshipit-source-id: 6d100dd548fb20c4b76661bd0f8a2647e64477fa
Summary:
The ssize_t type was introduced in https://github.com/facebook/rocksdb/pull/5633, but it seems like it's a POSIX specific type.
I just need a signed type to represent number of bytes, so use int64_t instead. It seems like we have a typedef from SSIZE_T for Windows, but it doesn't seem like we ever include "port/port.h" in our public header files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5638
Differential Revision: D16526269
Pulled By: lth
fbshipit-source-id: 8d3a5c41003951b74b29bc5f1d949b2b22da0cee
Summary:
Instead of reusing `TransactionOptions::max_write_batch_size` for determining when to flush a write batch for write unprepared, add a new variable called `write_batch_flush_threshold` for this use case instead.
Also add `TransactionDBOptions::default_write_batch_flush_threshold` which sets the default value if `TransactionOptions::write_batch_flush_threshold` is unspecified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5633
Differential Revision: D16520364
Pulled By: lth
fbshipit-source-id: d75ae5a2141ce7708982d5069dc3f0b58d250e8c
Summary:
The new DB::GetApproximateSizes with SizeApproximationOptions argument, which allows to add more options/knobs to the DB::GetApproximateSizes call (beyond only the include_flags)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5626
Differential Revision: D16496913
Pulled By: elipoz
fbshipit-source-id: ee8c6c182330a285fa056ecfc3905a592b451720
Summary:
The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method).
This change is necessary for a few reasons:
- By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
- By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not.
When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.
Test plan (on riversand963's devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
```
All tests pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293
Differential Revision: D16363396
Pulled By: riversand963
fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135
This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"
We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.
(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
// is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
const std::string& export_dir,
ExportImportFilesMetaData** metadata);
Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.
(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
// an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
const ColumnFamilyOptions& options, const std::string& column_family_name,
const ImportColumnFamilyOptions& import_options,
const ExportImportFilesMetaData& metadata,
ColumnFamilyHandle** handle);
Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.
If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.
Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495
Differential Revision: D16018881
fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
Summary:
Right now ldb can open running DB through read-only DB. However, it might leave info logs files to the read-only DB directory. Add an option to open the DB as secondary to avoid it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5537
Test Plan:
Run
./ldb scan --max_keys=10 --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp --no_value --hex
and
./ldb get 0x00000000000000103030303030303030 --hex --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp
against a normal db_bench run and observe the output changes. Also observe that no new info logs files are created under /tmp/rocksdbtest-2491/dbbench.
Run without --secondary_path and observe that new info logs created under /tmp/rocksdbtest-2491/dbbench.
Differential Revision: D16113886
fbshipit-source-id: 4e09dec47c2528f6ca08a9e7a7894ba2d9daebbb
Summary:
Previously `GetAllKeyVersions()` supports default column family only. This PR add support for other column families.
Test plan (devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test
$./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions
```
All other unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544
Differential Revision: D16147551
Pulled By: riversand963
fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168
Summary:
This PR adds a BlockCacheTraceSimulator that reports the miss ratios given different cache configurations. A cache configuration contains "cache_name,num_shard_bits,cache_capacities". For example, "lru, 1, 1K, 2K, 4M, 4G".
When we replay the trace, we also perform lookups and inserts on the simulated caches.
In the end, it reports the miss ratio for each tuple <cache_name, num_shard_bits, cache_capacity> in a output file.
This PR also adds a main source block_cache_trace_analyzer so that we can run the analyzer in command line.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5449
Test Plan:
Added tests for block_cache_trace_analyzer.
COMPILE_WITH_ASAN=1 make check -j32.
Differential Revision: D15797073
Pulled By: HaoyuHuang
fbshipit-source-id: aef0c5c2e7938f3e8b6a10d4a6a50e6928ecf408
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433
Differential Revision: D15728016
Pulled By: HaoyuHuang
fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402
Differential Revision: D15701195
Pulled By: miasantreble
fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330
Differential Revision: D15525932
Pulled By: maysamyabandeh
fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
Summary:
- By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object.
- Currently factory functions for `Env` registered with object registry should only return pointer to static `Env` objects. That's because `DBOptions::env` is a raw pointer so we cannot easily delegate cleanup.
- Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237
Differential Revision: D15056360
Pulled By: siying
fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e
Summary:
MultiGet batching was implemented in #5011 in order to reduce CPU utilization when looking up multiple keys at once. This PR implements corresponding ```MultiGet``` and ```MultiGetSingleCFForUpdate``` in ```rocksdb::Transaction``` that call the underlying batching implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5210
Differential Revision: D15048164
Pulled By: anand1976
fbshipit-source-id: c52f6043102ab0cbc723f4cba2a7b7d1767f6f52
Summary:
In `PessimisticTransaction::TryLock`, we were calling `TrackKey` even when assume_tracked=true, which defeats the purpose of assume_tracked. Remove this.
For keys that are already tracked, TrackKey will actually bump some counters (num_reads/num_writes) which are consumed in `TransactionBaseImpl::GetTrackedKeysSinceSavePoint`, and this is used to determine which keys were tracked since the last savepoint. I believe this functionality should still work, since I think the user should not call GetForUpdate/Put(assume_tracked=true) across savepoints, and if they do, they should not expect the Put(assume_tracked=true) to show up as a tracked key in the second savepoint.
This is another 2-3% cpu improvement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5173
Differential Revision: D14883809
Pulled By: lth
fbshipit-source-id: 7d09f0772da422384af0519773e310c22b0cbca3
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
Summary:
Expose DB methods to lock and unlock the WAL.
These methods are intended to use by MyRocks in order to obtain WAL
coordinates in consistent way.
Usage scenario is following:
MySQL has performance_schema.log_status which provides information that
enables a backup tool to copy the required log files without locking for
the duration of copy. To populate this table MySQL does following:
1. Lock the binary log. Transactions are not allowed to commit now
2. Save the binary log coordinates
3. Walk through the storage engines and lock writes on each engine. For
InnoDB, redo log is locked. For MyRocks, WAL should be locked.
4. Ask storage engines for their coordinates. InnoDB reports its current
LSN and checkpoint LSN. MyRocks should report active WAL files names
and sizes.
5. Release storage engine's locks
6. Unlock binary log
Backup tool will then use this information to copy InnoDB, RocksDB and
MySQL binary logs up to specified positions to end up with consistent DB
state after restore.
Currently, RocksDB allows to obtain the list of WAL files. Only missing
bit is the method to lock the writes to WAL files.
LockWAL method must flush the WAL in order for the reported size to be
accurate (GetSortedWALFiles is using file system stat call to return the
file size), also, since backup tool is going to copy the WAL, it is
better to be flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5146
Differential Revision: D14815447
Pulled By: maysamyabandeh
fbshipit-source-id: eec9535a6025229ed471119f19fe7b3d8ae888a3