Summary:
This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.
I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.
With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.
ajkr please take a look when you have a minute!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014
Differential Revision: D8773253
Pulled By: ajkr
fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
Summary:
Two CI tests never pass because of the environment problem. Delete them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4110
Differential Revision: D8805713
Pulled By: siying
fbshipit-source-id: 6eb4813dc2094ee2045ec8ede7fe8967d546d6e8
Summary:
This test routinely exceeds the FB contbuild test timeout of 10 minutes,
due to the large number of iterations. The large number (mainly due to
100 randomly selected window sizes) does not seem to add any value.
Reduce it to allow the test to finish in < 10 mins.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4117
Differential Revision: D8815646
Pulled By: anand1976
fbshipit-source-id: 260690d24f444767ad93b039dec3ae8b9cdd1843
Summary:
Run the basic range deletion tests against the standard set of
configurations. This testing exposed that files with hash indexes and
partitioned indexes were not handling the case where the file contained
only range deletions--i.e., where the index was empty.
Additionally file a TODO about the fact that range deletions are broken
when allow_mmap_reads = true is set.
/cc ajkr nvanbenschoten
Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021
Differential Revision: D8811860
Pulled By: ajkr
fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343
Summary:
Exposes BackupEngine::CreateNewBackupWithMetadata and BackupInfo metadata to the Java API.
Full disclaimer, I'm not familiar with JNI stuff, so I might have forgotten something (hopefully no memory leaks!). I also tried to find contributing guidelines but didn't see any, but I hope the PR style is consistent with the rest of the code base.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4111
Differential Revision: D8811180
Pulled By: ajkr
fbshipit-source-id: e38b3e396c7574328c2a1a0e55acc8d092b6a569
Summary:
Prior to this PR, there was a race condition between `DBImpl::SetOptions` and `BackupEngine::CreateNewBackup`, as illustrated below.
```
Time thread 1 thread 2
| CreateNewBackup -> GetLiveFiles
| SetOptions -> RenameTempFileToOptionsFile
| SetOptions -> RenameTempFileToOptionsFile
| SetOptions -> RenameTempFileToOptionsFile // unlink oldest OPTIONS file
| copy the oldest OPTIONS // IO error!
V
```
Proposed fix is to check the value of `DBImpl::disable_obsolete_files_deletion_` before calling `DeleteObsoleteOptionsFiles`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4108
Differential Revision: D8796360
Pulled By: riversand963
fbshipit-source-id: 02045317f793ea4c7d4400a5bf333b8502fa3e82
Summary:
Copy data between buffers inside FilePrefetchBuffer only when chunk length is greater than 0. Otherwise AlignedBuffer was accessing memory out of its range causing crashes.
Removing the tracking of buffer length outside of `AlignedBuffer`, i.e. in `FilePrefetchBuffer` and `ReadaheadRandomAccessFile`, will follow in a separate PR, as it is not the root cause of the crash reported in #4051. (`FilePrefetchBuffer` itself has been this way from its inception, and `ReadaheadRandomAccessFile` was updated to add the buffer length at some point).
Comprehensive tests for `FilePrefetchBuffer` also to follow in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4100
Differential Revision: D8792590
Pulled By: sagar0
fbshipit-source-id: 3578f45761cf6884243e767f749db4016ccc93e1
Summary:
Right now slow deletion with ftruncate doesn't work well with checkpoints because it ruin hard linked files in checkpoints. To fix it, check the file has no other hard link before ftruncate it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4093
Differential Revision: D8730360
Pulled By: siying
fbshipit-source-id: 756eea5bce8a87b9a2ea3a5bfa190b2cab6f75df
Summary:
This adds support for recovering WriteUnprepared transactions through the following changes:
- The information in `RecoveredTransaction` is extended so that it can reference multiple batches.
- `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not.
- `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway.
Commit/Rollback of live transactions is still unimplemented and will come later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078
Differential Revision: D8703382
Pulled By: lth
fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a
Summary:
`std::map::at(key)` throws std::out_of_range if key does not exist. Current
code does not handle this. Although this case is unlikely, I feel it's safe to
use `std::map::find`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4098
Differential Revision: D8753865
Pulled By: riversand963
fbshipit-source-id: 9a9ba43badb0fb5e0d24cd87903931fd12f3f8ec
Summary:
This was caught by crash test, and the following is a simple way to reproduce it and verify the fix.
One way to trigger this code path is to use the following configuration:
- Compress SST file
- Enable direct IO and prefetch buffer
- Do NOT use compressed block cache
Closes https://github.com/facebook/rocksdb/pull/4096
Differential Revision: D8742009
Pulled By: riversand963
fbshipit-source-id: f13381078bbb0dce92f60bd313a78ab602bcacd2
Summary:
Increase the size of each shard so that the number of cache hit/miss match
expectation. Otherwise FilterBlockInBlockCache test will fail.
Closes https://github.com/facebook/rocksdb/pull/4090
Differential Revision: D8736158
Pulled By: riversand963
fbshipit-source-id: 5cdbc06b02390389fd5b72a6d251d88949ad3d91
Summary:
Now by default, with NewSstFileManager, checkpoints may be corrupted. Disable this feature to avoid this issue.
Closes https://github.com/facebook/rocksdb/pull/4092
Differential Revision: D8729856
Pulled By: siying
fbshipit-source-id: 914c321d6eaf52d8c5981171322d85dd29088307
Summary:
It seems that compilation has been made stricter about unused args.
Closes https://github.com/facebook/rocksdb/pull/4080
Differential Revision: D8712049
Pulled By: sagar0
fbshipit-source-id: 984af1982638af3568aac1a167f565f4741badee
Summary:
We can have prefetch_index without prefetch_filter but not the other way around. The assert statement is fixed.
Closes https://github.com/facebook/rocksdb/pull/4077
Differential Revision: D8694472
Pulled By: maysamyabandeh
fbshipit-source-id: ccd2804d9d9cdafb1c3e65062c7bc38603e69004
Summary:
Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it.
Closes https://github.com/facebook/rocksdb/pull/4073
Differential Revision: D8686552
Pulled By: maysamyabandeh
fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f
Summary:
clang analyze is giving the following warnings:
> db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null
} else if (meta->smallest.size() > 0) {
^~~~~~~~~~~~~~~~~~~~~
db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta')
meta->marked_for_compaction = sub_compact->builder->NeedCompact();
~~~~
db/version_set.cc:2770:26: warning: Called C++ object pointer is null
uint32_t cf_id = last_writer->cfd->GetID();
^~~~~~~~~~~~~~~~~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/4072
Differential Revision: D8685852
Pulled By: miasantreble
fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085
Summary:
This adds a new WAL marker of type kTypeBeginUnprepareXID.
Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy.
Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not.
Closes https://github.com/facebook/rocksdb/pull/4069
Differential Revision: D8675099
Pulled By: lth
fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749
Summary:
Since the filter data is unaligned, even though we ensure all probes are within a span of `cache_line_size` bytes, those bytes can span two cache lines. In that case I doubt hardware prefetching does a great job considering we don't necessarily access those two cache lines in order. This guess seems correct since adding explicit prefetch instructions reduced filter lookup overhead by 19.4%.
Closes https://github.com/facebook/rocksdb/pull/4068
Differential Revision: D8674189
Pulled By: ajkr
fbshipit-source-id: 747427d9a17900151c17820488e3f7efe06b1871
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance
This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
Closes https://github.com/facebook/rocksdb/pull/3997
Differential Revision: D8653831
Pulled By: anand1976
fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
Summary:
This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Closes https://github.com/facebook/rocksdb/pull/3944
Differential Revision: D8432536
Pulled By: riversand963
fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb
Summary:
Lite does not support readonly DBs.
Closes https://github.com/facebook/rocksdb/pull/4070
Differential Revision: D8677858
Pulled By: maysamyabandeh
fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa
Summary:
Instead of __SANITIZE_ADDRESS__ macro, LLVM uses __has_feature(address_sanitzer) to check if ASAN is enabled for the build. I tested it with MySQL sanitizer build that uses RocksDB as a submodule.
Closes https://github.com/facebook/rocksdb/pull/4066
Reviewed By: riversand963
Differential Revision: D8668941
Pulled By: taewookoh
fbshipit-source-id: af4d1da180c1470d257a228f431eebc61490bc36
Summary:
Remove over-alignment on `StatisticsImpl` whose benefit is vague and causes UBSAN check to fail due to `std::make_shared` not respecting the over-alignment requirement.
Test plan
```
$ make clean && COMPILE_WITH_UBSAN=1 OPT=-g make -j16 ubsan_check
```
Closes https://github.com/facebook/rocksdb/pull/4061
Differential Revision: D8656506
Pulled By: riversand963
fbshipit-source-id: db355ae9c7bdd2c9e9c5e63cabba13d8d82cc5f9
Summary:
with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests.
Closes https://github.com/facebook/rocksdb/pull/4067
Differential Revision: D8669002
Pulled By: miasantreble
fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
Summary:
…ression
For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts
Closes https://github.com/facebook/rocksdb/pull/3985
Reviewed By: riversand963
Differential Revision: D8385911
Pulled By: zhichao-cao
fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053
Differential Revision: D8662546
Pulled By: maysamyabandeh
fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
Summary:
1. I added PingCap's more up-to-date Rust Binding of RocksDB
2. I also added ObjectiveRocks which is a very nice binding for _both_ Swift and Objective-C
Closes https://github.com/facebook/rocksdb/pull/4065
Differential Revision: D8670340
Pulled By: siying
fbshipit-source-id: 3db28bf3a464c3e050df52cc92b19248b7f43944
Summary:
- Summary
Add timestamp into the DeadlockInfo to store the timestamp when deadlock detected on the rocksdb side.
- Testplan:
`make check -j64`
Closes https://github.com/facebook/rocksdb/pull/4060
Differential Revision: D8655380
Pulled By: chouxi
fbshipit-source-id: f58e1aa5e09eb1d1eed0a181d4e2304aaf01efe8
Summary:
Various rearrangements of the cch maths failed or replacing = '\0' with
memset failed to convince the compiler it was nul terminated. So took
the perverse option of changing strncpy to strcpy.
Return null if memory couldn't be allocated.
util/status.cc: In static member function ‘static const char* rocksdb::Status::CopyState(const char*)’:
util/status.cc:28:15: error: ‘char* strncpy(char*, const char*, size_t)’ output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation]
std::strncpy(result, state, cch - 1);
~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
util/status.cc:19:18: note: length computed here
std::strlen(state) + 1; // +1 for the null terminator
~~~~~~~~~~~^~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:645: shared-objects/util/status.o] Error 1
closes#2705
Closes https://github.com/facebook/rocksdb/pull/3870
Differential Revision: D8594114
Pulled By: anand1976
fbshipit-source-id: ab20f3a456a711e4d29144ebe630e4fe3c99ec25
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.
There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955
Differential Revision: D8286688
Pulled By: lth
fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.
/cc ajkr nvanbenschoten
Closes https://github.com/facebook/rocksdb/pull/4016
Differential Revision: D8527575
Pulled By: ajkr
fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
Summary:
Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file.
This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at.
Closes https://github.com/facebook/rocksdb/pull/3899
Differential Revision: D8157459
Pulled By: miasantreble
fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
Summary:
As discussed with thatsafunnyname [here](https://discuss.lgtm.com/t/c-c-lang-missing-for-facebook-rocksdb/1079): this configuration enables C/C++ analysis for RocksDB on LGTM.com.
The initial commit will contain a build command (simple `make`) that previously resulted in a build error. The build log will then be available on LGTM.com for you to investigate (if you like). I'll immediately add a second commit to this PR to correct the build command to `make static_lib`, which worked when I tested it earlier today.
If you like you can also enable automatic code review in pull requests. This will alert you to any new code issues before they actually get merged into `master`. Here's an example of how that works for the AMPHTML project: https://github.com/ampproject/amphtml/pull/13060. You can enable it yourself here: https://lgtm.com/projects/g/facebook/rocksdb/ci/.
I'll also add a badge to your README.md in a separate commit — feel free to remove that from this PR if you don't like it.
(Full disclosure: I'm part of the LGTM.com team 🙂. Ping samlanning)
Closes https://github.com/facebook/rocksdb/pull/4058
Differential Revision: D8648410
Pulled By: ajkr
fbshipit-source-id: 98d55fc19cff1b07268ac8425b63e764806065aa
Summary:
Universal size-amp-triggered compaction was pulling the final sorted run into the compaction without checking whether any of its files are already being compacted. When all compactions are automatic, it is safe since it verifies the second-last sorted run is not already being compacted, which implies the last sorted run is also not being compacted (in automatic compaction multiple sorted runs are always compacted together). But with manual compaction, files in the last sorted run can be compacted independently, so the last sorted run also must be checked.
We were seeing the below assertion failure in `db_stress`. Also the test case included in this PR repros the failure.
```
db_universal_compaction_test: db/compaction.cc:312: void rocksdb::Compaction::MarkFilesBeingCompacted(bool): Assertion `mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted' failed.
Aborted (core dumped)
```
Closes https://github.com/facebook/rocksdb/pull/4055
Differential Revision: D8630094
Pulled By: ajkr
fbshipit-source-id: ac3b30a874678b76e113d4f6c42c1260411b08f8
Summary:
Pinned the alignment of StatisticsData to the cacheline size rather than just extending its size (which could go over two cache lines)if unaligned in allocation.
Avoid compile errors in the process as per individual commit messages.
strengthen static_assert to CACHELINE rather than the highest common multiple.
Closes https://github.com/facebook/rocksdb/pull/4036
Differential Revision: D8582844
Pulled By: yiwu-arbug
fbshipit-source-id: 363c37029f28e6093e06c60b987bca9aa204bc71
Summary:
Previously with is_fifo=true we only evict TTL file. Changing it to also evict non-TTL files from oldest to newest, after exhausted TTL files.
Closes https://github.com/facebook/rocksdb/pull/4049
Differential Revision: D8604597
Pulled By: yiwu-arbug
fbshipit-source-id: bc4209ee27c1528ce4b72833e6f1e1bff80082c1
Summary:
Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`.
In 7103559f49, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain.
Closes https://github.com/facebook/rocksdb/pull/4048
Differential Revision: D8601056
Pulled By: sagar0
fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3
Summary:
Enable readahead for blob DB garbage collection, which should improve GC performance a little bit.
Closes https://github.com/facebook/rocksdb/pull/3648
Differential Revision: D7383791
Pulled By: yiwu-arbug
fbshipit-source-id: 642b3327f7105eca85986d3fb2d8f960a3d83cf1
Summary:
Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037
Differential Revision: D8596218
Pulled By: maysamyabandeh
fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2