Summary:
The `TransactionTest.MultiGetBatchedTest` were failing with unprepared batches because we were not using the correct callbacks. Override MultiGet to pass down the correct ReadCallback. A similar problem is also fixed in WritePrepared.
This PR also fixes an issue similar to (https://github.com/facebook/rocksdb/pull/5147), but for MultiGet instead of Get.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5634
Differential Revision: D16552674
Pulled By: lth
fbshipit-source-id: 736eaf8e919c6b13d5f5655b1c0d36b57ad04804
Summary:
This PR optimizes the hybrid row-block cache simulator. If a Get request hits the cache, we treat all its future accesses as hits.
Consider a Get request (no snapshot) accesses multiple files, e.g, file1, file2, file3. We construct the row key as "fdnumber_key_0". Before this PR, if it hits the cache when searching the key in file1, we continue to process its accesses in file2 and file3 which is unnecessary.
With this PR, if "file1_key_0" is in the cache, we treat all future accesses of this Get request as hits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5616
Differential Revision: D16453187
Pulled By: HaoyuHuang
fbshipit-source-id: 56f3169cc322322305baaf5543226a0824fae19f
Summary:
The ssize_t type was introduced in https://github.com/facebook/rocksdb/pull/5633, but it seems like it's a POSIX specific type.
I just need a signed type to represent number of bytes, so use int64_t instead. It seems like we have a typedef from SSIZE_T for Windows, but it doesn't seem like we ever include "port/port.h" in our public header files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5638
Differential Revision: D16526269
Pulled By: lth
fbshipit-source-id: 8d3a5c41003951b74b29bc5f1d949b2b22da0cee
Summary:
This test frequently times out under TSAN; reducing the number of random
iterations to make it complete faster.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5635
Test Plan: buck test mode/dev-tsan internal_repo_rocksdb/repo:compact_on_deletion_collector_test
Differential Revision: D16523505
Pulled By: ltamasi
fbshipit-source-id: 6a69909bce9d204c891150fcb3d536547b3253d0
Summary:
Instead of reusing `TransactionOptions::max_write_batch_size` for determining when to flush a write batch for write unprepared, add a new variable called `write_batch_flush_threshold` for this use case instead.
Also add `TransactionDBOptions::default_write_batch_flush_threshold` which sets the default value if `TransactionOptions::write_batch_flush_threshold` is unspecified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5633
Differential Revision: D16520364
Pulled By: lth
fbshipit-source-id: d75ae5a2141ce7708982d5069dc3f0b58d250e8c
Summary:
Transaction::RollbackToSavePoint undos the modification made since the SavePoint beginning, and also unlocks the corresponding keys, which are tracked in the last SavePoint. Currently ::PopSavePoint simply discard these tracked keys, leaving them locked in the lock manager. This breaks a subsequent ::RollbackToSavePoint behavior as it loses track of such keys, and thus cannot unlock them. The patch fixes ::PopSavePoint by passing on the track key information to the previous SavePoint.
Fixes https://github.com/facebook/rocksdb/issues/5618
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5628
Differential Revision: D16505325
Pulled By: lth
fbshipit-source-id: 2bc3b30963ab4d36d996d1f66543c93abf358980
Summary:
Simplify WriteUnpreparedTxnReadCallback so we just have one function `CalcMaxVisibleSeq`. Also, there's no need for the read callback to hold onto the transaction any more, so just hold the set of unprep_seqs, reducing about of indirection in `IsVisibleFullCheck`.
Also, some comments about using transaction snapshot were out of date, so remove them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5621
Differential Revision: D16459883
Pulled By: lth
fbshipit-source-id: cd581323fd18982e817d99af57b6eaba59e599bb
Summary:
The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method).
This change is necessary for a few reasons:
- By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
- By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not.
When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.
Test plan (on riversand963's devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
```
All tests pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293
Differential Revision: D16363396
Pulled By: riversand963
fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
Summary:
There are a number of fixes in this PR (with most bugs found via the added stress tests):
1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snaphsot validation would not be enforced, breaking some assumptions in Prev() iteration.
3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
4. Add stress test to test reading own writes.
5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
7. Add some more comments everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573
Differential Revision: D16276089
Pulled By: lth
fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
Summary:
- Compute correlation between a few features and predictions, e.g., number of accesses since the last access vs number of accesses till the next access on a block.
- Output human readable trace file so python can consume it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5596
Test Plan: make clean && USE_CLANG=1 make check -j32
Differential Revision: D16373200
Pulled By: HaoyuHuang
fbshipit-source-id: c848d26bc2e9210461f317d7dbee42d55be5a0cc
Summary:
This PR traces the referenced key for Get for all types of blocks. This is useful when evaluating hybrid row-block caches.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5548
Test Plan: make clean && USE_CLANG=1 make check -j32
Differential Revision: D16157979
Pulled By: HaoyuHuang
fbshipit-source-id: f6327411c9deb74e35e22a35f66cdbae09ab9d87
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135
This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"
We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.
(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
// is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
const std::string& export_dir,
ExportImportFilesMetaData** metadata);
Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.
(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
// an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
const ColumnFamilyOptions& options, const std::string& column_family_name,
const ImportColumnFamilyOptions& import_options,
const ExportImportFilesMetaData& metadata,
ColumnFamilyHandle** handle);
Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.
If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.
Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495
Differential Revision: D16018881
fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
Summary:
Currently, we are tracking keys we need to rollback via a separate structure specific to WriteUnprepared in write_set_keys_.
We already have a data structure called tracked_keys_ used to track which keys to unlock on transaction termination. This is exactly what we want, since we should only rollback keys that we have locked anyway.
Save some memory by reusing that data structure instead of making our own.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5562
Differential Revision: D16206484
Pulled By: lth
fbshipit-source-id: 5894d2b824a4b19062d84adbd6e6e86f00047488
Summary:
- Provide assignment operator in CompactionStats
- Provide a copy constructor for FileDescriptor
- Remove std::move from "return std::move(t)" in BoundedQueue
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5553
Differential Revision: D16230170
fbshipit-source-id: fd7c6e52390b2db1be24141e25649cf62424d078
Summary:
This PR adds a ghost cache for admission control. Specifically, it admits an entry on its second access.
It also adds a hybrid row-block cache that caches the referenced key-value pairs of a Get/MultiGet request instead of its blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5534
Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32
Differential Revision: D16101124
Pulled By: HaoyuHuang
fbshipit-source-id: b99edda6418a888e94eb40f71ece45d375e234b1
Summary:
Previously `GetAllKeyVersions()` supports default column family only. This PR add support for other column families.
Test plan (devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test
$./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions
```
All other unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544
Differential Revision: D16147551
Pulled By: riversand963
fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168
Summary:
1. Cleanup WAL trash files on open
2. Don't apply deletion rate limit if WAL dir is different from db dir
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5520
Test Plan: Add new unit tests and make check
Differential Revision: D16096750
Pulled By: anand1976
fbshipit-source-id: 6f07858ad864b754b711db416f0389c45ede599b
Summary:
This PR creates cache_simulator.h file. It contains a CacheSimulator that runs against a block cache trace record. We can add alternative cache simulators derived from CacheSimulator later. For example, this PR adds a PrioritizedCacheSimulator that inserts filter/index/uncompressed dictionary blocks with high priority.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5517
Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32
Differential Revision: D16043689
Pulled By: HaoyuHuang
fbshipit-source-id: 65f28ed52b866ffb0e6eceffd7f9ca7c45bb680d
Summary:
~DBWithTTLImpl() fails after calling Close() function (will invoke the
Close() function of DBImpl), because the Close() function deletes
default_cf_handle_ which is used in the GetOptions() function called
in ~DBWithTTLImpl(), hence lead to segfault.
Fix by creating a Close() function for the DBWithTTLImpl class and do
the close and the work originally in ~DBWithTTLImpl(). If the Close()
function is not called, it will be called in the ~DBWithTTLImpl()
function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5485
Test Plan: make clean; USE_CLANG=1 make all check -j
Differential Revision: D15924498
fbshipit-source-id: 567397fb972961059083a1ae0f9f99ff74872b78
Summary:
This PR adds a BlockCacheTraceSimulator that reports the miss ratios given different cache configurations. A cache configuration contains "cache_name,num_shard_bits,cache_capacities". For example, "lru, 1, 1K, 2K, 4M, 4G".
When we replay the trace, we also perform lookups and inserts on the simulated caches.
In the end, it reports the miss ratio for each tuple <cache_name, num_shard_bits, cache_capacity> in a output file.
This PR also adds a main source block_cache_trace_analyzer so that we can run the analyzer in command line.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5449
Test Plan:
Added tests for block_cache_trace_analyzer.
COMPILE_WITH_ASAN=1 make check -j32.
Differential Revision: D15797073
Pulled By: HaoyuHuang
fbshipit-source-id: aef0c5c2e7938f3e8b6a10d4a6a50e6928ecf408
Summary:
CLANG would complain if we pass const to lambda function and appveyor complains if we don't (https://github.com/facebook/rocksdb/pull/5443). The patch fixes that by using the default capture mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5447
Differential Revision: D15788722
Pulled By: maysamyabandeh
fbshipit-source-id: 47e7f49264afe31fdafe42cb8bf93da126abfca9
Summary:
CLANG complains that passing const to thread is not necessary. The patch removes it form PreparedHeap::Concurrent test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5443
Differential Revision: D15781598
Pulled By: maysamyabandeh
fbshipit-source-id: 3aceb05d96182fa4726d6d37eed45fd3aac4c016
Summary:
Internally PreparedHeap is currently using a priority_queue. The rationale was the in the initial design PreparedHeap::AddPrepared could be called in arbitrary order. With the recent optimizations, we call ::AddPrepared only from the main write queue, which results into in-order insertion into PreparedHeap. The patch thus replaces the underlying priority_queue with a more efficient deque implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5436
Differential Revision: D15752147
Pulled By: maysamyabandeh
fbshipit-source-id: e6960f2b2097e13137dded1ceeff3b10b03b0aeb
Summary:
This is a port of this PR into WriteUnprepared:
https://github.com/facebook/rocksdb/pull/5014
This also reverts this test change to restore some flaky write unprepared
tests: https://github.com/facebook/rocksdb/pull/5315
Tested with:
$ gtest-parallel ./transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 --repeat=128
[128/128] MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 (18250 ms)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5439
Differential Revision: D15761405
Pulled By: lth
fbshipit-source-id: ae2581fd942d8a5b3f9278fd6bc3c1ac0b2c964c
Summary:
If a memtable definitely covers a key, there isn't a need to check older memtables.
We can skip them by checking the earliest sequence number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941
Differential Revision: D13932666
fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b
Summary:
The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop()
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420
Differential Revision: D15741985
Pulled By: maysamyabandeh
fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402
Differential Revision: D15701195
Pulled By: miasantreble
fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
Summary:
Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368
Differential Revision: D15540101
Pulled By: maysamyabandeh
fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4
Summary:
util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376
Differential Revision: D15550938
Pulled By: siying
fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387
Differential Revision: D15579036
Pulled By: siying
fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
Summary:
Right now, with auto roll logger, options.keep_log_file_num enforcement is triggered by events like DB reopen or full obsolete scan happens. In the mean time, the size and number of log files can grow without a limit. We put a stronger enforcement to the option, so that the number of log files can always under control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370
Differential Revision: D15570413
Pulled By: siying
fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375
Differential Revision: D15550935
Pulled By: siying
fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330
Differential Revision: D15525932
Pulled By: maysamyabandeh
fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
Summary:
When committing a transaction without prepare, WritePrepared simply writes the batch to db and add the commit entry to CommitCache. When two_write_queues=true, following the rule of committing only from 2nd write queue, the first write, writes the batch and the only thing the 2nd write does is to write the commit entry to CommitCache. Currently the write batch in 2nd write is set to an empty LogData entry, while the write to the WAL could simply be entirely disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5327
Differential Revision: D15424546
Pulled By: maysamyabandeh
fbshipit-source-id: 3d9ea3922d5196984c584d62a3ed57e1f7ca7b9f
Summary:
WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes.
The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313
Differential Revision: D15379977
Pulled By: maysamyabandeh
fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
Summary:
They are kind of flaky at the moment. Will re-enable it when flakiness is fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5315
Differential Revision: D15382744
Pulled By: maysamyabandeh
fbshipit-source-id: 8b2f9d81a4bb34bfd51481727a682d5cd063c5e3
Summary:
The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306
Differential Revision: D15341458
Pulled By: maysamyabandeh
fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.
Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
Differential Revision: D15219029
Pulled By: maysamyabandeh
fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
Summary:
MultiGet batching was implemented in #5011 in order to reduce CPU utilization when looking up multiple keys at once. This PR implements corresponding ```MultiGet``` and ```MultiGetSingleCFForUpdate``` in ```rocksdb::Transaction``` that call the underlying batching implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5210
Differential Revision: D15048164
Pulled By: anand1976
fbshipit-source-id: c52f6043102ab0cbc723f4cba2a7b7d1767f6f52
Summary:
The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.
My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
Differential Revision: D14953553
Pulled By: siying
fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points` using an std::stack.
Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
allocated `SavePoints` struct. The destructor of `WriteBatch` simply
deletes this pointer. However, the copy constructor of WriteBatch
just copied that pointer, meaning that copying a WriteBatch with
active savepoints will very likely have crashed before. Now a proper
copy of the savepoints is made in the copy constructor, and not just
a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
the underlying container. A deque is a bit over the top here, as we
only need access to the most recent savepoint (i.e. stack.top()) but
never any elements at the front. std::deque is rather expensive to
initialize in common environments. For example, the STL implementation
shipped with GNU g++ will perform a heap allocation of more than 500
bytes to create an empty deque object. Although the `save_points_`
container is created lazily by RocksDB, moving from a deque to a plain
`std::vector` is much more memory-efficient. So `save_points_` is now
a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
making ownership more explicit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192
Differential Revision: D15024074
Pulled By: maysamyabandeh
fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138
Differential Revision: D14940106
Pulled By: miasantreble
fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
Summary:
This branch contains two small improvements:
* Create `LockMap` entries using `std::make_shared`. This saves one heap allocation per LockMap entry but also locates the control block and the LockMap object closely together in memory, which can help with caching
* Reorder the members of `TrackedTrxInfo`, so that the resulting struct uses less memory (at least on 64bit systems)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5193
Differential Revision: D14934536
Pulled By: maysamyabandeh
fbshipit-source-id: f7b49812bb4b6029eef9d131e7cd56260df5b28e
Summary:
In `PessimisticTransaction::TryLock`, we were calling `TrackKey` even when assume_tracked=true, which defeats the purpose of assume_tracked. Remove this.
For keys that are already tracked, TrackKey will actually bump some counters (num_reads/num_writes) which are consumed in `TransactionBaseImpl::GetTrackedKeysSinceSavePoint`, and this is used to determine which keys were tracked since the last savepoint. I believe this functionality should still work, since I think the user should not call GetForUpdate/Put(assume_tracked=true) across savepoints, and if they do, they should not expect the Put(assume_tracked=true) to show up as a tracked key in the second savepoint.
This is another 2-3% cpu improvement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5173
Differential Revision: D14883809
Pulled By: lth
fbshipit-source-id: 7d09f0772da422384af0519773e310c22b0cbca3
Summary:
When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete.
The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache.
A unit test is added to reproduce the race condition with duplicate keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147
Differential Revision: D14758815
Pulled By: maysamyabandeh
fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
Summary:
The LockInfo struct is not easy to copy because it contains std::vector. Reduce copies by using move constructor and `unordered_map::emplace`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5172
Differential Revision: D14882053
Pulled By: lth
fbshipit-source-id: 93999ec6ab1a5841fb5115abb764b6c1831a6de1
Summary:
Create new function NPHash64() and GetSliceNPHash64(), which are currently
implemented using murmurhash.
Replace the current direct call of murmurhash() to use the new functions
if the hash results are not used in on-disk format.
This will make it easier to try out or switch to alternative functions
in the uses where data format compatibility doesn't need to be considered.
This part shouldn't have any performance impact.
Also, the sharded cache hash function is changed to the new format, because
it falls into this categoery. It doesn't show visible performance impact
in db_bench results. CPU showed by perf is increased from about 0.2% to 0.4%
in an extreme benchmark setting (4KB blocks, no-compression, everything
cached in block cache). We've known that the current hash function used,
our own Hash() has serious hash quality problem. It can generate a lots of
conflicts with similar input. In this use case, it means extra lock contention
for reads from the same file. This slight CPU regression is worthy to me
to counter the potential bad performance with hot keys. And hopefully this
will get further improved in the future with a better hash function.
cache_test's condition is relaxed a little bit to. The new hash is slightly
more skewed in this use case, but I manually checked the data and see
the hash results are still in a reasonable range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155
Differential Revision: D14834821
Pulled By: siying
fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5
Summary:
Annotate all of the logging functions to inform the compiler that these
use printf-style formatting arguments. This allows the compiler to emit
warnings if the format arguments are incorrect.
This also fixes many problems reported now that format string checking
is enabled. Many of these are simply mix-ups in the argument type (e.g,
int vs uint64_t), but in several cases the wrong number of arguments
were being passed in which can cause the code to crash.
The primary motivation for this was to fix the log message in
`DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
format parameter with no argument supplied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089
Differential Revision: D14574795
Pulled By: simpkins
fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
Summary:
Ubsna complains that in initialization of WriteUnpreparedTxnReadCallback the method of the child class is used before the parent class is constructed. The patch fixes that by making the aforementioned method static.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5148
Differential Revision: D14760098
Pulled By: maysamyabandeh
fbshipit-source-id: cf19b7c1fdb5de0a54e62c1deebe09a0fa048ded
Summary:
In prepare phase of 2PC, the db promises to remember the prepared data, for possible future commits. To fulfill the promise the prepared data must be persisted in the WAL so that they could be recovered after a crash. The log that contains a prepare batch that is not committed yet, is marked so that it is not garbage collected before the transaction commits/rollbacks. The bug was that the write to the log file and the mark of the file was not atomic, and WAL gc could have happened before the WAL log is actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads would see the WAL write and WAL mark atomically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121
Differential Revision: D14665210
Pulled By: maysamyabandeh
fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.
The following benchmark shows +0.26% higher throughput in seekrandom benchmark.
Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG 10 runs] : 20355 ops/sec; 225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec; 225.9 MB/sec
./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG 10 runs] : 20409 ops/sec; 225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec; 226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049
Differential Revision: D14366459
Pulled By: maysamyabandeh
fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
Summary:
Compaction would depend on max_evicted_seq_ value. The ::Initialize method should do that after max_evicted_seq_ is properly initialized. The patch also back ports #4853 from WritePrepared txn to WriteUnPrepared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5128
Differential Revision: D14686562
Pulled By: maysamyabandeh
fbshipit-source-id: b2355025712a72676ac3b20a95258adcf4774490
Summary:
WAL files are currently not subject to deletion rate limiting by DeleteScheduler. If the size of the WAL files is significant, this can cause a high delete rate on SSDs that may affect other operations. To fix it, force WAL file deletions to go through the SstFileManager. Original PR for this is #2768
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5116
Differential Revision: D14669437
Pulled By: anand1976
fbshipit-source-id: c5f62d0640cebaa1574de841a1d01e4ce2faadf0
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113
Differential Revision: D14633030
Pulled By: siying
fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
Summary:
Right now, BlobDB::Open() fails to put all trash files to delete scheduler,
which causes some trash files permanently untracked.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5103
Differential Revision: D14606095
Pulled By: siying
fbshipit-source-id: 41a9437a2948abb235c0ed85f9a04612d0e50183
Summary:
[RocksDB] Make it easier for users to load options from option file and set shared block cache.
Right now, it requires several dynamic casting for users to set the shared block cache to their option struct cast from the option file.
If people don't do that, every CF of every DB will generate its own 8MB block cache. It's not a usable setting. So we are dragging every user who loads options from the file into such a mess.
Instead, we should allow them to pass their cache object to LoadLatestOptions() and LoadOptionsFromFile(), so that those loaded option structs will have the shared block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5063
Differential Revision: D14518584
Pulled By: rashmishrm
fbshipit-source-id: c91430ff9425a0e67d76fc67931d755f491ca5aa
Summary:
This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
1. lz4 or snappy for fast compression
2. zstd or zlib for slow but higher compression.
The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.
The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842
Differential Revision: D13629011
Pulled By: shobhitdayal
fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
Summary:
Initialize magic_number to zero to avoid such failure.
utilities/blob_db/blob_log_format.cc:91:3: error: 'magic_number' may be used
uninitialized in this function [-Werror=maybe-uninitialized]
if (magic_number != kMagicNumber) {
^~
Signed-off-by: He Zhe <zhe.he@windriver.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5074
Differential Revision: D14505514
Pulled By: miasantreble
fbshipit-source-id: 4334462958c2b9c5a7c68c6ab24dadf94ad70902
Summary:
Add a mutex to the test to synchronize before accessing the shared txn object.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5052
Differential Revision: D14386861
Pulled By: maysamyabandeh
fbshipit-source-id: 5b32e209840b210c35af53848dc77f489a76c95a
Summary:
The patch fixes an improbable race condition between AddPrepared from one write queue and AdvanceMaxEvictedSeq from another queue. In this scenario AddPrepared finds prepare_seq lower than max and adding to PrepareHeap as usual while AdvanceMaxEvictedSeq has finished checking PrepareHeap against the future max. Thus when AdvanceMaxEvictedSeq finishes off by updating the max_evicted_seq_, PrepareHeap ends up with a prepared_seq lower than it which breaks the PrepareHeap contract. The fix is that in AddPrepared we check against the future_max_evicted_seq_ instead, which is update before AdvanceMaxEvictedSeq acquire prepare_mutex_ and looks into PrepareHeap.
A unit test added to test for the failure scenario. The code is also refactored a bit to remove the duplicate code between AdvanceMaxEvictedSeq and AddPrepared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5025
Differential Revision: D14249028
Pulled By: maysamyabandeh
fbshipit-source-id: 072ea56663f40359662c05fafa6ac524417b0622
Summary:
The patch adds the sequence number of the rollback patch to the PrepareHeap when two_write_queues is enabled. Although the current behavior is still correct, the change simplifies reasoning about the code, by having all uncommitted batches registered with the PreparedHeap.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5026
Differential Revision: D14249401
Pulled By: maysamyabandeh
fbshipit-source-id: 1e3424edee5cd14e56ee35931ad3c93ed997cd5a
Summary:
When two_write_queues is enabled we call ::AddPrepared only from the main queue, which writes to both WAL and memtable, and call ::AddCommitted from the 2nd queue, which writes only to WAL. This simplifies the logic by avoiding concurrency between AddPrepared and also between AddCommitted. The patch fixes one case that did not conform with the rule above. This would allow future refactoring. For example AdvaneMaxEvictedSeq, which is invoked by AddCommitted, can be simplified by assuming lack of concurrent calls to it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5014
Differential Revision: D14210493
Pulled By: maysamyabandeh
fbshipit-source-id: 6db5ba372a294a568a14caa010576460917a4eab
Summary:
Statistics cost too much CPU for some use cases. Add two stats levels
so that people can choose to skip two types of expensive stats, timers and
histograms.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027
Differential Revision: D14252765
Pulled By: siying
fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
Summary:
The read path includes a callback function, ReadCallback, which would eventually calls IsInSnapshot to figure if a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function call to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of majority of the cases. The patch moves those to a non-virtual function in the the parent class, ReadCallback, to lower the virtual callback cost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018
Differential Revision: D14226562
Pulled By: maysamyabandeh
fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44
Summary:
When closing a BlobDB, it only waits for background tasks
to finish as the last thing, but the background task may access
some variables that are destroyed. The fix is to introduce a
shutdown function in the timer queue and call the function as
the first thing when destorying BlobDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5005
Differential Revision: D14170342
Pulled By: siying
fbshipit-source-id: 081e6a2d99b9765d5956cf6cdfc290c07270c233
Summary:
Currently the transaction stress tests use thread id as the seed. Since the thread ids are likely to be the same across multiple runs, the seed is thus going to be the same. The patch includes time in calculating the seed to help covering a very different part of state space in each run of the stress tests. To be able to reproduce the bug in case the stress tests failed, it also prints out the time that was used to calculate the seed value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5004
Differential Revision: D14144356
Pulled By: maysamyabandeh
fbshipit-source-id: 728ed522f550fc8b4f5f9f373259c05fe9a54556
Summary:
The transaction stress tests, stress a high concurrency scenario. In WritePrepared/WriteUnPrepared we need to also stress the scenarios where an inserting/reading transaction is very slow. This would stress the corner cases that the caching is not sufficient and other slower data structures are engaged. To emulate such cases we make use of slow inserter/verifier threads and also reduce the size of cache data structures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4974
Differential Revision: D14143070
Pulled By: maysamyabandeh
fbshipit-source-id: 81eb674678faf9fae0f654cd60ebcc74e26aeee7
Summary:
max_evicted_seq_ could be updated in the middle of the read in ::IsInSnapshot. The code to be correct in presence of this update would be complicated. The patch simplifies it by checking the value of max_evicted_seq_ before and after looking into commit_cache_ and retries in the unlucky case that it was changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4955
Differential Revision: D13999556
Pulled By: maysamyabandeh
fbshipit-source-id: 7a1bdfa95ea8b5d8d73ddff3263ed31d7297b39c
Summary:
as title. For people who continue to need Lua compaction filter, you
can copy the include/rocksdb/utilities/rocks_lua/lua_compaction_filter.h and
utilities/lua/rocks_lua_compaction_filter.cc to your own codebase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4971
Differential Revision: D14047468
Pulled By: riversand963
fbshipit-source-id: 9ad1a6484a7c94e478f1e108127a3184e4069f70
Summary:
Enhance ::Insert and ::Verify test functions to add artificial delay between prepare and commit, and take snapshot and reads respectively. A future PR will make use of these to improve stress tests to test against long-running transactions as well as long-running backup jobs. Also randomly sets set_snapshot to false for inserters to skip setting the snapshot in the initialization phase and let the snapshot be taken later explicitly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4970
Differential Revision: D14031342
Pulled By: maysamyabandeh
fbshipit-source-id: b52b453751f0b25b81b23c48892bc1d152464cab
Summary:
WritePreparedTransactionDB operates with more options which should not be configurable to avoid complicating it for the users. For testing purposes however we need to change the default value of this parameters. This patch makes these parameters private fields in TransactionDBOptions so that the existing ::Open API could use them seamlessly without however exposing them to the users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4966
Differential Revision: D14015986
Pulled By: maysamyabandeh
fbshipit-source-id: 13037efa7dfdd6f73ec7a19414b66571e044c633
Summary:
ValidateSnapshot checks if another txn has committed a value to about-to-be-locked key since a particular snapshot. It applies an optimization of looking into only the memtable if snapshot seq is larger than the earliest seq in the memtables. With a long-running txn in WritePrepared, the prepared value might be flushed out to the disk and yet it commits after the snapshot, which breaks this optimization. The patch fixes that by disabling this optimization when the min_uncomitted seq at the time the snapshot was taken is lower than earliest seq in the memtables.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4961
Differential Revision: D14009947
Pulled By: maysamyabandeh
fbshipit-source-id: 1d11679950326f7c4094b433e6b821b729f08850
Summary:
Commit of delayed prepared has two non-atomic steps: add to commit cache, remove from delayed_prepared_. Similarly in ::IsInSnapshot we read from commit cache first and then look into delayed_prepared_. Due to non-atomicity thus the reader might not find the
prep_seq that is just committed neither in commit cache nor in delayed_prepared_. To fix that i)
we check if there was any delayed prepared BEFORE looking into commit
cache, ii) if there was, we complete the search steps to be these: i)
commit cache, ii) delayed prepared, commit cache again. In this way if
the first query to commit cache missed the commit, the 2nd will catch it. The cost of the redundant read from commit cache is paid only if delayed_prepared_ is nonempty which should be a very rare scenario.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4947
Differential Revision: D13952754
Pulled By: maysamyabandeh
fbshipit-source-id: 8f47826b13f8ce154398d842028342423f4ca2b2
Summary:
WritePrepared maintains a list of snapshots that are <= max_evicted_seq_. Based on this list, old_commit_map_ is updated if an evicted commit entry overlaps with such snapshot. Such lists are garbage collected when the release of snapshot is reported to WritePreparedTxnDB, which is the next time max_evicted_seq_ is updated and yet the snapshot is not found is the list returned from DB. This logic was broken since ReleaseSnapshotInternal was using "< max_evicted_seq_" to cleanup old_commit_map_, which would leave a snapshot uncleaned if it "= max_evicted_seq_". The patch fixes that and adds a unit test to check for the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4944
Differential Revision: D13945000
Pulled By: maysamyabandeh
fbshipit-source-id: 0c904294f735911f52348a148bf1f945282fc17c
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889
Differential Revision: D13701276
Pulled By: zinoale
fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
Summary:
Blob DB files are not tracked by the SFM, so they currently don't get
deleted in the background. Force them to be deleted in background so
rate limiting can be applied
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4928
Differential Revision: D13854649
Pulled By: anand1976
fbshipit-source-id: 8031ce66842ff0af440c715d886b377983dad7d8
Summary:
Right now, deleting blob files is not rate limited, even if SstFileManger is specified.
On the other hand, rate limiting blob deletion is not supported. With this change, Blob file
deletion will go through SstFileManager too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4904
Differential Revision: D13772545
Pulled By: siying
fbshipit-source-id: bd1b1d0beb26d5167385e00b7ecb8b94b879de84
Summary:
Remove unused blob WAL filter so that users are not confused.
I was initially under the impression that we have WAL Filter support in BlobDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4896
Differential Revision: D13725709
Pulled By: sagar0
fbshipit-source-id: f997d7546e138a474036e88b957907cc714327f1
Summary:
This is essentially a re-submission of #4251 with a few improvements:
- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849
Differential Revision: D13606039
Pulled By: ajkr
fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
Summary:
Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios:
Scenario 1:
key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1.
The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key.
Scenario 2:
key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here.
The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890
Differential Revision: D13705538
Pulled By: maysamyabandeh
fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4
Summary:
Here is the order of ops in a commit: 1) update commit cache 2) publish seq, 3) RemovePrepared. In case of a delayed prepared, there will be a gap between when the commit is visible to snapshots until delayed_prepared_ is cleaned up. To tell apart this case from a delayed uncommitted txn from, the commit entry of a delayed prepared is also stored in delayed_prepared_commits_, which is updated before publishing the commit.
Also logic in GetSnapshotInternal that ensures that each new snapshot is always larger than max_evicted_seq_ is updated to check against the upcoming value of max_evicted_seq_ rather than its current one. This is because AdvanceMaxEvictedSeq gets the list of snapshots lower than the new max, before updating max_evicted_seq_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4894
Differential Revision: D13726988
Pulled By: maysamyabandeh
fbshipit-source-id: 1e70d78061b50c944c9816bf4b6dac405ab4ccd3
Summary:
Remove `garbage_collection_deletion_size_threshold` as it is not used anywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4888
Differential Revision: D13685982
Pulled By: sagar0
fbshipit-source-id: e08d3017b9a0c8fa99bc332b595ee4ed9db70c87
Summary:
Compaction iterator keep a copy of list of live snapshots at the beginning of compaction, and then query snapshot checker to verify if values of a sequence number is visible to these snapshots. However when the snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info with the snapshot and may report incorrect result, which lead to values being compacted out when it shouldn't. This patch conservatively keep the values if snapshot checker determines that the snapshots is released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858
Differential Revision: D13617146
Pulled By: maysamyabandeh
fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3
Summary:
With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grab the list of snapshots at its beginning, and only consider those snapshots to dedup keys. Consider the case:
```
seq = 1: put "foo" = "bar"
seq = 2: transaction T: delete "foo", prepare
seq = 3: compaction start
seq = 4: take snapshot S
seq = 5: transaction T: commit.
...
seq = N: compaction iterator reached key "foo".
```
When compaction start, the list of snapshot is empty. Compaction doesn't take snapshot S into account. When it reached "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible by any snapshot (which is wrong), and compact the value out.
The fix is to explicitly take a snapshot before compaction grabbing the list of snapshots. Compaction will then has to keep keys visible to this snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883
Differential Revision: D13668775
Pulled By: maysamyabandeh
fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4