Summary:
1. Remove redundant text.
2. Make terminology consistent across all comments and doc of RocksDB. Also do
our best to conform to conventions. Specifically, use 'callback' instead of
'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)).
Closes https://github.com/facebook/rocksdb/pull/3693
Differential Revision: D7560396
Pulled By: riversand963
fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8
Summary:
In this case, we add input files of compaction, not outputs.
Closes https://github.com/facebook/rocksdb/pull/3686
Differential Revision: D7556781
Pulled By: ajkr
fbshipit-source-id: ae135bb6eda60db8f275a9ba2d21c18aaadef5b7
Summary:
- inflate the argument passed as `max_compact_bytes_per_del_file` by a bit (10%). The intent of this argument is prevent L0 files from being intra-L0 compacted multiple times. Without compression, some intra-L0 compactions exceed this limit (and thus aren't executed), even though none of their files have gone through intra-L0 before.
- fix `FindIntraL0Compaction` as it was rejecting some valid intra-L0 compactions. In particular, `compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len), whereas `new_compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len+1). The former is more correct for checking whether we've found an eligible span.
Closes https://github.com/facebook/rocksdb/pull/3684
Differential Revision: D7530396
Pulled By: ajkr
fbshipit-source-id: cad4f50902bdc428ac9ff6fffb13eb288648d85e
Summary:
Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex.
Closes https://github.com/facebook/rocksdb/pull/3683
Differential Revision: D7529393
Pulled By: maysamyabandeh
fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102
Differential Revision: D6951697
Pulled By: ajkr
fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
Summary:
Primitive types constness does not affect the signature of the
method and has no influence on whether the overriding method would
actually have that const bool instead of just bool. In addition,
it is rarely useful but does produce a compatibility warnings
in VS 2015 compiler.
Closes https://github.com/facebook/rocksdb/pull/3663
Differential Revision: D7475739
Pulled By: ajkr
fbshipit-source-id: fb275378b5acc397399420ae6abb4b6bfe5bd32f
Summary:
currently rocksdb lite build fails due to the following errors:
> db/db_sst_test.cc:29:51: error: ‘FlushJobInfo’ does not name a type
virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
^
db/db_sst_test.cc:29:16: error: ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’ marked ‘override’, but does not override
virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
^
db/db_sst_test.cc:24:7: error: ‘class rocksdb::FlushedFileCollector’ has virtual functions and accessible non-virtual destructor [-Werror=non-virtual-dtor]
class FlushedFileCollector : public EventListener {
^
db/db_sst_test.cc: In member function ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’:
db/db_sst_test.cc:31:35: error: request for member ‘file_path’ in ‘info’, which is of non-class type ‘const int’
flushed_files_.push_back(info.file_path);
^
cc1plus: all warnings being treated as errors
make: *** [db/db_sst_test.o] Error 1
Closes https://github.com/facebook/rocksdb/pull/3676
Differential Revision: D7493006
Pulled By: miasantreble
fbshipit-source-id: 77dff0a5b23e27db51be9b9798e3744e6fdec64f
Summary:
Ttl-triggered and snapshot-release-triggered compactions should not be considered as manual compactions. This is a bug.
Closes https://github.com/facebook/rocksdb/pull/3678
Differential Revision: D7498151
Pulled By: sagar0
fbshipit-source-id: a2d5bed05268a4dc93d54ea97a9ae44b366df15d
Summary:
Level Compaction with TTL.
As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
This functionality can be controlled by the newly introduced column family option -- ttl.
TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591
Differential Revision: D7275442
Pulled By: sagar0
fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
Summary:
The is an optimization to reduce lookup in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time that the snapshot was taken and if the sequence number of the read data is lower than that number it assumes the data as committed.
To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
Closes https://github.com/facebook/rocksdb/pull/3649
Differential Revision: D7388630
Pulled By: maysamyabandeh
fbshipit-source-id: b79506238c17467d590763582960d4d90181c600
Summary:
Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true.
Closes https://github.com/facebook/rocksdb/pull/3670
Differential Revision: D7457683
Pulled By: amytai
fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5
Summary:
This patch record the deleted WAL numbers in the manifest to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Closes https://github.com/facebook/rocksdb/pull/3488
Differential Revision: D6967893
Pulled By: maysamyabandeh
fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
Summary:
When using two_write_queue, the published seq and the last allocated sequence could be ahead of the LastSequence, even if both write queues are stopped as in WriteRecoverableState. The patch fixes a bug in WriteRecoverableState in which LastSequence was used as a reference but the result was applied to last fetched sequence and last published seq.
Closes https://github.com/facebook/rocksdb/pull/3665
Differential Revision: D7446099
Pulled By: maysamyabandeh
fbshipit-source-id: 1449bed9aed8e9db6af85946efd347cb8efd3c0b
Summary:
Currently if the CommitTimeWriteBatch is set to be used only as a state that is required only for recovery , the user cannot see that in DB until it is restarted. This while the state is already inserted into the DB after the memtable flush. It would be useful for debugging if make this state visible to the user after the flush by committing it. The patch does it by a invoking a callback that does the commit on the recoverable state.
Closes https://github.com/facebook/rocksdb/pull/3661
Differential Revision: D7424577
Pulled By: maysamyabandeh
fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c
Summary:
Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files, but this PR targets sst.
Also add a unit test to verify that this PR fixes the issue.
The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows
```
timeline user_thread background_compaction thread
t1 | FindObsoleteFiles(full_scan=false)
t2 | FindObsoleteFiles(full_scan=true)
t3 | PurgeObsoleteFiles
t4 | PurgeObsoleteFiles
V
```
When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`.
To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files.
ajkr could you take a look and comment? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3638
Differential Revision: D7384372
Pulled By: riversand963
fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3
Summary:
Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if two_write_queues_ option is set. The patch fixes the problem by i) skip log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, ii) protects log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not.
Fixes#3599
Closes https://github.com/facebook/rocksdb/pull/3656
Differential Revision: D7405608
Pulled By: maysamyabandeh
fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934
Summary:
Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches.
Closes https://github.com/facebook/rocksdb/pull/3651
Differential Revision: D7388635
Pulled By: maysamyabandeh
fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4
Summary:
RangeDelAggregator will remember the files whose range tombstones have been added,
so the caller can check whether the file has been added before call AddTombstones.
Closes https://github.com/facebook/rocksdb/pull/3635
Differential Revision: D7354604
Pulled By: ajkr
fbshipit-source-id: 9b9f7ec130556028df417e650711554b46d8d107
Summary:
Summary
========
`InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that.
Rough performance comparison
=====
Big keys, no compression.
```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom : 4.222 micros/op 236836 ops/sec; 80.4 MB/s
```
```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom : 4.064 micros/op 246059 ops/sec; 83.5 MB/s
```
TODO
======
In ~~a separated~~ this PR:
- [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely.
- [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key.
Closes https://github.com/facebook/rocksdb/pull/3516
Differential Revision: D7059300
Pulled By: ajkr
fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb
Summary:
Fsync after writing global sequence number to the ingestion file in ExternalSstFileIngestionJob. Otherwise the file metadata could be incorrect.
Closes https://github.com/facebook/rocksdb/pull/3644
Differential Revision: D7373813
Pulled By: sagar0
fbshipit-source-id: 4da2c9e71a8beb5c08b4ac955f288ee1576358b8
Summary:
It was misnamed. It actually updates `bg_error_` if `PreprocessWrite()` or `WriteToWAL()` fail, not related to the user callback.
Closes https://github.com/facebook/rocksdb/pull/3485
Differential Revision: D6955787
Pulled By: ajkr
fbshipit-source-id: bd7afc3fdb7a52830c021cbfc25fcbc3ab7d5e10
Summary:
This commit fixes a race condition on calling SetLastPublishedSequence. The function must be called only from the 2nd write queue when two_write_queues is enabled. However there was a bug that would also call it from the main write queue if CommitTimeWriteBatch is provided to the commit request and yet use_only_the_last_commit_time_batch_for_recovery optimization is not enabled. To fix that we penalize the commit request in such cases by doing an additional write solely to publish the seq number from the 2nd queue.
Closes https://github.com/facebook/rocksdb/pull/3641
Differential Revision: D7361508
Pulled By: maysamyabandeh
fbshipit-source-id: bf8f7a27e5cccf5425dccbce25eb0032e8e5a4d7
Summary:
This pull request exposes the interface of PerfContext as C API
Closes https://github.com/facebook/rocksdb/pull/3607
Differential Revision: D7294225
Pulled By: ajkr
fbshipit-source-id: eddcfbc13538f379950b2c8b299486695ffb5e2c
Summary:
When destorying column family handle after the column family has been deleted, the handle may hold share pointers of some objects in ColumnFamilyOptions, but in the destructor, the destructing order may cause some of the objects to be destoryed before being used by the following steps. Fix it by making a copy of the option object and destory it as the last step.
Closes https://github.com/facebook/rocksdb/pull/3610
Differential Revision: D7281025
Pulled By: siying
fbshipit-source-id: ac18f3b2841788cba4ccfa1abd8d59158c1113bc
Summary:
Previously, the compaction in `DBCompactionTestWithParam.ForceBottommostLevelCompaction` generated multiple files in no-compression use case, andone file in compression use case. I increased `target_file_size_base` so it generates one file in both use cases.
Closes https://github.com/facebook/rocksdb/pull/3625
Differential Revision: D7311885
Pulled By: ajkr
fbshipit-source-id: 97f249fa83a9924ac34357a4bb3189c969ecb107
Summary:
If there are a lot of overlapped files in L0, creating a merging iterator for
all files in L0 to check overlap can be very slow because we need to read and
seek all files in L0. However, in that case, the ingested file is likely to
overlap with some files in L0, so if we check those files one by one, we can stop
once we encounter overlap.
Ref: https://github.com/facebook/rocksdb/issues/3540
Closes https://github.com/facebook/rocksdb/pull/3564
Differential Revision: D7196784
Pulled By: anand1976
fbshipit-source-id: 8700c1e903bd515d0fa7005b6ce9b3a3d9db2d67
Summary:
This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction.
Closes https://github.com/facebook/rocksdb/pull/3608
Differential Revision: D7275789
Pulled By: siying
fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba
Summary:
I landed #3544 which made this test flaky. The reason was the files scheduled for deletion sometimes went through the trash-marking process, and sometimes were deleted directly. Our counter only bumped on the former code path, so if the latter code path was used, we'd miss counting a file deleted by deletion scheduler. This PR also bumps the counter in the latter code path.
Closes https://github.com/facebook/rocksdb/pull/3593
Differential Revision: D7226173
Pulled By: yiwu-arbug
fbshipit-source-id: 81ab44c60834df6ff88db1d73ea34e26c6e93c39
Summary:
Added a stat that counts the number of cancelled compactions.
Closes https://github.com/facebook/rocksdb/pull/3574
Differential Revision: D7190259
Pulled By: amytai
fbshipit-source-id: d5ce82dc9398da6d6d34023ad4ed8cec909852a3
Summary:
The CRC is actually calculated based on the record type and payload.
The wiki should also be updated accordingly and extended with a section on the recyclable record format.
Closes https://github.com/facebook/rocksdb/pull/3576
Differential Revision: D7196478
Pulled By: siying
fbshipit-source-id: 39f7a0395075cc73e2aa2bfc9e42c85bce35e765
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.
Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449
Differential Revision: D7016941
Pulled By: amytai
fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
Summary:
This is the simplest way I could think of to speed up `CompactRange`. It works but isn't that optimal because it relies on the same `max_compaction_bytes` and `max_subcompactions` options that are used in other places. If it turns out to be useful we can allow overriding these in `CompactRangeOptions` in the future.
Closes https://github.com/facebook/rocksdb/pull/3549
Differential Revision: D7117634
Pulled By: ajkr
fbshipit-source-id: d0cd03d6bd0d2fd7ea3fb13cd3b8bf7c47d11e42
Summary:
Now that files scheduled for deletion are kept in the same directory, we don't need to constrain deletion scheduler to `db_paths[0]`. Previously this was done because there was a separate trash directory, and this constraint prevented files from being accidentally copied to another filesystem when they're scheduled for deletion.
Closes https://github.com/facebook/rocksdb/pull/3544
Differential Revision: D7093786
Pulled By: ajkr
fbshipit-source-id: 202f5c92d925eafebec1281fb95bb5828d33414f
Summary:
In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported.
This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types.
Also up for discussion: is iOS worth supporting? Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience.
Closes https://github.com/facebook/rocksdb/pull/3503
Differential Revision: D7106457
Pulled By: gfosco
fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e
Summary:
This patch addressed several issues.
Portability including db_test std::thread -> port::Thread Cc: @
and %z to ROCKSDB portable macro. Cc: maysamyabandeh
Implement Env::AreFilesSame
Make the implementation of file unique number more robust
Get rid of C-runtime and go directly to Windows API when dealing
with file primitives.
Implement GetSectorSize() and aling unbuffered read on the value if
available.
Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976
Fix test running script issue where $status var was of incorrect scope
so the failures were swallowed and not reported.
DestroyDB() creates a logger and opens a LOG file in the directory
being cleaned up. This holds a lock on the folder and the cleanup is
prevented. This fails one of the checkpoin tests. We observe the same in production.
We close the log file in this change.
Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
attempts to open a directory with NewRandomAccessFile which does not
work on Windows.
Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552
Differential Revision: D7156304
Pulled By: siying
fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556
Differential Revision: D7139328
Pulled By: yiwu-arbug
fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
Summary:
Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test.
Closes https://github.com/facebook/rocksdb/pull/3567
Differential Revision: D7163268
Pulled By: maysamyabandeh
fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778
Summary:
Fix the following bugs:
- During recovery a duplicate key was inserted twice into the write batch of the recovery transaction,
once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing.
- If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number.
- SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected.
Closes https://github.com/facebook/rocksdb/pull/3562
Differential Revision: D7149440
Pulled By: maysamyabandeh
fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77
Summary:
[FB - Internal]
MergeOperatorPinningTest.Randomized/x tests are frequently failing with timeouts when run with tsan, as they are exceeding 10 minute limit for tests. The tests are in turn getting disabled due to frequent failures.
I halved the number of rounds to make the test complete sooner. This reduces the number of testing iterations a little, but it still is much better than totally letting the test be disabled.
Closes https://github.com/facebook/rocksdb/pull/3523
Differential Revision: D7031498
Pulled By: sagar0
fbshipit-source-id: 9a694f2176b235259920a42bf24bca5346f7cff1
Summary:
Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.
CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551
Differential Revision: D7130190
Pulled By: yiwu-arbug
fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
Summary:
The zeroed entries were not removed from prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes that by refactoring the code. The new code is more efficient since i) it uses two separate mutex to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining uniq log entires with prepare which avoids a very large heap with many duplicate entries.
Closes https://github.com/facebook/rocksdb/pull/3545
Differential Revision: D7106071
Pulled By: maysamyabandeh
fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548
Differential Revision: D7116939
Pulled By: yiwu-arbug
fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.
- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520
Differential Revision: D7018897
Pulled By: ajkr
fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
Summary:
Before:
> $ TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32
DB path: [/dev/shm/dbbench]
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
db_bench: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
put error: IO error: While open a file for random read: /dev/shm/dbbench/000007.sst: Invalid argument
After:
> TEST_TMPDIR=/dev/shm ./db_bench -use_direct_reads=true -benchmarks=readrandomwriterandom -num=10000000 -reads=100000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -readwritepercent=50 -key_size=16 -value_size=48 -threads=32
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
open error: Not implemented: Direct I/O is not supported by the specified DB.
Closes https://github.com/facebook/rocksdb/pull/3539
Differential Revision: D7082658
Pulled By: miasantreble
fbshipit-source-id: f9d9c6ec3b5e9e049cab52154940ee101ba4d342
Summary:
The recent Logger::Close() and DBImpl::Close() implementation rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528
Reviewed By: gfosco
Differential Revision: D7049303
Pulled By: anand1976
fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
Summary:
Some sanitizer is not happy with parameter name with ROCKSDB_JEMALLOC not set. Use another function instead.
Closes https://github.com/facebook/rocksdb/pull/3536
Differential Revision: D7064849
Pulled By: siying
fbshipit-source-id: c6ae94e044686176af1259df9172453d52c2f9d5
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes https://github.com/facebook/rocksdb/pull/3525
Differential Revision: D7033694
Pulled By: sagar0
fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
Summary:
These are optimization that we applied to improve sysbech's update_noindex performance.
1. Make use of LIKELY compiler hint
2. Move std::atomic so the subclass
3. Make use of skip_prepared in non-2pc transactions.
Closes https://github.com/facebook/rocksdb/pull/3512
Differential Revision: D7000075
Pulled By: maysamyabandeh
fbshipit-source-id: 1ab8292584df1f6305a4992973fb1b7933632181
Summary:
Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_.
This deadlock is hit all the time on our workload. It blocks our release.
In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so.
So, probably the only safe way to use ThreadLocalPtr callbacks is to do only do simple and lock-free things in them.
This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup.
I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine.
Closes https://github.com/facebook/rocksdb/pull/3510
Reviewed By: sagar0
Differential Revision: D7005346
Pulled By: al13n321
fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f
Summary:
The MemTableRep API was broken by this commit: 813719e952
This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it.
Closes https://github.com/facebook/rocksdb/pull/3513
Differential Revision: D7004134
Pulled By: maysamyabandeh
fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplifed an assertion
- move non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507
Differential Revision: D7004679
Pulled By: sagar0
fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
Summary:
Right now it is possible that a file gets assigned to L0 but also assigned the seqno from a higher level which it doesn't fit
Under the current impl, it is possibe that seqno in lower levels (Ln) can be equal to smallest seqno of higher levels (Ln-1), which is undesirable from universal compaction's point of view.
This should fix the intermittent failure of `ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno`
Closes https://github.com/facebook/rocksdb/pull/3411
Differential Revision: D6813802
Pulled By: miasantreble
fbshipit-source-id: 693d0462fa94725ccfb9d8858743e6d2d9992d14
Summary:
This fixes shift and signed-integer-overflow UBSAN checks in fault_injection_test by using a larger and unsigned type.
Closes https://github.com/facebook/rocksdb/pull/3498
Reviewed By: siying
Differential Revision: D6981116
Pulled By: igorsugak
fbshipit-source-id: 3688f62cce570534b161e9b5f42109ebc9ae5a2c
Summary:
A new LevelIterator was recently created. Rename the old one to make unity build happy. It's also not a good idea to have two classes in the same name anyway.
Closes https://github.com/facebook/rocksdb/pull/3499
Differential Revision: D6979325
Pulled By: siying
fbshipit-source-id: 3a032d93fe205650a08e92e5262594731ec726bb
Summary:
Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization.
The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators.
The two-level-iterator now is only used by partitioned index, so it is simplified.
Closes https://github.com/facebook/rocksdb/pull/3406
Differential Revision: D6809041
Pulled By: siying
fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381
Differential Revision: D6754983
Pulled By: ajkr
fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
Summary:
Right now, users will encounter unexpected bahavior if they use key or value larger than 4GB. We should explicitly fail the queriers.
Closes https://github.com/facebook/rocksdb/pull/3484
Differential Revision: D6953895
Pulled By: siying
fbshipit-source-id: b60491e1af064fc5d52971956661f6c18ceac24f
Summary:
CompactionIterator invoke MergeHelper::MergeUntil() to do partial merge between snapshot boundaries. Previously it only depend on sequence number to tell snapshot boundary, but we also need to make use of snapshot_checker to verify visibility of the merge operands to the snapshots. For example, say there is a snapshot with seq = 2 but only can see data with seq <= 1. There are three merges, each with seq = 1, 2, 3. A correct compaction output would be (1),(2+3). Without taking snapshot_checker into account when generating merge result, compaction will generate output (1+2),(3).
By filtering uncommitted keys with read callback, the read path already take care of merges well and don't need additional updates.
Closes https://github.com/facebook/rocksdb/pull/3475
Differential Revision: D6926087
Pulled By: yiwu-arbug
fbshipit-source-id: 8f539d6f897cfe29b6dc27a8992f68c2a629d40a
Summary:
It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it, logs were full, write buffer was full, etc.
This PR logs Flush reason whenever a flush is scheduled.
Closes https://github.com/facebook/rocksdb/pull/3401
Differential Revision: D6788142
Pulled By: miasantreble
fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511
Summary:
…db_test
options_settable_test won't pass UBSAN so disable it.
blob_db_test fails in UBSAN as SnapshotList doesn't initialize all the fields in dummy snapshot. Fix it. I don't understand why only blob_db_test fails though.
Closes https://github.com/facebook/rocksdb/pull/3477
Differential Revision: D6928681
Pulled By: siying
fbshipit-source-id: e31dd300fcdecdfd4f6af279a0987fd0cdec5122
Summary:
Update compaction_iterator_test with write-prepared transaction DB related tests. Transaction related tests are group in CompactionIteratorWithSnapshotCheckerTest. The existing test are duplicated to make them also test with dummy SnapshotChecker that will say every key is visible to every snapshot (this is okay, we still compare sequence number to verify visibility). Merge related tests are disabled and will be revisit in another PR.
Existing db_iterator_tests are also duplicated to test with dummy read_callback that will say every key is committed.
Closes https://github.com/facebook/rocksdb/pull/3466
Differential Revision: D6909253
Pulled By: yiwu-arbug
fbshipit-source-id: 2ae4656b843a55e2e9ff8beecf21f2832f96cd25
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455
Differential Revision: D6873699
Pulled By: maysamyabandeh
fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
Summary:
There are a couple of places where we swallow any error from
WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate
the error up in those cases rather than ignoring it.
Closes https://github.com/facebook/rocksdb/pull/3404
Differential Revision: D6879954
Pulled By: anand1976
fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb
Summary:
ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws
Closes https://github.com/facebook/rocksdb/pull/3415
Differential Revision: D6817414
Pulled By: al13n321
fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24
Summary:
Per some discussions, this will switch our Appveyor testing to use Visual Studio 2017.
Closes https://github.com/facebook/rocksdb/pull/3445
Differential Revision: D6874918
Pulled By: gfosco
fbshipit-source-id: c5a0032ca9f37f0d3baeae35c59d850d528c3176
Summary:
Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions.
Closes https://github.com/facebook/rocksdb/pull/3418
Differential Revision: D6822412
Pulled By: maysamyabandeh
fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77
Summary:
Updated the test case to handle tmpfs mounted at directories different from "/dev/shm/".
Closes https://github.com/facebook/rocksdb/pull/3440
Differential Revision: D6848213
Pulled By: ajkr
fbshipit-source-id: 465e9dbf0921d0930161f732db6b3766bb030589
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.
This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.
Close https://github.com/facebook/rocksdb/issues/2951
Closes https://github.com/facebook/rocksdb/pull/3431
Differential Revision: D6849228
Pulled By: ajkr
fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
Summary:
In the test, there can be a dead lock between background flush thread and foreground main thread as following:
* background flush thread:
- holding db mutex, while
- waiting on "DBImpl::FlushMemTableToOutputFile:BeforeInstallSV" sync point.
* foreground thread:
- waiting for db mutex to write "key2"
Fixing by let background flush thread wait without holding db mutex.
Closes https://github.com/facebook/rocksdb/pull/3436
Differential Revision: D6841334
Pulled By: yiwu-arbug
fbshipit-source-id: b020768ac94e166e40953c5d09e505515a5f244d
Summary:
Added a test for three dynamic universal compaction options, in the realm of read amplification:
- size_ratio
- min_merge_width
- max_merge_width
Also updated DynamicUniversalCompactionSizeAmplification by adding a check on compaction reason.
Found a bug in compaction reason setting while working on this PR, and fixed in #3412 .
TODO for later: Still to add tests for these options: compression_size_percent, stop_style and trivial_move.
Closes https://github.com/facebook/rocksdb/pull/3419
Differential Revision: D6822217
Pulled By: sagar0
fbshipit-source-id: 074573fca6389053cbac229891a0163f38bb56c4
Summary:
ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block used to block cache.
The memory used by the data block is, however, not trackable by users.
To make the system more manageable, we can cost the block to block cache while using it, and then release it after using.
Closes https://github.com/facebook/rocksdb/pull/3333
Differential Revision: D6670230
Pulled By: miasantreble
fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues.
Reviewed By: yiwu-arbug
Differential Revision: D6821806
fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
Summary:
While writing tests for dynamic Universal Compaction options, I found that the compaction reasons we set for size-ratio based and sorted-run based universal compactions are swapped with each other. Fixed it.
Closes https://github.com/facebook/rocksdb/pull/3412
Differential Revision: D6820540
Pulled By: sagar0
fbshipit-source-id: 270a188968ba25b2c96a8339904416c4c87ff5b3
Summary:
In DBIter, Prev() calls FindValueForCurrentKey() to search the current value backward. If it finds that there are too many stale value being skipped, it falls back to FindValueForCurrentKeyUsingSeek(), seeking directly to the key with snapshot sequence. After introducing read_callback, however, the key it seeks to might not be visible, according to read_callback. It thus needs to keep searching forward until the first visible value.
Closes https://github.com/facebook/rocksdb/pull/3382
Differential Revision: D6756148
Pulled By: yiwu-arbug
fbshipit-source-id: 064e39b1eec5e083af1c10142600f26d1d2697be
Summary:
In ColumnFamilySet destructor, assert it hold the last reference to cfd before destroy them.
Closes#3112
Closes https://github.com/facebook/rocksdb/pull/3397
Differential Revision: D6777967
Pulled By: yiwu-arbug
fbshipit-source-id: 60b19070e0c194b3b6146699140c1d68777866cb