Summary:
In DBCompactionTestWithParam::ManualLevelCompactionOutputPathId, there is
a race condition between `DBTestBase::GetSstFileCount` and
`DBImpl::PurgeObsoleteFiles`. The following graph explains why.
```
Timeline db_compact_test_t bg_flush_t bg_compact_t
| [initiate bg flush and
| start waiting]
| flush
| DeleteObsoleteFiles
| [waken up by bg_flush_t which
| signaled in DeleteObsoleteFiles]
|
| [initiate compaction and
| start waiting]
|
| [compact,
| set manual.done to true]
| [signal at the end of
| BackgroundCallFlush]
|
| [waken up by bg_flush_t
| which signaled before
| returning from
| BackgroundCallFlush]
|
| Check manual.done is true
|
| GetSstFileCount <-- race condition --> PurgeObsoleteFiles
V
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4440
Differential Revision: D10122628
Pulled By: riversand963
fbshipit-source-id: 3ede73c39fee6ad804dc6ac1ed84759c7e63977f
Summary:
TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346
Differential Revision: D9759149
Pulled By: maysamyabandeh
fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b
Summary:
This is a followup to #4311. Checking `!RangeDelAggregator::IsEmpty()` before opening a dedicated range tombstone SST did not properly prevent empty SSTs from being generated. That's because it relies on `CollapsedRangeDelMap::Size`, which had an underflow bug when the map was empty. This PR fixes that underflow bug.
Also fixed an uninitialized variable in db_stress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4336
Differential Revision: D9600080
Pulled By: ajkr
fbshipit-source-id: bc6980ca79d2cd01b825ebc9dbccd51c1a70cfc7
Summary:
Basically at the moment it seems it's possible to cause write stall by calling flush (either manually vis DB::Flush(), or from Backup Engine directly calling FlushMemTable() while background flush may be already happening.
One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to separate method and call it from FlushMemTable.
This is draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297
Differential Revision: D9420705
Pulled By: mikhail-antonov
fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc
Summary:
CompactFiles checked whether the existing files conflicted with the chosen compaction. But it missed checking whether future files would conflict, i.e., when another compaction was simultaneously writing new files to the same range at the same output level.
Closes https://github.com/facebook/rocksdb/pull/3926
Differential Revision: D8218996
Pulled By: ajkr
fbshipit-source-id: 21cb00a6fed4c8c62d3ed2ff810962e6bdc2fdfb
Summary:
In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind.
Closes https://github.com/facebook/rocksdb/pull/3924
Differential Revision: D8210184
Pulled By: siying
fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a
Summary:
This feature was introduced for universal compaction in cc01985d. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level.
Closes https://github.com/facebook/rocksdb/pull/3835
Differential Revision: D7957179
Pulled By: ajkr
fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.
This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829
Differential Revision: D7915443
Pulled By: ajkr
fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718
Reviewed By: maysamyabandeh
Differential Revision: D7621192
Pulled By: miasantreble
fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662
Differential Revision: D7426121
Pulled By: Dayvedde
fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
Summary:
Add `compaction_reason` as part of event log for event `compaction started`.
Add counters for each `CompactionReason`.
Closes https://github.com/facebook/rocksdb/pull/3679
Differential Revision: D7550348
Pulled By: riversand963
fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102
Differential Revision: D6951697
Pulled By: ajkr
fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
Summary:
Ttl-triggered and snapshot-release-triggered compactions should not be considered as manual compactions. This is a bug.
Closes https://github.com/facebook/rocksdb/pull/3678
Differential Revision: D7498151
Pulled By: sagar0
fbshipit-source-id: a2d5bed05268a4dc93d54ea97a9ae44b366df15d
Summary:
Level Compaction with TTL.
As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
This functionality can be controlled by the newly introduced column family option -- ttl.
TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591
Differential Revision: D7275442
Pulled By: sagar0
fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
Summary:
Previously, the compaction in `DBCompactionTestWithParam.ForceBottommostLevelCompaction` generated multiple files in no-compression use case, andone file in compression use case. I increased `target_file_size_base` so it generates one file in both use cases.
Closes https://github.com/facebook/rocksdb/pull/3625
Differential Revision: D7311885
Pulled By: ajkr
fbshipit-source-id: 97f249fa83a9924ac34357a4bb3189c969ecb107
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.
- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520
Differential Revision: D7018897
Pulled By: ajkr
fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381
Differential Revision: D6754983
Pulled By: ajkr
fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.
This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.
Close https://github.com/facebook/rocksdb/issues/2951
Closes https://github.com/facebook/rocksdb/pull/3431
Differential Revision: D6849228
Pulled By: ajkr
fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
Summary:
error message was
```
==3095==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffd18216c40 at pc 0x0000005edda1 bp 0x7ffd18215550 sp 0x7ffd18214d00
...
Address 0x7ffd18216c40 is located in stack of thread T0 at offset 1952 in frame
#0 internal_repo_rocksdb/db_compaction_test.cc:1520 rocksdb::DBCompactionTest_DeleteFileRangeFileEndpointsOverlapBug_Test::TestBody()
```
It was unsafe to have slices referring to the temporary string objects' buffers, as those strings were destroyed before the slices were used. Fixed it by assigning the strings returned by `Key()` to local variables.
Closes https://github.com/facebook/rocksdb/pull/3238
Differential Revision: D6507864
Pulled By: ajkr
fbshipit-source-id: dd07de1a0070c6748c1ab4f3d7bd31f9a81889d0
Summary:
Fix for #2833.
- In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
- In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
- In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
Closes https://github.com/facebook/rocksdb/pull/2843
Differential Revision: D5772387
Pulled By: ajkr
fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
Summary:
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
Closes https://github.com/facebook/rocksdb/pull/3009
Differential Revision: D6062044
Pulled By: ajkr
fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
Summary:
Bug report: https://www.facebook.com/groups/rocksdb.dev/permalink/1389452781153232/
Non-empty `level0_compactions_in_progress_` was aborting `CompactFiles` after incrementing `bg_compaction_scheduled_`, and in that case we never decremented it. This blocked future compactions and prevented DB close as we wait for scheduled compactions to finish/abort during close.
I eliminated `CompactFiles`'s dependency on `level0_compactions_in_progress_`. Since it takes a contiguous span of L0 files -- through the last L0 file if any L1+ files are included -- it's fine to run in parallel with other compactions involving L0. We make the same assumption in intra-L0 compaction.
Closes https://github.com/facebook/rocksdb/pull/2849
Differential Revision: D5780440
Pulled By: ajkr
fbshipit-source-id: 15b15d3faf5a699aed4b82a58352d4a7bb23e027
Summary:
if we're moving any L0 files down, we need to include older L0 files since they may contain older versions of the keys being moved down.
Closes https://github.com/facebook/rocksdb/pull/2845
Differential Revision: D5773800
Pulled By: ajkr
fbshipit-source-id: 9f0770a8eaaeea4c87df2e7a2a1d65bf9d7f4f7e
Summary:
add this counter stat to track usage of deletion-dropping optimization. if usage is low, we can delete it to prevent bugs like #2726.
Closes https://github.com/facebook/rocksdb/pull/2761
Differential Revision: D5665421
Pulled By: ajkr
fbshipit-source-id: 881befa2d199838dac88709e7b376a43d304e3d4
Summary:
`KeyNotExistsBeyondOutputLevel` didn't consider L0 files' key-ranges. So if a key only was covered by older L0 files' key-ranges, we would incorrectly drop deletions of that key. This PR just skips the deletion-dropping optimization when output level is L0.
Closes https://github.com/facebook/rocksdb/pull/2726
Differential Revision: D5617286
Pulled By: ajkr
fbshipit-source-id: 4bff1396b06d49a828ba4542f249191052915bce
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
- `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
- `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
- `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions.
- The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
Closes https://github.com/facebook/rocksdb/pull/2205
Differential Revision: D4937461
Pulled By: ajkr
fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
Summary:
Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool.
Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool.
Closes https://github.com/facebook/rocksdb/pull/2204
Differential Revision: D4936256
Pulled By: ajkr
fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
Summary:
When user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249.
Test setup:
- command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize`
- filesystem: xfs
before this diff:
`fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)`
after this diff:
`fallocate(10, 01, 0, 1977) = 0`
Closes https://github.com/facebook/rocksdb/pull/2252
Differential Revision: D5007275
Pulled By: ajkr
fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880
Summary:
A data race between a manual and an auto compaction can cause a scheduled automatic compaction to be cancelled and never rescheduled again. This may cause a condition of hanging forever. Fix this by always making sure the cancelled compaction is put back to the compaction queue.
Closes https://github.com/facebook/rocksdb/pull/2238
Differential Revision: D4984591
Pulled By: siying
fbshipit-source-id: 3ab153886403c7b991896dcb2158b96cac12f227
Summary:
Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction
Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get().
Closes https://github.com/facebook/rocksdb/pull/2117
Differential Revision: D4860912
Pulled By: lightmark
fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19
Summary:
Level-based L0->L0 compaction operates on spans of files that aren't currently being compacted. It reduces the number of L0 files, thus making write stall conditions harder to reach.
- L0->L0 is triggered when base level is unavailable due to pending compactions
- L0->L0 always outputs one file of at most `max_level0_burst_file_size` bytes.
- Subcompactions are disabled for L0->L0 since we want to output one file.
- Input files are chosen as the longest span of available files that will fit within the size limit. This minimizes number of files in L0.
Closes https://github.com/facebook/rocksdb/pull/2027
Differential Revision: D4760318
Pulled By: ajkr
fbshipit-source-id: 9d07183
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823
Differential Revision: D4492902
Pulled By: siying
fbshipit-source-id: c74cb11
Summary:
When we introduced range deletion block, TableCache::Get() and TableCache::NewIterator() each did two table cache lookups, one for range deletion block iterator and another for getting the table reader to which the Get()/NewIterator() is delegated. This extra cache lookup was very CPU-intensive (about 10% overhead in a read-heavy benchmark). We can avoid it by reusing the Cache::Handle created for range deletion block iterator to get the file reader.
Closes https://github.com/facebook/rocksdb/pull/1537
Differential Revision: D4201167
Pulled By: ajkr
fbshipit-source-id: d33ffd8
Summary:
Previously we used TableCache::NewIterator() for multiple purposes (data
block iterator and range deletion iterator), and returned non-ok status in
the data block iterator. In one case where the caller only used the range
deletion block iterator (9e7cf3469b/db/version_set.cc (L965-L973)),
we didn't check/free the data block iterator containing non-ok status, which
caused a valgrind error.
So, this diff decouples creation of data block and range deletion block iterators,
and updates the callers accordingly. Both functions can return non-ok status
in an InternalIterator. Since the non-ok status is returned in an iterator that the
callers will definitely use, it should be more usable/less error-prone.
Closes https://github.com/facebook/rocksdb/pull/1513
Differential Revision: D4181423
Pulled By: ajkr
fbshipit-source-id: 835b8f5
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.
added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456
Differential Revision: D4111271
Pulled By: ajkr
fbshipit-source-id: 6e388d4
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.
For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.
To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.
RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.
One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.
Depends on D61473
Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927
Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62205
Summary: We may wrongly drop delete operation if we pick a file with the entry to be delete, the put entry of the same user key is in the next file in the level, and the next file is not picked. We expand compaction inputs for output level too.
Test Plan: Add unit tests that reproduct the bug of dropping delete entry. Change compaction_picker_test to assert the new behavior.
Reviewers: IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61173
Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point.
Test Plan: Write a unit test to validate.
Reviewers: sdong, IslamAbdelRahman, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60375
Summary: DBCompactionTest.SkipStatsUpdateTest sometimes fails. I don't see any verification related to the deletes issued. Remove them to avoid the uncertainty.
Test Plan: Run the test.
Reviewers: IslamAbdelRahman, andrewkr, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59613
Summary:
Changing several option defaults:
options.max_open_files changes from 5000 to -1
options.base_background_compactions changes from max_background_compactions to 1
options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords
options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize
Test Plan: Write unit tests to see OldDefaults() works as expected.
Reviewers: IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56427
Summary: The reason for this test flakiness is that we try to verify that number of files in L0 is 3 after flushing the 3rd file although we may have a compaction running in the background that may finish before we do the check and the 3 L0 files are converted to 1 L1 file
Test Plan: Run a modified version of the test that sleep before doing the check
Reviewers: sdong, andrewkr, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56643
Summary: Change some RocksDB default options to make it more friendly to server workloads.
Test Plan: Run all existing tests
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: sumeet, muthu, benj, MarkCallaghan, igor, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55941
Summary:
Add a new compaction priority as following:
For every file, we calculate total size of files overalapping with the file in the next level, over the file's size itself. The file with smallest ratio will be picked first.
My "db_bench --fillrandom" shows about 5% less compaction than kOldestSmallestSeqFirst if --hard_pending_compaction_bytes_limit value to keep LSM tree in shape. If not limiting hard_pending_compaction_bytes_limit, improvement is only 1% or 2%.
Test Plan: Add a unit test
Reviewers: andrewkr, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54075
Summary:
If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions.
The watermarks are calculated based on slowdown thresholds.
Test Plan:
Add new test cases in column_family_test.
Adding more unit tests.
Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D53409
Makefile adjust paths for solaris build
Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available
db_compaction_test.cc Initialise a variable to avoid a compilation error
db_impl.cc Include <alloca.h>
db_test.cc Include <alloca.h>
Environment.java recognise solaris envrionment
options_bulder.cc Make log unambiguous
geodb_impl.cc Make log and floor unambiguous
Summary:
While running the myrocks regression suite, I found that while
dropping a table soon after inserting rows into it resulted in an
assertion failure in CheckConsistencyForDeletes for not finding
a file which was recently added or moved. Marking the files to be
deleted as being compacted before calling LogAndApplyChange
fixed the assertion failures.
Test Plan: DBCompactionTest.DeleteFileRange
Reviewers: IslamAbdelRahman, anthony, yhchiang, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, yoshinorim, leveldb
Differential Revision: https://reviews.facebook.net/D52599
Summary:
We need to clean the job context if we end up not deleting any
files because no files are in the range specified.
Test Plan: DBCompactionTest.DeleteFileRange
Reviewers: sdong, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52467
Summary:
Fix CLANG build error caused by type mismatch. Changed type to
size_t.
Test Plan: Clang build and make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D52413
Summary:
This is an initial diff for providing the ability to delete
files which are completely within a given range of keys.
Test Plan: DBCompactionTest.DeleteRange
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52293
Summary:
List of changes:
1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size.
2) Remove unnecessary checks before calling delete operator.
3) Increase code correctness by using size_t type when getting vector's size.
4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage.
5) Fix various lint errors pointed out by 'arc lint'.
Test Plan:
Code review and build:
git diff
make clean
make -j 32 commit-prereq
arc lint
Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51849
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.
Test Plan: Add a unit test and run it in valgrind too.
Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51459
Summary:
Several tests in db_compaction_test are failing with aborts in
valgrind. These are LevelCompactionThirdPath, LevelCompactionPathUse and
CompressLevelCompaction. We now use the SpecialSkipListFactory to make
them more deterministic
Test Plan: valgrind
Reviewers: anthony, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51663
Summary: DBCompactionTestWithParam.CompactionTrigger fails in non-jemalloc build, after the skip list memtable change. Fix it by making mem table flush trigger by number of entries.
Test Plan: Run the test using both of jemalloc and non-jemalloc build.
Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, igor, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51471
* conversion from 'size_t' to 'type', by add static_cast
Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
Summary:
DBCompactionTest.SkipStatsUpdateTest relies on the number
of files opened during the DB::Open process, but the persisting
options file support altered this number and thus makes
DBCompactionTest.SkipStatsUpdateTest in certain environment.
This patch fixed this test failure.
Test Plan: db_compaction_test
Reviewers: igor, sdong, anthony, IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D50637
Summary: Manual compaction should not fill block cache. Add the verification in unit test
Test Plan: Run the test
Reviewers: yhchiang, kradhakrishnan, rven, IslamAbdelRahman, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49089
Summary:
This patch will block all tests (not including db_test) that don't compile / fail under ROCKSDB_LITE
Test Plan:
OPT=-DROCKSDB_LITE make db_compaction_filter_test -j64 &&
OPT=-DROCKSDB_LITE make db_compaction_test -j64 &&
OPT=-DROCKSDB_LITE make db_dynamic_level_test -j64 &&
OPT=-DROCKSDB_LITE make db_log_iter_test -j64 &&
OPT=-DROCKSDB_LITE make db_tailing_iter_test -j64 &&
OPT=-DROCKSDB_LITE make db_universal_compaction_test -j64 &&
OPT=-DROCKSDB_LITE make ldb_cmd_test -j64
make clean
make db_compaction_filter_test -j64 &&
make db_compaction_test -j64 &&
make db_dynamic_level_test -j64 &&
make db_log_iter_test -j64 &&
make db_tailing_iter_test -j64 &&
make db_universal_compaction_test -j64 &&
make ldb_cmd_test -j64
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48723
Summary:
As part of tech debt week, we are cleaning up dependencies.
This diff moves db_test_util.[h,cc] from util to db directory.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48543
Summary: As title
Test Plan: make check
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46983
Summary:
The test SuggestCompactRangeNoTwoLevel0Compactions in
DBCompactionTest fails when there are parallel L0-L1 compactions
taking place because the test makes sure that only one compaction
involving L0 takes place at any given time (since before having
parallel compactions this was impossible). I changed the test to only
run with DBOptions.max_subcompactions=1 so as to not hit this issue
which is not a correctness issue but just an inherent changing of
assumptions after introducing parallel compactions.
This failed after landing https://reviews.facebook.net/D43269#inline-321303
so now this should fix it
Test Plan: make all && make check
Reviewers: yhchiang, igor, anthony, noetzli, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46617
Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". It makes result.arena_block_size not a multiplier of 4KB, even if options.write_buffer_size is a multiplier of MBs. When calling malloc to arena_block_size, we may waste a small amount of memory for it. We now make the default to be /8 or /16 and align it to 4KB.
Test Plan: unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46467
Summary:
Flushes in listener_test happened to early when ROCKSDB_MALLOC_USABLE_SIZE was
active (e.g. when compiling with ROCKSDB_FBCODE_BUILD_WITH_481=1) due to
malloc_usable_size() reporting a better estimate (similar to
https://reviews.facebook.net/D43317 ). This patch grows the write buffer size
slightly to compensate for this.
Test Plan: ROCKSDB_FBCODE_BUILD_WITH_481=1 make listener_test && ./listener_test
Reviewers: rven, anthony, yhchiang, igor, sdong
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45921
Summary: malloc_usable_size() gets a better estimation of memory usage. It is already used to calculate block cache memory usage. Use it in arena too.
Test Plan: Run all unit tests
Reviewers: anthony, kradhakrishnan, rven, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43317
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.
I also added a --subcompactions option to db_bench
Test Plan: make all make check
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45069
Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. It makes fadvise works less predictable. Add options.new_table_reader_for_compaction_inputs to enforce to create a new file descriptor and new table reader for it.
Test Plan: Add the option.
Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang
Reviewed By: igor
Subscribers: igor, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43311
Summary:
In D43239 (https://reviews.facebook.net/D43239) the number
of subcompactions is set based on the number of L1 files with
unique starting keys. In certain cases when this number is very large
this causes issues, particularly with the overlap between files since
very small output files can be generated. This diff bounds the number
of subcompactions to the user option DBOption.num_subcompactions.
Test Plan: ./db_test ./db_compaction_test
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44883
Summary:
Updated DBTest DBCompactionTest and CompactionJobStatsTest
to run compaction-related tests once with subcompactions enabled and
once disabled using the TEST_P test type in the Google Test suite.
Test Plan: ./db_test ./db_compaction-test ./compaction_job_stats_test
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43443
Summary:
Make DBCompactionTest.SkipStatsUpdateTest more stable by
removing flaky but unnecessary assertion on the size of db
as simply checking the random file open count is suffice.
Test Plan: db_compaction_test
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43533
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.
This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.
Test Plan: Add DBCompactionTest.SkipStatsUpdateTest
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: tnovak, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42843
Summary:
The compact files API had a bug where some overlapping files
are not added. These are files which overlap with files which were
added to the compaction input files, but not to the original set of
input files. This happens only when there are more than two levels
involved in the compaction. An example will illustrate this better.
Level 2 has 1 input file 1.sst which spans [20,30].
Level 3 has added file 2.sst which spans [10,25]
Level 4 has file 3.sst which spans [35,40] and
input file 4.sst which spans [46,50].
The existing code would not add 3.sst to the set of input_files because
it only becomes an overlapping file in level 4 and it wasn't one in
level 3.
When installing the results of the compaction, 3.sst would overlap with
output file from the compact files and result in the assertion in
version_set.cc:1130
// Must not overlap
assert(level <= 0 || level_files->empty() ||
internal_comparator_->Compare(
(*level_files)[level_files->size() - 1]->largest, f->smallest) <
0);
This change now adds overlapping files from the current level to the set
of input files also so that we don't hit the assertion above.
Test Plan:
d=/tmp/j; rm -rf $d; seq 1000 | parallel --gnu --eta
'd=/tmp/j/d-{}; mkdir -p $d; TEST_TMPDIR=$d ./db_compaction_test
--gtest_filter=*CompactilesOnLevel* --gtest_also_run_disabled_tests >&
'$d'/log-{}'
Reviewers: igor, yhchiang, sdong
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43437
Summary:
Made SuggestCompactRangeNoTwoLevel0Compactions by forcing
a flush after generating a file and waiting for compaction at the end.
Test Plan: Run SuggestCompactRangeNoTwoLevel0Compactions
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43449
Summary: DBCompactionTest.PartialCompactionFailure has a risk that one flush job writes out two mem tables into one file, so that the total files flushed are less than expected. Fix it by writing for flush to finish after every write.
Test Plan: Run the test
Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42831
Summary: Move general compaction tests from db_test.cc to db_compaction_test.cc
Test Plan:
db_test
db_compaction_test
Reviewers: igor, sdong, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42651