Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548
Differential Revision: D7116939
Pulled By: yiwu-arbug
fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
Summary:
Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization.
The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators.
The two-level-iterator now is only used by partitioned index, so it is simplified.
Closes https://github.com/facebook/rocksdb/pull/3406
Differential Revision: D6809041
Pulled By: siying
fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa
Summary:
db/version_builder.cc:
117 base_vstorage_->InternalComparator();
CID 1351713 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member field level_zero_cmp_.internal_comparator is not initialized in this constructor nor in any functions that it calls.
db/version_edit.h:
145 FdWithKeyRange()
146 : fd(),
147 smallest_key(),
148 largest_key() {
CID 1418254 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member file_metadata is not initialized in this constructor nor in any functions that it calls.
149 }
db/version_set.cc:
120 }
CID 1322789 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
4. uninit_member: Non-static class member curr_file_level_ is not initialized in this constructor nor in any functions that it calls.
121 }
db/write_batch.cc:
939 assert(cf_mems_);
CID 1419862 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
3. uninit_member: Non-static class member rebuilding_trx_seq_ is not initialized in this constructor nor in any functions that it calls.
940 }
Closes https://github.com/facebook/rocksdb/pull/3092
Differential Revision: D6505666
Pulled By: yiwu-arbug
fbshipit-source-id: fd2c68948a0280772691a419d72ac7e190951d86
Summary:
Fix for #2833.
- In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
- In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
- In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
Closes https://github.com/facebook/rocksdb/pull/2843
Differential Revision: D5772387
Pulled By: ajkr
fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205
Differential Revision: D6438959
Pulled By: maysamyabandeh
fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
Summary:
Before we were checking every file in the level which was unnecessary. We can piggyback onto the code for checking point-key overlap, which already opens all the files that could possibly contain overlapping range deletions. This PR makes us check just the range deletions from those files, so no extra ones will be opened.
Closes https://github.com/facebook/rocksdb/pull/3179
Differential Revision: D6358125
Pulled By: ajkr
fbshipit-source-id: 00e200770fdb8f3cc6b1b2da232b755e4ba36279
Summary:
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
Closes https://github.com/facebook/rocksdb/pull/3009
Differential Revision: D6062044
Pulled By: ajkr
fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048
Differential Revision: D6126272
Pulled By: maysamyabandeh
fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
Summary:
ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.
Some of the ways in which the fifo compaction options can be set are:
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
- `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`
Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
Closes https://github.com/facebook/rocksdb/pull/3006
Differential Revision: D6058619
Pulled By: sagar0
fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
Summary:
The file numbers assigned post-repair were sometimes smaller than older files' numbers due to `LogAndApply` saving the wrong next file number in the manifest.
- Mark the highest file seen during repair as used before `LogAndApply` so the correct next file number will be stored.
- Renamed `MarkFileNumberUsedDuringRecovery` to `MarkFileNumberUsed` since now it's used during repair in addition to during recovery
- Added `TEST_Current_Next_FileNo` to expose the next file number for the unit test.
Closes https://github.com/facebook/rocksdb/pull/2988
Differential Revision: D6018083
Pulled By: ajkr
fbshipit-source-id: 3f25cbf74439cb8f16dd12af90b67f9f9f75e718
Summary:
Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).
The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob().
Changes on blob db side will be in a separate patch.
Closes https://github.com/facebook/rocksdb/pull/2886
Differential Revision: D5838431
Pulled By: yiwu-arbug
fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
Summary:
Recover txns from the WAL. Also added some unit tests.
Closes https://github.com/facebook/rocksdb/pull/2901
Differential Revision: D5859596
Pulled By: maysamyabandeh
fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0
Summary:
- Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
Closes https://github.com/facebook/rocksdb/pull/2868
Differential Revision: D5804152
Pulled By: axxufb
fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850
Differential Revision: D5787375
Pulled By: maysamyabandeh
fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
Summary:
Allow user to reduce number of levels in LSM by issue a full CompactRange() and put the result in a lower level, and then reopen DB with reduced options.num_levels. Previous this will fail on reopen on when recovery replaying the previous MANIFEST and found a historical file was on a higher level than the new options.num_levels. The workaround was after CompactRange(), reopen the DB with old num_levels, which will create a new MANIFEST, and then reopen the DB again with new num_levels.
This patch relax the check of levels during recovery. It allows DB to open if there was a historical file on level > options.num_levels, but was also deleted.
Closes https://github.com/facebook/rocksdb/pull/2740
Differential Revision: D5629354
Pulled By: yiwu-arbug
fbshipit-source-id: 545903f6b36b6083e8cbaf777176aef2f488021d
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.
This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580
Differential Revision: D5422916
Pulled By: ajkr
fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
Summary:
Breaking commit: d12691b86f
In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction.
This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed.
Closes https://github.com/facebook/rocksdb/pull/2662
Differential Revision: D5515108
Pulled By: ajkr
fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18
Summary:
Post-compaction work holds onto db mutex for the longest time (found by tracing lock acquires/releases with LTTng and correlating timestamps with our info log). Further experimentation showed `TableCache::EraseHandle` is responsible for ~86% of time mutex is held. We can just release the handle outside the db mutex.
Closes https://github.com/facebook/rocksdb/pull/2654
Differential Revision: D5507126
Pulled By: ajkr
fbshipit-source-id: 703c01ddf2aea16bc0f9e33c08935d78aa6b781d
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.
It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
Closes https://github.com/facebook/rocksdb/pull/2507
Differential Revision: D5345702
Pulled By: al13n321
fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
Summary:
Introducing FIFO compactions with TTL.
FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.
To address that request:
- Added a new TTL option to FIFO compaction options.
- Updated FIFO compaction score to take TTL into consideration.
- Added a new table property, creation_time, to keep track of when the SST file is created.
- Creation_time is set as below:
- On Flush: Set to the time of flush.
- On Compaction: Set to the max creation_time of all the files involved in the compaction.
- On Repair and Recovery: Set to the time of repair/recovery.
- Old files created prior to this code change will have a creation_time of 0.
- FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
- FIFO compaction will fall back to the prior way of deleting files based on size if:
- the creation_time of all files involved in compaction is 0.
- the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
This feature is not supported if max_open_files != -1 or with table formats other than Block-based.
**Test Plan:**
Added tests.
**Benchmark results:**
Base: FIFO with max size: 100MB ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
readwhilewriting : 1.924 micros/op 519858 ops/sec; 13.6 MB/s (1176277 of 5000000 found)
```
With TTL (a low one for testing) ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
readwhilewriting : 1.902 micros/op 525817 ops/sec; 13.7 MB/s (1185057 of 5000000 found)
```
Example Log lines:
```
2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
...
2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
```
SST Files remaining in the dbbench dir, after db_bench execution completed:
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ ls -l /dev/shm//dbbench/*.sst
-rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
-rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
-rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
```
Closes https://github.com/facebook/rocksdb/pull/2480
Differential Revision: D5305116
Pulled By: sagar0
fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)
The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.
Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.
Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345
Differential Revision: D5210732
Pulled By: maysamyabandeh
fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
Summary:
CreateColumnFamily() releases DB mutex after adding column family to the set and install super version (to write option file), so if users call GetAggregatedIntProperty() in the middle, then super version will be null and the process will crash. Fix it by skipping those column families without super version installed.
Maybe we should also fix the problem of releasing the lock when reading option file, but it is more risky. so I'm doing a quick and safer fix and we can investigate it later.
Closes https://github.com/facebook/rocksdb/pull/2475
Differential Revision: D5298053
Pulled By: siying
fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
Summary:
We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne.
Closes https://github.com/facebook/rocksdb/pull/2417
Differential Revision: D5193528
Pulled By: siying
fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df
Summary:
Fixes the following scenario:
1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`.
2. Do a compaction.
3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`.
4. At some point compaction filter returns `kRemoveAndSkipUntil`.
5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded.
The fix is to make compaction iterator use `total_order_seek = true`.
The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice).
Closes https://github.com/facebook/rocksdb/pull/2349
Differential Revision: D5110388
Pulled By: lightmark
fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a
Summary:
The range deletion meta-block iterators weren't getting cleaned up properly since they don't support arena allocation. I didn't implement arena support since, in the general case, each iterator is used only once and separately from all other iterators, so there should be no benefit to data locality.
Anyways, this diff fixes up #2370 by treating range deletion iterators as non-arena-allocated.
Closes https://github.com/facebook/rocksdb/pull/2399
Differential Revision: D5171119
Pulled By: ajkr
fbshipit-source-id: bef6f5c4c5905a124f4993945aed4bd86e2807d8
Summary:
also changed the `>` in the comparison against `level0_file_num_compaction_trigger` into a `>=` since exactly `level0_file_num_compaction_trigger` can trigger a compaction from L0.
Closes https://github.com/facebook/rocksdb/pull/2179
Differential Revision: D4915772
Pulled By: ajkr
fbshipit-source-id: e38fec6253de6f9a40e61734615c6670d84038aa
Summary:
Previously we returned NotSupported when ingesting files into a database containing any range deletions. This diff adds the support.
- Flush if any memtable contains range deletions overlapping the to-be-ingested file
- Place to-be-ingested file before any level that contains range deletions overlapping it.
- Added support for `Version` to return iterators over range deletions in a given level. Previously, we piggybacked getting range deletions onto `Version`'s `Get()` / `AddIterator()` functions by passing them a `RangeDelAggregator*`. But file ingestion needs to get iterators over range deletions, not populate an aggregator (since the aggregator does collapsing and doesn't expose the actual ranges).
Closes https://github.com/facebook/rocksdb/pull/2370
Differential Revision: D5127648
Pulled By: ajkr
fbshipit-source-id: 816faeb9708adfa5287962bafdde717db56e3f1a
Summary:
Add a histogram in statistics to help users understand how many merge operands they merge.
Closes https://github.com/facebook/rocksdb/pull/2373
Differential Revision: D5139983
Pulled By: siying
fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337
Differential Revision: D5100261
Pulled By: lightmark
fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
Summary:
Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks:
[a b c d][f g h]
Then the index key for the first block will be 'e'.
then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already.
This diff can reduce RA in most cases.
Closes https://github.com/facebook/rocksdb/pull/2239
Differential Revision: D4990693
Pulled By: lightmark
fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff
Summary:
Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files.
Closes https://github.com/facebook/rocksdb/pull/2163
Differential Revision: D4895953
Pulled By: siying
fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231
Summary:
Changed dynamic leveling to stop setting the base level's size bound below `max_bytes_for_level_base`.
Behavior for config where `max_bytes_for_level_base == level0_file_num_compaction_trigger * write_buffer_size` and same amount of data in L0 and base-level:
- Before #2027, compaction scoring would favor base-level due to dividing by size smaller than `max_bytes_for_level_base`.
- After #2027, L0 and Lbase get equal scores. The disadvantage is L0 is often compacted before reaching the num files trigger since `write_buffer_size` can be bigger than the dynamically chosen base-level size. This increases write-amp.
- After this diff, L0 and Lbase still get equal scores. Now it takes `level0_file_num_compaction_trigger` files of size `write_buffer_size` to trigger L0 compaction by size, fixing the write-amp problem above.
Closes https://github.com/facebook/rocksdb/pull/2123
Differential Revision: D4861570
Pulled By: ajkr
fbshipit-source-id: 467ddef56ed1f647c14d86bb018bcb044c39b964
Summary:
Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache.
Closes https://github.com/facebook/rocksdb/pull/2185
Differential Revision: D4979189
Pulled By: yiwu-arbug
fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda
Summary:
The goal is to avoid the problem of small number of L0 files triggering compaction to base level (which increased write-amp), while still allowing L0 compaction-by-size (so intra-L0 compactions cause score to increase).
Closes https://github.com/facebook/rocksdb/pull/2172
Differential Revision: D4908552
Pulled By: ajkr
fbshipit-source-id: 4b170142b2b368e24bd7948b2a6f24c69fabf73d
Summary:
Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction
Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get().
Closes https://github.com/facebook/rocksdb/pull/2117
Differential Revision: D4860912
Pulled By: lightmark
fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19
Summary:
This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper.
Check the sub-commits for easier reviewing.
Closes https://github.com/facebook/rocksdb/pull/2094
Differential Revision: D4837730
Pulled By: sagar0
fbshipit-source-id: 344278a
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
Level-based L0->L0 compaction operates on spans of files that aren't currently being compacted. It reduces the number of L0 files, thus making write stall conditions harder to reach.
- L0->L0 is triggered when base level is unavailable due to pending compactions
- L0->L0 always outputs one file of at most `max_level0_burst_file_size` bytes.
- Subcompactions are disabled for L0->L0 since we want to output one file.
- Input files are chosen as the longest span of available files that will fit within the size limit. This minimizes number of files in L0.
Closes https://github.com/facebook/rocksdb/pull/2027
Differential Revision: D4760318
Pulled By: ajkr
fbshipit-source-id: 9d07183
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080
Differential Revision: D4821141
Pulled By: siying
fbshipit-source-id: ca7d768
Summary:
PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the
user. This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
value 100 byte: 1.8% regular, 1.2% merge values
value 1k byte: 11.5% regular, 7.5% merge values
value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The d
Closes https://github.com/facebook/rocksdb/pull/1756
Differential Revision: D4391738
Pulled By: maysamyabandeh
fbshipit-source-id: 6f3edd3
Summary:
…action
The two options, min_partial_merge_operands and verify_checksums_in_compaction, are not seldom used. Remove them to reduce the total number of options. Also remove them from Java and C interface.
Closes https://github.com/facebook/rocksdb/pull/1902
Differential Revision: D4601219
Pulled By: siying
fbshipit-source-id: aad4cb2
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
Summary:
fix lite bugs
disable direct io in lite mode
Closes https://github.com/facebook/rocksdb/pull/1870
Differential Revision: D4559866
Pulled By: yiwu-arbug
fbshipit-source-id: 3761c51
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859
Differential Revision: D4541292
Pulled By: sagar0
fbshipit-source-id: 5b3a6ca
Summary:
merger.h was always a confusing name for me, simply give the file a better name
Closes https://github.com/facebook/rocksdb/pull/1836
Differential Revision: D4505357
Pulled By: IslamAbdelRahman
fbshipit-source-id: 07b28d8
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
1. value 100 byte: 1.8% regular, 1.2% merge values
2. value 1k byte: 11.5% regular, 7.5% merge values
3. value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732
Differential Revision: D4374613
Pulled By: maysamyabandeh
fbshipit-source-id: a077f1a
Summary:
Previously we used TableCache::NewIterator() for multiple purposes (data
block iterator and range deletion iterator), and returned non-ok status in
the data block iterator. In one case where the caller only used the range
deletion block iterator (9e7cf3469b/db/version_set.cc (L965-L973)),
we didn't check/free the data block iterator containing non-ok status, which
caused a valgrind error.
So, this diff decouples creation of data block and range deletion block iterators,
and updates the callers accordingly. Both functions can return non-ok status
in an InternalIterator. Since the non-ok status is returned in an iterator that the
callers will definitely use, it should be more usable/less error-prone.
Closes https://github.com/facebook/rocksdb/pull/1513
Differential Revision: D4181423
Pulled By: ajkr
fbshipit-source-id: 835b8f5
Summary:
Note: reviewed in https://reviews.facebook.net/D65115
- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464
Differential Revision: D4131753
Pulled By: ajkr
fbshipit-source-id: be86559
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.
added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456
Differential Revision: D4111271
Pulled By: ajkr
fbshipit-source-id: 6e388d4
Summary:
Changes in the diff
API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API
Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers
Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob
Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)
Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: jkedgar, hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65061
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.
For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.
To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.
RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.
One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.
Depends on D61473
Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927
Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62205
Summary:
We always run consistency checks when compiling in debug mode
allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode
Test Plan:
make check -j64
make release
Reviewers: lightmark, sdong, yiwu
Reviewed By: yiwu
Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D64701
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63945
Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1).
Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59913
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).
* Prevent incrementing compression counter when compression is turned off in options.
* Prevent incrementing compression counter when compression is turned off in options.
* Added two more supported compression types to test code in db_test.cc
* Prevent incrementing compression counter when compression is turned off in options.
* Added new StatsLevel that excludes compression timing.
* Fixed casting error in coding.h
* Fixed CompressionStatsTest for new StatsLevel.
* Removed unused variable that was breaking the Linux build
Summary: MyRocks release integration build breaks because we treat warnings caused by unused variables as errors. Variable `edit` is only used in debug builds. Therefore we need to guard it using `#ifndef NDEBUG` check.
Test Plan:
- `[p]arc diff --preview` for the default validation.
- Verify that release build fails before this fix and passes after applying it.
Reviewers: andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60423
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925
Summary:
We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code
This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59577
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options
Test Plan: UT
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, lgalanis, dhruba
Differential Revision: https://reviews.facebook.net/D58653
Summary:
GetObsoleteFiles() and LogAndApply() functions modify obsolete_manifests_ vector
we need to make sure that the mutex is held when we modify the obsolete_manifests_
Test Plan: run the test under TSAN
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58011
Summary: Currently we estimate bytes needed for compaction by assuming fanout value to be level multiplier. It overestimates when size of a level exceeds the target by large. We estimate by the ratio of actual sizes in levels instead.
Test Plan: Fix existing test cases and add a new one.
Reviewers: IslamAbdelRahman, igor, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57789
Summary:
This is needed so we can measure compression ratio improvements
achieved by D52287.
The property compares raw data size against the total file size for a given
level. If the level is empty it should return 0.0.
Test Plan: new unit test
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56967
Summary: We use object `v` in the error message, which is not initialized if the edit is column family manipulation. This doesn't provide much useful info, so this diff is removing it. Instead, it dumps actual VersionEdit contents.
Test Plan: compiles. would be great to get tests in version_set_test.cc that cover cases where a file write fails
Reviewers: sdong, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56349
Summary:
There is a hardcoded constraint in our statistics collection that prevents reading properties from more than 20 SST files. This means our statistics will be very inaccurate for databases with > 20 files since additional files are just ignored. The purpose of constraining the number of files used is to bound the I/O performed during statistics collection, since these statistics need to be recomputed every time the database reopened.
However, this constraint doesn't take into account the case where option "max_open_files" is -1. In that case, all the file metadata has already been read, so MaybeInitializeFileMetaData() won't incur any I/O cost. so this diff gets rid of the 20-file constraint in case max_open_files == -1.
Test Plan:
write into unit test db/db_properties_test.cc - "ValidateSampleNumber".
We generate 20 files with 2 rows and 10 files with 1 row.
If max_open_files !=-1, the `rocksdb.estimate-num-keys` should be (10*1 + 10*2)/20 * 30 = 45. Otherwise, it should be the ground truth, 50.
{F1089153}
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56253
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
Summary: We were not measuring the time spent in merge_operator when called from Version::Get()
Test Plan: added a unittest
Reviewers: sdong, yhchiang
Reviewed By: yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55905
Summary:
When subcompaction is used, all subcompactions share the same Compaction
pointer in CompactionJob while each subcompaction all keeps their mutable
stats in SubcompactionState. However, there're still some mutable part
that is currently store in the shared Compaction pointer.
This patch makes two changes:
1. Make the shared Compaction pointer const so that it can never be modified
during the compaction.
2. Move necessary states from Compaction to SubcompactionState.
3. Make functions of Compaction const if the function does not modify
its internal state.
Test Plan: rocksdb and MyRocks test
Reviewers: sdong, kradhakrishnan, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, yoshinorim, gunnarku, leveldb
Differential Revision: https://reviews.facebook.net/D55923
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.
Test Plan:
Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).
DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
Mac: OK.
Linux: with D55287 patched in it's OK.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54801
Summary:
- Keep track of obsolete manifests in VersionSet
- Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles()
- Added test case that verifies a stale manifest is deleted by a non-full purge
Test Plan:
$ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
Reviewers: IslamAbdelRahman, yoshinorim, sdong
Reviewed By: sdong
Subscribers: andrewkr, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D55269
Summary:
Fixed two related race conditions in backup creation.
(1) CreateNewBackup() uses DB::DisableFileDeletions() to prevent table files
from being deleted while it is copying; however, the MANIFEST file could still
rotate during this time. The fix is to stop deleting the old manifest in the
rotation logic. It will be deleted safely later when PurgeObsoleteFiles() runs
(can only happen when file deletions are enabled).
(2) CreateNewBackup() did not account for the CURRENT file being mutable.
This is significant because the files returned by GetLiveFiles() contain a
particular manifest filename, but the manifest to which CURRENT refers can
change at any time. This causes problems when CURRENT changes between the call
to GetLiveFiles() and when it's copied to the backup directory. To workaround this, I
manually forge a CURRENT file referring to the manifest filename returned in
GetLiveFiles().
(2) also applies to the checkpointing code, so let me know if this approach is
good and I'll make the same change there.
Test Plan:
new test for roll manifest during backup creation.
running the test before this change:
$ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
...
IO error: /tmp/rocksdbtest-9383/backupable_db/MANIFEST-000001: No such file or directory
running the test after this change:
$ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
...
[ RUN ] BackupableDBTest.ChangeManifestDuringBackupCreation
[ OK ] BackupableDBTest.ChangeManifestDuringBackupCreation (2836 ms)
Reviewers: IslamAbdelRahman, anthony, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54711
Summary:
Add a new compaction priority as following:
For every file, we calculate total size of files overalapping with the file in the next level, over the file's size itself. The file with smallest ratio will be picked first.
My "db_bench --fillrandom" shows about 5% less compaction than kOldestSmallestSeqFirst if --hard_pending_compaction_bytes_limit value to keep LSM tree in shape. If not limiting hard_pending_compaction_bytes_limit, improvement is only 1% or 2%.
Test Plan: Add a unit test
Reviewers: andrewkr, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54075
Summary:
Following up on D53493, we can still enable the filter-skipping
optimization for last file in L0. It's correct to assume the key will be present
in the last L0 file when we're hit-optimized and L0 is deepest.
The FilePicker encapsulates the state for traversing each level's files, so I
needed to make it expose whether the returned file is last in its level.
Test Plan:
verified below test fails before this patch and passes afterwards.
The change to how the test memtable is populated is needed so file 1 has keys
(0, 30, 60), file 2 has keys (10, 40, 70), etc.
$ ./db_universal_compaction_test --gtest_filter=UniversalCompactionNumLevels/DBTestUniversalCompaction.OptimizeFiltersForHits/*
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53583
Summary: It's a regression bug caused by e089db40f9. With the change, if options.optimize_filters_for_hits=true and there are only L0 files (like single level universal compaction), we skip all the files in L0, which is more than necessary. Fix it by always trying to query bloom filter for files in level 0.
Test Plan: Add a unit test for it.
Reviewers: anthony, rven, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53493
Summary: Revert the functionaility of D7809 (but I'm keeping the logging and test code). We decided it was dangerous to ignore sync failures based on attempting to read the data written. The read does not tell us whether the data was synced.
Test Plan: There was no test for the particular functionaility that was reverted. Keeping the test code from D7809 that tests whether we set the DB to be readonly when paranoid checks are enabled.
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52989
Summary: That line used to dereference `column_family_data`, which is nullptr if we're creating a column family.
Test Plan: `make -j check`
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D52881
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and make sure without the commit, write stop will happen, but with the commit, it will not happen.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
Summary:
List of changes:
1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size.
2) Remove unnecessary checks before calling delete operator.
3) Increase code correctness by using size_t type when getting vector's size.
4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage.
5) Fix various lint errors pointed out by 'arc lint'.
Test Plan:
Code review and build:
git diff
make clean
make -j 32 commit-prereq
arc lint
Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51849
Summary:
Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
Test Plan: unit tests, db bench
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb, yoshinorim
Differential Revision: https://reviews.facebook.net/D50475
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.
Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51117
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.
Test Plan: Add a unit test and run it in valgrind too.
Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51459
Summary: crash_test sometimes fails, hitting the add file overlapping assert. Add information in info logs help us to find the bug.
Test Plan: Run all test suites. Do some manual tests to make sure printing is correct.
Reviewers: kradhakrishnan, yhchiang, anthony, IslamAbdelRahman, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49017
Summary: We don't yet have a CI build for iOS, so our iOS compile gets broken sometimes. Most of the errors are from assumption that size_t is 64-bit, while it's actually 32-bit on some (all?) iOS platforms. This diff fixes the compile.
Test Plan:
TARGET_OS=IOS make static_lib
Observe there are no warnings
Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D49029
Summary: As above.
Test Plan: USE_CLANG=1 make check -j
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48981
We will need the log number to validate the recycle-style CRCs. The log
is helpful for debugging, but optional, as not all callers have it.
Signed-off-by: Sage Weil <sage@redhat.com>
When we recycle log files, we need to mix the log number into the CRC
for each record. Note that for logs that don't get recycled (like the
manifest), we always pass a log_number of 0 and false.
Signed-off-by: Sage Weil <sage@redhat.com>
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.
Test Plan: ran the GetPropertiesOfTablesInRange
Reviewers: rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48651
Summary:
Add kill points in:
1. after creating a file
2. before writing a manifest record
3. before syncing manifest
4. before creating a new current file
5. after creating a new current file
Test Plan: Run all current tests.
Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48855
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
Add options.compaction_pri, which specifies the policy about which file to compact first.
kCompactionPriByLargestSeq will compact oldest files first.
Verified the behavior in db_bench but did not write unit tests yet. Also need to make it settable through option string and dynamically changeable.
Test Plan: Will write unit tests
Reviewers: igor, rven, anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, MarkCallaghan
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45951
Summary:
This patch finally fixes the ColumnFamilyTest.ReadDroppedColumnFamily test. The test has been failing very sporadically and it was hard to repro. However, I managed to write a new tests that reproes the failure deterministically.
Here's what happens:
1. We start the flush for the column family
2. We check if the column family was dropped here: a3fc49bfdd/db/flush_job.cc (L149)
3. This check goes through, ends up in InstallMemtableFlushResults() and it goes into LogAndApply()
4. At about this time, we start dropping the column family. Dropping the column family process gets to LogAndApply() at about the same time as LogAndApply() from flush process
5. Drop column family goes through LogAndApply() first, marking the column family as dropped.
6. Flush process gets woken up and gets a chance to write to the MANIFEST. However, this is where it gets stuck: a3fc49bfdd/db/version_set.cc (L1975)
7. We see that the column family was dropped, so there is no need to write to the MANIFEST. We return OK.
8. Flush gets OK back from LogAndApply() and it deletes the memtable, thinking that the data is now safely persisted to sst file.
The fix is pretty simple. Instead of OK, we return ShutdownInProgress. This is not really true, but we have been using this status code to also mean "this operation was canceled because the column family has been dropped".
The fix is only one LOC. All other code is related to tests. I added a new test that reproes the failure. I also moved SleepingBackgroundTask to util/testutil.h (because I needed it in column_family_test for my new test). There's plenty of other places where we reimplement SleepingBackgroundTask, but I'll address that in a separate commit.
Test Plan:
1. new test
2. make check
3. Make sure the ColumnFamilyTest.ReadDroppedColumnFamily doesn't fail on Travis: https://travis-ci.org/facebook/rocksdb/jobs/79952386
Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46773
Summary:
Up to this point, the subcompactions that make up a compaction
job have been divided based on the key range of the L1 files, and each
subcompaction has handled the key range of only one file. However
DBOption.max_subcompactions allows the user to designate how many
subcompactions at most to perform. This patch updates the
CompactionJob::GetSubcompactionBoundaries() to determine these
divisions accordingly based on that option and other input/system factors.
The current approach orders the starting and/or ending keys of certain
compaction input files and then generates a histogram to approximate the
size covered by the key range between each consecutive pair of keys. Then
it groups these ranges into groups so that the sizes are approximately equal
to one another. The approach has also been adapted to work for universal
compaction as well instead of just for level-based compaction as it was before.
These subcompactions are then executed in parallel by locally spawning
threads, one for each. The results are then aggregated and the compaction
completed.
Test Plan: make all && make check
Reviewers: yhchiang, anthony, igor, noetzli, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43269
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, igor, sdong
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46233
Summary: TableCache::Get() puts parameters in the wrong places so that table readers created by Get() will not have the histogram updated.
Test Plan: Will write a unit test for that.
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D46035
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggreated table properties of a column family,
while the later returns the aggregated table properties
of the specified level N.
Test Plan: Added tests in db_test
Reviewers: igor, sdong, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45087
Summary:
Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
In the future, we can use threshold of this counter to replace soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes will be easier for users to reason about when should slow down and stopping than more abstract soft and hard rate limits.
Test Plan: Add unit tests
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44205
Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions
Test Plan: Unittests for the new property
Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44799
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
Summary: Add a new option that all LoadTableHandlers to use multiple threads to load files on DB Open and Recover
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)
Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43755
Summary:
Changed compaction_job_test to support better/more thorough
tests and added two tests. Also changed MockFileContents
to order using InternalKeyComparator.
Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check
Reviewers: sdong, rven, igor, yhchiang, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42837
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.
This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.
Test Plan: Add DBCompactionTest.SkipStatsUpdateTest
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: tnovak, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42843
Summary:
As of now compactions involving files from Level 0 and Level 1 are single
threaded because the files in L0, although sorted, are not range partitioned like
the other levels. This means that during L0-L1 compaction each file from L1
needs to be merged with potentially all the files from L0.
This attempt to parallelize the L0-L1 compaction assigns a thread and a
corresponding iterator to each L1 file that then considers only the key range
found in that L1 file and only the L0 files that have those keys (and only the
specific portion of those L0 files in which those keys are found). In this way
the overlap is minimized and potentially eliminated between different iterators
focusing on the same files.
The first step is to restructure the compaction logic to break L0-L1 compactions
into multiple, smaller, sequential compactions. Eventually each of these smaller
jobs will be run simultaneously. Areas to pay extra attention to are
# Correct aggregation of compaction job statistics across multiple threads
# Proper opening/closing of output files (make sure each thread's is unique)
# Keys that span multiple L1 files
# Skewed distributions of keys within L0 files
Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test
Reviewers: igor, noetzli, anthony, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42699
Summary:
Fixes T6548822. Added a new function for estimating the size of the live data
as proposed in the task. The value can be accessed through the property
rocksdb.estimate-live-data-size.
Test Plan:
There are two unit tests in version_set_test and a simple test in db_test.
make version_set_test && ./version_set_test;
make db_test && ./db_test gtest_filter=GetProperty
Reviewers: rven, igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41493
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.
We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.
Test Plan: make check (there might be some unit tests that depend on this behavior)
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41931
Summary: Try to allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena, instead of calling malloc and free.
Test Plan: valgrind check
Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40929
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.
The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work
hard_rate_limit is deprecated.
options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.
Test Plan: Add new unit tests in db_test
Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor
Reviewed By: igor
Subscribers: ikabiljo, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D36351
Summary:
When there are files marked for compaction after compactions, print extra messages to help debugging. Example:
2015/06/08-23:12:55.212855 7ff5013ff700 [default] [JOB 121] Generated table #75: 54 keys, 4807 bytes (need compaction)
2015/06/08-23:12:55.556194 7ff5013ff700 (Original Log Time 2015/06/08-23:12:55.556160) [default] compacted to: base level 1 max bytes base
10240 files[0 1 9 32 12 0 0 0] max score 0.96 (2 files need compaction), MB/sec: 0.0 rd, 0.1 wr, level 2, files in(1, 3) out(5) MB in(0.0,
0.0) out(0.0), read-write-amplify(11.3) write-amplify(5.7) OK, records in: 40, records dropped: 0
Test Plan:
Run test and see LOG files.
valgrind test DBTest.TablePropertiesNeedCompactTest
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39771
Summary:
It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file.
It can be used to allow users to suggest DB to clear up delete tombstones faster.
Test Plan: Add a unit test.
Reviewers: igor, yhchiang, kradhakrishnan, rven
Reviewed By: rven
Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39585
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true.
Test Plan: Add a unit test for it.
Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39099