Summary:
The Comparator passed to CollapsedRangeDelMap was not used for
operator less of the std::map `rep_` object contained in
CollapsedRangeDelMap. So the map was always sorted using the
default ByteWiseComparator, which seems wrong.
Passing the specified Comparator through for usage in that map
object fixes actual problems we were seeing with RangeDelete operations
that do not delete keys as expected when using a custom Comparator.
I found that the tests in current master crash when I run them locally,
both with and without my patch, at the very same location. I therefore
don't know if the patch breaks something else, but it seems to fix
RangeDeletion issues in our product that uses RocksDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4386
Differential Revision: D9916506
Pulled By: ajkr
fbshipit-source-id: 27bff8c775831f089dde8c5289df7343d88b2d66
Summary:
Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381
Differential Revision: D9879505
Pulled By: maysamyabandeh
fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
Summary:
To measure the results of upcoming DeleteRange v2 work, this commit adds
simple benchmarks for RangeDelAggregator. It measures the average time
for AddTombstones and ShouldDelete calls.
Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results:
Before #4014:
```
=======================
Results:
=======================
AddTombstones: 1356.28 us
ShouldDelete: 0.401732 us
```
Latest master:
```
=======================
Results:
=======================
AddTombstones: 740.82 us
ShouldDelete: 0.383271 us
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363
Differential Revision: D9881676
Pulled By: abhimadan
fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442
Summary:
1. Add override keyword to overridden virtual functions in EventListener
2. Fix a memory corruption that can happen during DB shutdown when in
read-only mode due to a background write error
3. Fix uninitialized buffers in error_handler_test.cc that cause
valgrind to complain
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4375
Differential Revision: D9875779
Pulled By: anand1976
fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
Summary:
Because `base_files` and `added_files` both are sorted, using a merge
operation to these two sorted arrays is more effective. The complexity
is reduced to linear time.
- optmize the merge complexity.
- move the `NDEBUG` of sorted `added_files` out of merge process.
Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4366
Differential Revision: D9833592
Pulled By: ajkr
fbshipit-source-id: dd32b67ebdca4c20e5e9546ab8082cecefe99fd0
Summary:
The code is dead in RocksDB as `log::Reader::initial_offset_` is always zero. We should delete it so we don't have to maintain it like in #4359.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4362
Differential Revision: D9817829
Pulled By: ajkr
fbshipit-source-id: 474a2c679e5bd273b40608f3a5332931d9eefe6d
Summary:
Please consider this small PR providing access to the `MemoryUsage::GetApproximateMemoryUsageByType` function in plain C API. Actually I'm working on Go application and now trying to investigate the reasons of high memory consumption (#4313). Go [wrappers](https://github.com/tecbot/gorocksdb) are built on the top of Rocksdb C API. According to the #706, `MemoryUsage::GetApproximateMemoryUsageByType` is considered as the best option to get database internal memory usage stats, but it wasn't supported in C API yet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4340
Differential Revision: D9655135
Pulled By: ajkr
fbshipit-source-id: a3d2f3f47c143ae75862fbcca2f571ea1b49e14a
Summary:
`RangeDelAggregator::AddTombstones` contained an assertion which stated that, if a range tombstone extended past the largest key in the sstable, then `FileMetaData::largest` must have a sentinel sequence number of `kMaxSequenceNumber`, which implies that the tombstone's end key is safe to truncate. However, `largest` will not be a sentinel key when the next sstable in the level's smallest key is equal to the current sstable's largest key, which caused the assertion to fail.
The assertion must hold for the truncation to be safe, so it has been moved to an additional check on end-key truncation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4356
Differential Revision: D9760891
Pulled By: abhimadan
fbshipit-source-id: 7c20c3885cd919dcd14f291f88fd27aa33defebc
Summary:
TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346
Differential Revision: D9759149
Pulled By: maysamyabandeh
fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b
Summary:
In C++ 11, the order of argument and move evaluation in a statement such
as below is unspecified -
foo(a.b).bar(std::move(a))
The compiler is free to evaluate std::move(a) first, and then a.b is unspecified.
In C++ 17, this will be safe if a draft proposal around function
chaining rules is accepted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4348
Differential Revision: D9688810
Pulled By: anand1976
fbshipit-source-id: e4651d0ca03dcf007e50371a0fc72c0d1e710fb4
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.
Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339
Differential Revision: D9654990
Pulled By: ajkr
fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
Summary:
This is a followup to #4311. Checking `!RangeDelAggregator::IsEmpty()` before opening a dedicated range tombstone SST did not properly prevent empty SSTs from being generated. That's because it relies on `CollapsedRangeDelMap::Size`, which had an underflow bug when the map was empty. This PR fixes that underflow bug.
Also fixed an uninitialized variable in db_stress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4336
Differential Revision: D9600080
Pulled By: ajkr
fbshipit-source-id: bc6980ca79d2cd01b825ebc9dbccd51c1a70cfc7
Summary:
Basically at the moment it seems it's possible to cause write stall by calling flush (either manually vis DB::Flush(), or from Backup Engine directly calling FlushMemTable() while background flush may be already happening.
One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to separate method and call it from FlushMemTable.
This is draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297
Differential Revision: D9420705
Pulled By: mikhail-antonov
fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc
Summary: For the CURRENT file forged during checkpoint, we were forgetting to `fsync` or `fdatasync` it after its creation. This PR fixes it.
Differential Revision: D9525939
Pulled By: ajkr
fbshipit-source-id: a505483644026ee3f501cfc0dcbe74832165b2e3
Summary:
According to 4848bd0c4e/db/log_reader.cc (L355), the original text is misleading when describing the layout of RecyclableLogHeader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4315
Differential Revision: D9505284
Pulled By: riversand963
fbshipit-source-id: 79994c37a69e7003f03453e7efc0186feeafa609
Summary:
This PR fixes issue 3842. We drop deletion markers iff
1. We are the bottom most level AND
2. All other occurrences of the key are in the same snapshot range as the delete
I've also enhanced db_stress_test to add an option that does a full compare of the keys. This is done by a single thread (thread # 0). For tests I've run (so far)
make check -j64
db_stress
db_stress --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify that new code doesnt break existing tests */
./db_stress --compare_full_db_state_snapshot=true --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify new test code */
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4289
Differential Revision: D9491165
Pulled By: shrikanthshankar
fbshipit-source-id: ce144834f31736c189aaca81bed356ba990331e2
Summary:
RocksDB currently queues individual column family for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB.
This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it.
This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952
Differential Revision: D8529933
Pulled By: riversand963
fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539
Summary:
I have a PR to start calling `OnTableFileCreated` for empty SSTs: #4307. However, it is a behavior change so should not go into a patch release.
This PR adds back a check to make sure range deletions at least exist before starting file creation. This PR should be safe to backport to earlier versions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4311
Differential Revision: D9493734
Pulled By: ajkr
fbshipit-source-id: f0d43cda4cfd904f133cfe3a6eb622f52a9ccbe8
Summary:
The API comment on `OnTableFileCreationStarted` (b6280d01f9/include/rocksdb/listener.h (L331-L333)) led users to believe a call to `OnTableFileCreationStarted` will always be matched with a call to `OnTableFileCreated`. However, we were skipping the `OnTableFileCreated` call in one case: no error happens but also no file is generated since there's no data.
This PR adds the call to `OnTableFileCreated` for that case. The filename will be "(nil)" and the size will be zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4307
Differential Revision: D9485201
Pulled By: ajkr
fbshipit-source-id: 2f077ec7913f128487aae2624c69a50762394df6
Summary:
Memtables are selected for flushing by the flush job. Currently we
have listener which is invoked when memtables for a column family are
flushed. That listener does not indicate which memtable was flushed in
the notification. If clients want to know if particular data in the
memtable was retired, there is no straight forward way to know this.
This method will help users who implement memtablerep factory and extend
interface for memtablerep, to know if the data in the memtable was
retired.
Another option that was tried, was to depend on memtable destructor to
be called after flush to mark that data was persisted. This works all
the time but sometimes there can huge delays between actual flush
happening and memtable getting destroyed. Hence, if anyone who is
waiting for data to persist will have to wait that longer.
It is expected that anyone who is implementing this method to have
return quickly as it blocks RocksDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4304
Reviewed By: riversand963
Differential Revision: D9472312
Pulled By: gdrane
fbshipit-source-id: 8e693308dee749586af3a4c5d4fcf1fa5276ea4d
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039
Differential Revision: D8670178
Pulled By: riversand963
fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
Summary:
User reported (https://github.com/facebook/rocksdb/issues/4168) that when opening RocksDB in read-only mode, some statistics are not correctly reported. After some investigation, we believe the following counters are indeed not reported during Get() call in a read-only DB:
rocksdb.memtable.hit
rocksdb.memtable.miss
rocksdb.number.keys.read
rocksdb.bytes.read
As well as histogram rocksdb.bytes.per.read
and perf context get_read_bytes
This PR will add the necessary counter reporting logic in the Get() call path
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4260
Differential Revision: D9476431
Pulled By: miasantreble
fbshipit-source-id: 7ab409d4e59df05d09ae8b69fe75554e5aa240d6
Summary:
Clang analyze is not happy in two pieces of code, with "Potential memory leak". No idea what the problem but slightly changing the code makes clang happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4292
Differential Revision: D9413555
Pulled By: siying
fbshipit-source-id: 9428c9d3664530c72129feefd135ee63d8386137
Summary:
During recovery, RocksDB is able to handle version edits that belong to group commits.
This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3945
Differential Revision: D8529122
Pulled By: riversand963
fbshipit-source-id: 57cb0f9cc55ecca684a837742d6626dc9c07f37e
Summary:
This PR addresses issue #3865 and implements the following approach to fix it:
- adds `MergeContext::GetOperandsDirectionForward` and `MergeContext::GetOperandsDirectionBackward` to query merge operands in a specific order
- `MergeContext::GetOperands` becomes a shortcut for `MergeContext::GetOperandsDirectionForward`
- pass `MergeContext::GetOperandsDirectionBackward` to `MergeOperator::ShouldMerge` and document the order
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4266
Differential Revision: D9360750
Pulled By: sagar0
fbshipit-source-id: 20cb73ff017760b062ecdcf4382560767086e092
Summary:
Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170
Differential Revision: D8962513
Pulled By: siying
fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407
Summary:
Revert this change. Not generating the OnTableFileCreated() notification for a 0 byte SST on flush breaks the assumption that every OnTableFileCreationStarted() notification is followed by a corresponding OnTableFileCreated().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4263
Differential Revision: D9285623
Pulled By: anand1976
fbshipit-source-id: 808c3dcd498b4b4f4ed4be947a29a24b2296aa8d
Summary:
In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file.
Replay of Iterator is not supported in the current implementation.
Tested with trace_analyzer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228
Differential Revision: D9201381
Pulled By: zhichao-cao
fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab
Summary:
The pair of ROCKSDB_LITE condition inclusion is redundant, it is already inside the #ifndef ROCKSDB_LITE. Remove them to void confusion.
Tested by make asan_check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4254
Differential Revision: D9281652
Pulled By: zhichao-cao
fbshipit-source-id: 06bf7641ede71391f21f6a3fe37fbd13f0e2a43a
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4.
The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585 rocksdb read-only range=100
Format 3:
19569 rocksdb read-only range=100
Format 4:
19352 rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
Differential Revision: D8361343
Pulled By: maysamyabandeh
fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
Summary:
We add two subcommands `write_extern_sst` and `ingest_extern_sst` to ldb. This PR avoids changing existing code because we hope to cherry-pick to earlier releases to support compatibility check for external SST file ingestion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4205
Differential Revision: D9112711
Pulled By: riversand963
fbshipit-source-id: 7cae88380d4de86da8440230e87eca66755648e4
Summary:
In the current code, `error_msg` is pointing to the inner buffer of a temporary std::string object. When `error_msg` is used to construct the error message, that array is already released. This PR will fix this bug by copying the string to a local variable.
Fixes https://github.com/facebook/rocksdb/issues/4239
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4240
Differential Revision: D9204334
Pulled By: miasantreble
fbshipit-source-id: 0ac599e166ae0a4ec413e32d8b8853d7c5fba878
Summary:
The test has become complicated over the years and hard to reason about the corner cases that makes the test flaky. The patch simplifies the test and also fixes some probable synchronization issues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4235
Differential Revision: D9187995
Pulled By: maysamyabandeh
fbshipit-source-id: 53c7b060f14367e5a9e361014578c26debfe3d27
Summary:
So that we can act accordingly on blob index entries
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4233
Differential Revision: D9190205
Pulled By: yiwu-arbug
fbshipit-source-id: e5b84d5b41e44fa7a76762f1f7b0305369bb3a0c
Summary:
There are two issues with `VisibleToActiveSnapshot`:
1. If there are no snapshots, `oldest_snapshot` will be 0 and `VisibleToActiveSnapshot` will always return true. Since the method is used to decide whether it is safe to delete obsolete files, obsolete file won't be able to delete in this case.
2. The `auto` keyword of `auto snapshots = db_impl_->snapshots()` translate to a copy of `const SnapshotList` instead of a reference. Since copy constructor of `SnapshotList` is not defined, using the copy may yield unexpected result.
Issue 2 actually hide issue 1 from being catch by tests. During test `snapshots.empty()` can return false while it should actually be empty, and `snapshots.oldest()` return an invalid address, making `oldest_snapshot` being some random large number.
The issue was originally reported by BlobDB early adopter at Kuaishou.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4236
Differential Revision: D9188706
Pulled By: yiwu-arbug
fbshipit-source-id: a0f2624b927cf9bf28c1bb534784fee5d106f5ea
Summary:
HashMayMatch is related to AddKey() instead of CreateFilter().
Also applies some minor Fixes#4191#4200#3910
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4202
Differential Revision: D9180945
Pulled By: maysamyabandeh
fbshipit-source-id: 6f07b81c5bb9bda5c0273475b486ba8a030471e6
Summary:
In the past, we assume that a job modifies a single column family. Therefore, a job can create at most one superversion since each superversion corresponds to one column family. This assumption leads to the fact that a `JobContext` has only one member variable called `superversion_context`.
Now we want to support group flush of column families, indicating that each job can create multiple superversions. Therefore, we need to make the following change to accommodate this new feature.
Add a vector of `SuperVersionContext` to `JobContext` to support installing
superversions for multiple column families in one job context.
This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3949
Differential Revision: D8864895
Pulled By: riversand963
fbshipit-source-id: 5937a48817276370d3c8172db9c8aafc826d97ca
Summary:
The current verification logic does not consider the case in which multiple
threads (foreground and background) may execute `PurgeObsoleteFiles` function
simultaneously. Each invocation will trigger the callback adding elements to
a vector. Then we verify the elements in the vector, which can fail sometimes.
The solution is to give up checking the elements. Instead, we check the number
of OPTIONS file in the database dir.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4218
Differential Revision: D9128727
Pulled By: riversand963
fbshipit-source-id: 2b13b705fb21bc0ddd41940c4ec9b6b0c8d88224
Summary:
`CollapsedRangeDelMap` internally uses seqno zero as a sentinel value to
denote a gap between range tombstones or the end of range tombstones. It
therefore expects to never have consecutive sentinel tombstones.
However, since `DeleteRange` is now supported in `SstFileWriter`, an
ingested file may contain range tombstones, and that ingested file may
be assigned global seqno zero. When such tombstones are added to the
collapsed map, they resemble sentinel tombstones due to having seqno
zero. Then, the invariant mentioned above about never having consecutive
sentinel tombstones can be violated.
The symptom of this violation was dereferencing the `end()` iterator
(#4204). The fix in this PR is to not add range tombstones with seqno
zero to the collapsed map. They're not needed anyways since they can't
possibly cover anything (in case of a key and a range tombstone with the
same seqno, the key is visible).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4216
Differential Revision: D9121716
Pulled By: ajkr
fbshipit-source-id: f5b78a70bea9527354603ea7ac8542a7e2b6a210
Summary:
A framework for tracing and replaying RocksDB operations.
A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.
- Column-families are supported
- Multi-threaded tracing is supported.
- TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
- This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.
Currently supported DB operations:
- Writes:
-- Put
-- Merge
-- Delete
-- SingleDelete
-- DeleteRange
-- Write
- Reads:
-- Get (point lookups)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837
Differential Revision: D7974837
Pulled By: sagar0
fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172
Differential Revision: D8961465
Pulled By: riversand963
fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
Summary:
If crash happen after a hard link established, Recover function may reuse the file number that has already assigned to the internal file, and this will overwrite the external file. To protect the external file, we have to make sure the file number will never being reused.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4099
Differential Revision: D9034092
Pulled By: riversand963
fbshipit-source-id: 3f1a737440b86aa2ef01673e5013aacbb7c33e28
Summary:
995fcf7573 has a bug: ReleaseFileNumberFromPendingOutputs() added is not protected by the DB mutex. Fix it by grabbing the lock for this operation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4189
Differential Revision: D9015447
Pulled By: siying
fbshipit-source-id: b8506e09a96c3f95a6fe32b5ca5fcdb9bee88937
Summary:
92ee3350e0 introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187
Differential Revision: D8996600
Pulled By: siying
fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1
Summary:
This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.
Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`.
A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104
Differential Revision: D8785717
Pulled By: lth
fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
Summary:
Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490
Differential Revision: D6969154
Pulled By: miasantreble
fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
Summary:
Lint is not happy with some new code recently committed. Format them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161
Differential Revision: D8940582
Pulled By: siying
fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7
Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others.
Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10%
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156
Differential Revision: D8916847
Pulled By: siying
fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
Summary:
PR #3944 introduces group commit of `VersionEdit` in MANIFEST. The
implementation has a bug. When updating the log file number of each column
family, we must consider only `VersionEdit`s that operate on the same column
family. Otherwise, a column family may accidentally set its log file number
higher than actual value, indicating that log files with smaller file number
will be ignored, thus causing some updates to be lost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4157
Differential Revision: D8916650
Pulled By: riversand963
fbshipit-source-id: 8f456cf688f17bf35ad87b38e30e899aa162f201
Summary: Windows requires new/delete for memory allocations to be overriden. Refactor to be less intrusive.
Differential Revision: D8878047
Pulled By: siying
fbshipit-source-id: 35f2b5fec2f88ea48c9be926539c6469060aab36
Summary:
I am temporarily disabling DBFlushTest.SyncFail and DBTest.GroupCommitTest tests on Travis until we figure out the root-cause. These tests will still continue to run locally though.
I haven't been able to reproduce these failures locally so far (even on a [local Travis environment](https://docs.travis-ci.com/user/common-build-problems/#Troubleshooting-Locally-in-a-Docker-Image) ).
These tests are failing way too frequently causing everyone to wonder why their PR failed on travis, and waste time in debugging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4154
Differential Revision: D8907258
Pulled By: sagar0
fbshipit-source-id: f40068b16e9245fb3791b6a4796435d1ce1ed205
Summary:
Fix a minor data race in DBSSTTest.DeleteSchedulerMultipleDBPaths reported by TSAN
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4146
Differential Revision: D8880945
Pulled By: siying
fbshipit-source-id: 25c632f685757735c59ad4ff26b2f346a443a446
Summary:
Fix the issue when pipelined write is enabled, writers can get stuck indefinitely and not able to finish the write. It can show with the following example: Assume there are 4 writers W1, W2, W3, W4 (W1 is the first, W4 is the last).
T1: all writers pending in WAL writer queue:
WAL writer queue: W1, W2, W3, W4
memtable writer queue: empty
T2. W1 finish WAL writer and move to memtable writer queue:
WAL writer queue: W2, W3, W4,
memtable writer queue: W1
T3. W2 and W3 finish WAL write as a batch group. W2 enter ExitAsBatchGroupLeader and move the group to memtable writer queue, but before wake up next leader.
WAL writer queue: W4
memtable writer queue: W1, W2, W3
T4. W1, W2, W3 finish memtable write as a batch group. Note that W2 still in the previous ExitAsBatchGroupLeader, although W1 have done memtable write for W2.
WAL writer queue: W4
memtable writer queue: empty
T5. The thread corresponding to W3 create another writer W3' with the same address as W3.
WAL writer queue: W4, W3'
memtable writer queue: empty
T6. W2 continue with ExitAsBatchGroupLeader. Because the address of W3' is the same as W3, the last writer in its group, it thinks there are no pending writers, so it reset newest_writer_ to null, emptying the queue. W4 and W3' are deleted from the queue and will never be wake up.
The issue exists since pipelined write was introduced in 5.5.0.
Closes#3704
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4143
Differential Revision: D8871599
Pulled By: yiwu-arbug
fbshipit-source-id: 3502674e51066a954a0660257e24ac588f815e2a
Summary:
If bulkload fails for an input error, the pending output file number wasn't released. This bug can cause all future files with larger number than the current number won't be deleted, even they are compacted. This commit fixes the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4145
Differential Revision: D8877900
Pulled By: siying
fbshipit-source-id: 080be92a23d43305ca1e13fe1c06eb4cd0b01466
Summary:
Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141
Differential Revision: D8866437
Pulled By: maysamyabandeh
fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b
Summary:
Fixes#3391.
This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778
Differential Revision: D8821836
Pulled By: anand1976
fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
Summary:
Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is so to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingFiles uses user-keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.
Note that this conflicts with
https://github.com/facebook/rocksdb/pull/2769 and cases
`DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to
fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050
Differential Revision: D8844423
Pulled By: ajkr
fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: ``` ~/gtest-parallel/gtest-parallel ./table_test```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135
Differential Revision: D8846653
Pulled By: maysamyabandeh
fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
Summary:
Picked up a task to convert this to use the gtest framework. It can't be this simple, can it?
It works, but should all the std::cout be removed?
```
[$] ~/git/rocksdb [gft !]: ./merge_test
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MergeTest
[ RUN ] MergeTest.MergeDbTest
Test read-modify-write counters...
a: 3
1
2
a: 3
b: 1225
3
Compaction started ...
Compaction ended
a: 3
b: 1225
Test merge-based counters...
a: 3
1
2
a: 3
b: 1225
3
Test merge in memtable...
a: 3
1
2
a: 3
b: 1225
3
Test Partial-Merge
Test merge-operator not set after reopen
[ OK ] MergeTest.MergeDbTest (93 ms)
[ RUN ] MergeTest.MergeDbTtlTest
Opening database with TTL
Test read-modify-write counters...
a: 3
1
2
a: 3
b: 1225
3
Compaction started ...
Compaction ended
a: 3
b: 1225
Test merge-based counters...
a: 3
1
2
a: 3
b: 1225
3
Test merge in memtable...
Opening database with TTL
a: 3
1
2
a: 3
b: 1225
3
Test Partial-Merge
Opening database with TTL
Opening database with TTL
Opening database with TTL
Opening database with TTL
Test merge-operator not set after reopen
[ OK ] MergeTest.MergeDbTtlTest (97 ms)
[----------] 2 tests from MergeTest (190 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (190 ms total)
[ PASSED ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4114
Differential Revision: D8822886
Pulled By: gfosco
fbshipit-source-id: c299d008e883c3bb911d2b357a2e9e4423f8e91a
Summary:
1. Move kUniversalSubcompactions up before kEnd in db_test_util.h, so
tests that cycle through all the option_configs include this
2. Skip kUniversalSubcompactions wherever kUniversalCompaction and
kUniversalCompactionMultilevel are skipped
Related to #3935
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4125
Differential Revision: D8828637
Pulled By: anand1976
fbshipit-source-id: 650dee15fd27d85281cf9bb4ca8ab460e04cac6f
Summary:
- Avoid `strdup` to use jemalloc on Windows
- Use `size_t` for consistency
- Add GCC 8 to Travis
- Add CMAKE_BUILD_TYPE=Release to Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433
Differential Revision: D6837948
Pulled By: sagar0
fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f
Summary:
Reduce the number of key ranges in `ExternalSSTFileTest.OverlappingRanges` so
that the test completes in shorter time to avoid timeouts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4127
Differential Revision: D8827851
Pulled By: riversand963
fbshipit-source-id: a16387b0cc92a7c872b1c50f0cfbadc463afc9db
Summary:
Reduce #iterations from 5000 to 1000 so that
`ExternalSSTFileTest.CompactDuringAddFileRandom` can finish faster.
On the one hand, 5000 iterations does not seem to improve the quality of unit
test in comparison with 1000. On the other hand, long running tests should belong to stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4123
Differential Revision: D8822514
Pulled By: riversand963
fbshipit-source-id: 0f439b8d5ccd9a4aed84638f8bac16382de17245
Summary:
This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.
I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.
With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.
ajkr please take a look when you have a minute!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014
Differential Revision: D8773253
Pulled By: ajkr
fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
Summary:
Run the basic range deletion tests against the standard set of
configurations. This testing exposed that files with hash indexes and
partitioned indexes were not handling the case where the file contained
only range deletions--i.e., where the index was empty.
Additionally file a TODO about the fact that range deletions are broken
when allow_mmap_reads = true is set.
/cc ajkr nvanbenschoten
Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021
Differential Revision: D8811860
Pulled By: ajkr
fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343
Summary:
Prior to this PR, there was a race condition between `DBImpl::SetOptions` and `BackupEngine::CreateNewBackup`, as illustrated below.
```
Time thread 1 thread 2
| CreateNewBackup -> GetLiveFiles
| SetOptions -> RenameTempFileToOptionsFile
| SetOptions -> RenameTempFileToOptionsFile
| SetOptions -> RenameTempFileToOptionsFile // unlink oldest OPTIONS file
| copy the oldest OPTIONS // IO error!
V
```
Proposed fix is to check the value of `DBImpl::disable_obsolete_files_deletion_` before calling `DeleteObsoleteOptionsFiles`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4108
Differential Revision: D8796360
Pulled By: riversand963
fbshipit-source-id: 02045317f793ea4c7d4400a5bf333b8502fa3e82
Summary:
This adds support for recovering WriteUnprepared transactions through the following changes:
- The information in `RecoveredTransaction` is extended so that it can reference multiple batches.
- `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not.
- `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway.
Commit/Rollback of live transactions is still unimplemented and will come later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078
Differential Revision: D8703382
Pulled By: lth
fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a
Summary:
`std::map::at(key)` throws std::out_of_range if key does not exist. Current
code does not handle this. Although this case is unlikely, I feel it's safe to
use `std::map::find`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4098
Differential Revision: D8753865
Pulled By: riversand963
fbshipit-source-id: 9a9ba43badb0fb5e0d24cd87903931fd12f3f8ec
Summary:
clang analyze is giving the following warnings:
> db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null
} else if (meta->smallest.size() > 0) {
^~~~~~~~~~~~~~~~~~~~~
db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta')
meta->marked_for_compaction = sub_compact->builder->NeedCompact();
~~~~
db/version_set.cc:2770:26: warning: Called C++ object pointer is null
uint32_t cf_id = last_writer->cfd->GetID();
^~~~~~~~~~~~~~~~~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/4072
Differential Revision: D8685852
Pulled By: miasantreble
fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085
Summary:
This adds a new WAL marker of type kTypeBeginUnprepareXID.
Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy.
Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not.
Closes https://github.com/facebook/rocksdb/pull/4069
Differential Revision: D8675099
Pulled By: lth
fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance
This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
Closes https://github.com/facebook/rocksdb/pull/3997
Differential Revision: D8653831
Pulled By: anand1976
fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
Summary:
This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Closes https://github.com/facebook/rocksdb/pull/3944
Differential Revision: D8432536
Pulled By: riversand963
fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb
Summary:
Lite does not support readonly DBs.
Closes https://github.com/facebook/rocksdb/pull/4070
Differential Revision: D8677858
Pulled By: maysamyabandeh
fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa
Summary:
with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests.
Closes https://github.com/facebook/rocksdb/pull/4067
Differential Revision: D8669002
Pulled By: miasantreble
fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
Summary:
…ression
For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts
Closes https://github.com/facebook/rocksdb/pull/3985
Reviewed By: riversand963
Differential Revision: D8385911
Pulled By: zhichao-cao
fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053
Differential Revision: D8662546
Pulled By: maysamyabandeh
fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.
There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955
Differential Revision: D8286688
Pulled By: lth
fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.
/cc ajkr nvanbenschoten
Closes https://github.com/facebook/rocksdb/pull/4016
Differential Revision: D8527575
Pulled By: ajkr
fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
Summary:
Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file.
This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at.
Closes https://github.com/facebook/rocksdb/pull/3899
Differential Revision: D8157459
Pulled By: miasantreble
fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
Summary:
Universal size-amp-triggered compaction was pulling the final sorted run into the compaction without checking whether any of its files are already being compacted. When all compactions are automatic, it is safe since it verifies the second-last sorted run is not already being compacted, which implies the last sorted run is also not being compacted (in automatic compaction multiple sorted runs are always compacted together). But with manual compaction, files in the last sorted run can be compacted independently, so the last sorted run also must be checked.
We were seeing the below assertion failure in `db_stress`. Also the test case included in this PR repros the failure.
```
db_universal_compaction_test: db/compaction.cc:312: void rocksdb::Compaction::MarkFilesBeingCompacted(bool): Assertion `mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted' failed.
Aborted (core dumped)
```
Closes https://github.com/facebook/rocksdb/pull/4055
Differential Revision: D8630094
Pulled By: ajkr
fbshipit-source-id: ac3b30a874678b76e113d4f6c42c1260411b08f8
Summary:
Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`.
In 7103559f49, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain.
Closes https://github.com/facebook/rocksdb/pull/4048
Differential Revision: D8601056
Pulled By: sagar0
fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3
Summary:
Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037
Differential Revision: D8596218
Pulled By: maysamyabandeh
fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
Summary:
For example calling CompactionFilter is always timed and gives the user no way to disable.
This PR will disable the timer if `Statistics::stats_level_` (which is part of DBOptions) is `kExceptDetailedTimers`
Closes https://github.com/facebook/rocksdb/pull/4029
Differential Revision: D8583670
Pulled By: miasantreble
fbshipit-source-id: 913be9fe433ae0c06e88193b59d41920a532307f
Summary:
We potentially need this information for tracing, profiling and diagnosis.
Closes https://github.com/facebook/rocksdb/pull/4026
Differential Revision: D8555214
Pulled By: riversand963
fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2
Summary:
b555ed30a4 makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound.
For example, if an SST file has a data block containing following keys : {a, z}
The user sets the upper bound to be "x", and it executed following queries:
Seek("b")
Seek("c")
Seek("d")
Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again.
To prevent this regression case, we keep the current data block iterator if it is upper bound.
Closes https://github.com/facebook/rocksdb/pull/4004
Differential Revision: D8463192
Pulled By: siying
fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81
Summary:
The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843.
This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy.
Closes https://github.com/facebook/rocksdb/pull/3996
Differential Revision: D8416186
Pulled By: miasantreble
fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28
Summary:
Don't call the OnTableFileCreated listener callback when a 0 size SST
file gets created by Flush. Doing so causes an assertion failure in db_stress. It is also not correct behavior as we call env->DeleteFile() for such files right before the notification.
Closes https://github.com/facebook/rocksdb/pull/4003
Differential Revision: D8461385
Pulled By: anand1976
fbshipit-source-id: ae92d4f921c2e2cff981ad58f4929ed8b609f35d
Summary:
Add a new DB property to DB::GetProperty(), which returns the option.statistics. Test is updated to pass.
Closes https://github.com/facebook/rocksdb/pull/3966
Differential Revision: D8311139
Pulled By: zhichao-cao
fbshipit-source-id: ea78f4727358c807b0e5a0ea62e09defb10ad9ac
Summary:
Verify table will load SST into `TableCache`
it occupy memory & `TableCache`‘s capacity ...
but no logic use them
it's unnecessary ...
so , we verify them after all sub compact finished
Closes https://github.com/facebook/rocksdb/pull/3979
Differential Revision: D8389946
Pulled By: ajkr
fbshipit-source-id: 54bd4f474f9e7b3accf39c3068b1f36a27ec4c49
Summary:
The SST file sizes changed slightly after the improvement of PR #3970
which reduces the size of the properties block. Before PR #3970 a size
ratio compaction included all of the first four flushed files but it
only includes two files after. We increase the size_ratio universal
compaction option to make that compaction include all four files again.
Closes https://github.com/facebook/rocksdb/pull/3995
Differential Revision: D8426925
Pulled By: fgwu
fbshipit-source-id: 1429c38672e9f4fb4d4881fd4b06db45c4861d62
Summary:
A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong:
Seek(key);
if (!Valid()) SeekToLast();
Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases.
This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev().
Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case.
Closes https://github.com/facebook/rocksdb/pull/3989
Differential Revision: D8385422
Pulled By: siying
fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4
Summary:
format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters.
Closes https://github.com/facebook/rocksdb/pull/3958
Differential Revision: D8294615
Pulled By: maysamyabandeh
fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe
Summary:
CompactFiles checked whether the existing files conflicted with the chosen compaction. But it missed checking whether future files would conflict, i.e., when another compaction was simultaneously writing new files to the same range at the same output level.
Closes https://github.com/facebook/rocksdb/pull/3926
Differential Revision: D8218996
Pulled By: ajkr
fbshipit-source-id: 21cb00a6fed4c8c62d3ed2ff810962e6bdc2fdfb
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying .
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954
Differential Revision: D8272041
Pulled By: miasantreble
fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
Summary:
format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942
Differential Revision: D8238413
Pulled By: maysamyabandeh
fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
Summary:
Previous commit https://github.com/facebook/rocksdb/pull/3935 unhide a few test options which includes kDirectIO. However it's not supported by RocksDB lite. Need to hide this option from the lite build.
Closes https://github.com/facebook/rocksdb/pull/3943
Differential Revision: D8242757
Pulled By: miasantreble
fbshipit-source-id: 1edfad3a5d01a46bfb7eedee765981ebe02c500a
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).
This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881
Differential Revision: D8076288
Pulled By: ajkr
fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
Summary:
DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests.
Closes https://github.com/facebook/rocksdb/pull/3935
Differential Revision: D8230748
Pulled By: maysamyabandeh
fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db
Summary:
DBImpl::FindObsoleteFiles() may call GetChildren() multiple times if different CFs are on the same path. Fix it.
Closes https://github.com/facebook/rocksdb/pull/3885
Differential Revision: D8084634
Pulled By: siying
fbshipit-source-id: b471fbc251f6a05e9243304dc14c0831060cc0b0
Summary:
In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind.
Closes https://github.com/facebook/rocksdb/pull/3924
Differential Revision: D8210184
Pulled By: siying
fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a
Summary:
This is still WIP, but I'm hoping for early feedback on the overall approach.
This patch implements deletion triggered compaction, which till now only
worked for leveled, for universal style. SST files are marked for
compaction by the CompactOnDeletionCollertor table property. This is
expected to be used when free disk space is low and the user wants to
reclaim space by deleting a bunch of keys. The deletions are expected to
be dense. In such a situation, we want to avoid a full compaction due to
its space overhead.
The strategy used in this case is similar to leveled. We pick one file
from the set of files marked for compaction. We then expand the inputs
to a clean cut on the same level, and then pick overlapping files from
the next non-mepty level. Picking files from the next level can cause
the key range to expand, and we opportunistically expand inputs in the
source level to include files wholly in this key range.
The main side effect of this is that it breaks the property of no time
range overlap between levels. This shouldn't break any functionality.
Closes https://github.com/facebook/rocksdb/pull/3860
Differential Revision: D8124397
Pulled By: anand1976
fbshipit-source-id: bfa2a9dd6817930e991b35d3a8e7e61304ed3dcf
Summary:
Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
Closes https://github.com/facebook/rocksdb/pull/3894
Differential Revision: D8118775
Pulled By: maysamyabandeh
fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
Summary:
Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609).
There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change.
To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed.
Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised.
Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion.
Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler.
Test plan
```
$ make -j16
$ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
```
Expected behavior:
All tests should pass.
Closes https://github.com/facebook/rocksdb/pull/3898
Differential Revision: D8149421
Pulled By: riversand963
fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
Summary:
Add flush_before_backup to rocksdb_backup_engine_create_new_backup. make c api able to control the flush before backup behavior.
Closes https://github.com/facebook/rocksdb/pull/3897
Differential Revision: D8157676
Pulled By: ajkr
fbshipit-source-id: 88998c62f89f087bf8672398fd7ddafabbada505
Summary:
Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache.
Closes https://github.com/facebook/rocksdb/pull/3877
Differential Revision: D8100895
Pulled By: yiwu-arbug
fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da
Summary:
Explicitly specify the underlying type of enums help developers understand the physical storage.
Closes https://github.com/facebook/rocksdb/pull/3892
Differential Revision: D8107027
Pulled By: riversand963
fbshipit-source-id: a00efecbba46df4a3c8eed0994a2d4972ad1a1d3
Summary:
DBTest.GroupCommitTest would often fail when run under valgrind because its sleeps were insufficient to guarantee a group commit had multiple entries. Instead we can use sync point to force a leader to wait until a non-leader thread has enqueued its work, thus guaranteeing a leader can do group commit work for multiple threads.
Closes https://github.com/facebook/rocksdb/pull/3883
Differential Revision: D8079429
Pulled By: ajkr
fbshipit-source-id: 61dc50fad29d2c85547842f681288de60fa29049
Summary:
By using WritableFileWriter rather than WritableFile directly, we can buffer multiple Append() calls to one write() file system call, which will be expensive to underlying Env without its own write buffering.
Closes https://github.com/facebook/rocksdb/pull/3882
Differential Revision: D8080673
Pulled By: siying
fbshipit-source-id: e0db900cb3c178166aa738f3985db65e3ae2cf1b
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601
Differential Revision: D7253114
Pulled By: miasantreble
fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
Summary:
Change `keys_` from `set<string>` to `vector<set<string>>` so that each column
family's keys are stored in one set.
ajkr When you have a chance, can you PTAL? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3871
Differential Revision: D8056447
Pulled By: riversand963
fbshipit-source-id: 650d0f9cad02b1bc005fc329ad76edbf053e6386
Summary:
`RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion.
Closes https://github.com/facebook/rocksdb/pull/3875
Differential Revision: D8063667
Pulled By: ajkr
fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41
Summary:
Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the the first key is larger than the second.
Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty.
Closes https://github.com/facebook/rocksdb/pull/3836
Differential Revision: D7959762
Pulled By: siying
fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683
Summary:
this will fix the failing clang_check test
Closes https://github.com/facebook/rocksdb/pull/3868
Differential Revision: D8050880
Pulled By: miasantreble
fbshipit-source-id: 749932e2e4025f835c961c068d601e522a126da6
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
log_ contract specifies that it should not be modified unless both mutex_ and log_write_mutex_ are held. log_.erase however does that with only holding mutex_. This causes a race condition with two_write_queues since logs_.back is read with holding only log_write_mutex_ (which is correct according to logs_ contract) but logs_.erase is called concurrently. This is probably the cause of logs_.back returning nullptr in https://github.com/facebook/rocksdb/issues/3852 although I could not reproduce it.
Fixes https://github.com/facebook/rocksdb/issues/3852
Closes https://github.com/facebook/rocksdb/pull/3859
Differential Revision: D8026103
Pulled By: maysamyabandeh
fbshipit-source-id: ee394e00fe4aa520d884c5ef87981e9d6b5ccb28
Summary:
TSAN reports a false alarm for lock-order-inversion in DBWriteTest.IOErrorOnWALWritePropagateToWriteThreadFollower but Open and FlushWAL are not run concurrently. Suppressing the error by skipping FlushWAL in the test until TSAN is fixed.
The alternative would be to use
```
TSAN_OPTIONS="suppressions=tsan-suppressions.txt" ./db_write_test
```
but it does not seem straightforward to integrate it to our test infra.
Closes https://github.com/facebook/rocksdb/pull/3854
Differential Revision: D8000202
Pulled By: maysamyabandeh
fbshipit-source-id: fde33483d963a7ad84d3145123821f64960a4802
Summary:
This feature was introduced for universal compaction in cc01985d. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level.
Closes https://github.com/facebook/rocksdb/pull/3835
Differential Revision: D7957179
Pulled By: ajkr
fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4
Summary:
Currently manual_wal_flush if set in the options will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also update the related unit tests.
This PR is built on top of https://github.com/facebook/rocksdb/pull/3756
Closes https://github.com/facebook/rocksdb/pull/3824
Differential Revision: D7909153
Pulled By: maysamyabandeh
fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.
This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829
Differential Revision: D7915443
Pulled By: ajkr
fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
Summary:
`ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the `mmap`'d memory region. This PR:
- prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files
- adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer
Closes https://github.com/facebook/rocksdb/pull/3813
Differential Revision: D7891513
Pulled By: ajkr
fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d
Summary:
tsan flavor of this test occasionally times out in our test infra. The patch split the test to two, each working on half of the option range.
Before:
[ OK ] FaultTest/FaultInjectionTest.FaultTest/0 (5918 ms)
[ OK ] FaultTest/FaultInjectionTest.FaultTest/1 (5336 ms)
After:
[ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/0 (2930 ms)
[ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/1 (2676 ms)
[ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/2 (2759 ms)
[ OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/3 (2546 ms)
Closes https://github.com/facebook/rocksdb/pull/3819
Differential Revision: D7894975
Pulled By: maysamyabandeh
fbshipit-source-id: 809f1411cbcc27f8aa71a6b29a16b039f51b67c9
Summary:
The origin commit #3635 will hurt performance for users who aren't using range deletions, because unneeded std::set operations, so it was reverted by commit 44653c7b7a. (see #3672)
To fix this, move the set to and add a check in , i.e., file will be added only if is non-nullptr.
The db_bench command which find the performance regression:
> ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 > --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 > --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 > -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none
Before and after the modification, I re-run this command on the machine, the results of are as follows:
**fillrandom**
Table | P50 | P75 | P99 | P99.9 | P99.99 |
---- | --- | --- | --- | ----- | ------ |
before commit | 5.92 | 8.57 | 19.63 | 980.97 | 12196.00 |
after commit | 5.91 | 8.55 | 19.34 | 965.56 | 13513.56 |
**seekrandomwhilewriting**
Table | P50 | P75 | P99 | P99.9 | P99.99 |
---- | --- | --- | --- | ----- | ------ |
before commit | 1418.62 | 1867.01 | 3823.28 | 4980.99 | 9240.00 |
after commit | 1450.54 | 1880.61 | 3962.87 | 5429.60 | 7542.86 |
Closes https://github.com/facebook/rocksdb/pull/3800
Differential Revision: D7874245
Pulled By: ajkr
fbshipit-source-id: 2e8bec781b3f7399246babd66395c88619534a17
Summary:
Delete archive directory before WAL folder
since archive may be contained as a subfolder.
Also improve loop readability.
Closes https://github.com/facebook/rocksdb/pull/3797
Differential Revision: D7866378
Pulled By: riversand963
fbshipit-source-id: 0c45d97677ce6fbefa3f8d602ef5e2a2a925e6f5
Summary:
ManualCompactionTest.Test occasionally times out in tsan flavor of our test infra. The patch reduces the number of keys to make the test run faster. The change does not seem to negatively impact the coverage of the test.
Closes https://github.com/facebook/rocksdb/pull/3802
Differential Revision: D7865596
Pulled By: maysamyabandeh
fbshipit-source-id: b4f60e32c3ae1677e25506f71c766e33fa985785
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729