Summary:
logs_.back() is called out of DB mutex, which can cause data race. We move the access into the DB mutex protection area.
Closes https://github.com/facebook/rocksdb/pull/1774
Reviewed By: AsyncDBConnMarkedDownDBException
Differential Revision: D4417472
Pulled By: AsyncDBConnMarkedDownDBException
fbshipit-source-id: 2da1f1e
Summary:
GetAndRefSuperVersion() should not be called again in the same thread before ReturnAndCleanupSuperVersion() is called.
If we have a compaction filter that is using DB::Get, This will happen
```
CompactFiles() {
GetAndRefSuperVersion() // -- first call
..
CompactionFilter() {
GetAndRefSuperVersion() // -- second call
ReturnAndCleanupSuperVersion()
}
..
ReturnAndCleanupSuperVersion()
}
```
We solve this issue in the same way Iterator is solving it, but using GetReferencedSuperVersion()
This was discovered in https://github.com/facebook/mysql-5.6/issues/427 by alxyang
Closes https://github.com/facebook/rocksdb/pull/1803
Differential Revision: D4460155
Pulled By: IslamAbdelRahman
fbshipit-source-id: 5e54322
Summary:
GetAndRefSuperVersionUnlocked
ReturnAndCleanupSuperVersionUnlocked
GetColumnFamilyHandleUnlocked
Are dead code that are not used any where
Closes https://github.com/facebook/rocksdb/pull/1802
Differential Revision: D4459948
Pulled By: IslamAbdelRahman
fbshipit-source-id: 30fa89d
Summary:
Fixing GetApproximateSize bug for the case of computing stats for mem tables only.
Closes https://github.com/facebook/rocksdb/pull/1795
Differential Revision: D4445507
Pulled By: IslamAbdelRahman
fbshipit-source-id: 3905846
Summary:
We should validate this option, otherwise we may see
std::out_of_range thrown at: db/db_impl.cc:1124
1123 for (unsigned int i = 0; i <= end; i++) {
1124 std::string& to_delete = old_info_log_files.at(i);
1125 std::string full_path_to_delete =
1126 (immutable_db_options_.db_log_dir.empty()
Closes https://github.com/facebook/rocksdb/pull/1722
Differential Revision: D4379495
Pulled By: yiwu-arbug
fbshipit-source-id: e136552
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787
Differential Revision: D4441111
Pulled By: IslamAbdelRahman
fbshipit-source-id: c11f4c3
Summary:
Fix the bug when sync log fail, FlushJob::Run() will not be execute and
reference to cfd->current() will not be release.
Closes https://github.com/facebook/rocksdb/pull/1792
Differential Revision: D4441316
Pulled By: yiwu-arbug
fbshipit-source-id: 5523e28
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to releast log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*
To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768
Differential Revision: D4403265
Pulled By: reidHoruff
fbshipit-source-id: ce800ff
Summary:
DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper).
There's also some minor code cleanup along the way.
Closes https://github.com/facebook/rocksdb/pull/1639
Differential Revision: D4306571
Pulled By: yiwu-arbug
fbshipit-source-id: f050890
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
1. value 100 byte: 1.8% regular, 1.2% merge values
2. value 1k byte: 11.5% regular, 7.5% merge values
3. value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732
Differential Revision: D4374613
Pulled By: maysamyabandeh
fbshipit-source-id: a077f1a
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724
Differential Revision: D4368319
Pulled By: siying
fbshipit-source-id: cc2c746
Summary:
This PR update IngestExternalFile to return an error if we try to ingest a file into a dropped CF.
Right now if IngestExternalFile want to flush a memtable, and it's ingesting a file into a dropped CF, it will wait forever since flushing is not possible for the dropped CF
Closes https://github.com/facebook/rocksdb/pull/1657
Differential Revision: D4318657
Pulled By: IslamAbdelRahman
fbshipit-source-id: ed6ea2b
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595
Differential Revision: D4246569
Pulled By: tonek
fbshipit-source-id: d23f560
Summary:
If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for
the write request, then fail immediately with Status::Incomplete().
Closes https://github.com/facebook/rocksdb/pull/1527
Differential Revision: D4191405
Pulled By: maysamyabandeh
fbshipit-source-id: 7f3ce3f
Summary:
The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator.
Closes https://github.com/facebook/rocksdb/pull/1547
Differential Revision: D4207781
Pulled By: ajkr
fbshipit-source-id: 9d1c130
Summary:
Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction.
Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist.
Closes https://github.com/facebook/rocksdb/pull/1539
Differential Revision: D4203488
Pulled By: ajkr
fbshipit-source-id: 3b36279
Summary:
Remove the ticker count because:
* Having to reset the ticker count in WriteImpl is ineffiecent;
* It doesn't make sense to have it as a ticker count if multiple db
instance share a statistics object.
Closes https://github.com/facebook/rocksdb/pull/1531
Differential Revision: D4194442
Pulled By: yiwu-arbug
fbshipit-source-id: e2110a9
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477
Differential Revision: D4149836
Pulled By: maysamyabandeh
fbshipit-source-id: b3df19f
Summary:
Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL.
Closes https://github.com/facebook/rocksdb/pull/1486
Differential Revision: D4156064
Pulled By: reidHoruff
fbshipit-source-id: a6da8d9
Summary:
copied from: 5ebfd2623a
Opening existing RocksDB attempts recovery from log files, which uses
wrong sequence number to create the memtable. This is a regression
introduced in change a400336.
This change includes a test demonstrating the problem, without the fix
the test fails with "Operation failed. Try again.: Transaction could not
check for conflicts for operation at SequenceNumber 1 as the MemTable
only contains changes newer than SequenceNumber 2. Increasing the value
of the max_write_buffer_number_to_maintain option could reduce the
frequency of this error"
This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname
and me.
Closes https://github.com/facebook/rocksdb/pull/1458
Differential Revision: D4143791
Pulled By: reidHoruff
fbshipit-source-id: 5a25033
Summary:
A deadlock is possible if this happen
(1) Writer thread is stopped because it's waiting for compaction to finish
(2) Compaction is waiting for current IngestExternalFile() calls to finish
(3) IngestExternalFile() is waiting to be able to acquire the writer thread
(4) WriterThread is held by stopped writes that are waiting for compactions to finish
This patch fix the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread.
This patch include a unittest to reproduce the described scenario
Closes https://github.com/facebook/rocksdb/pull/1480
Differential Revision: D4151646
Pulled By: IslamAbdelRahman
fbshipit-source-id: 09b39db
Summary:
Note: reviewed in https://reviews.facebook.net/D65115
- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464
Differential Revision: D4131753
Pulled By: ajkr
fbshipit-source-id: be86559
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.
added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456
Differential Revision: D4111271
Pulled By: ajkr
fbshipit-source-id: 6e388d4
Summary:
Changed BuildTable() (used for flush) to (1) add range
tombstones to the aggregator, which is used by CompactionIterator to
determine which keys can be removed; and (2) add aggregator's range
tombstones to the table that is output for the flush.
Closes https://github.com/facebook/rocksdb/pull/1438
Differential Revision: D4100025
Pulled By: ajkr
fbshipit-source-id: cb01a70
Summary:
Currently the compression suppport and fast CRC support information is printed as info level. They should be in the same level as options, which is header level.
Also add ZSTD to this printing.
Closes https://github.com/facebook/rocksdb/pull/1448
Differential Revision: D4106608
Pulled By: yiwu-arbug
fbshipit-source-id: cb9a076
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439
Differential Revision: D4100943
Pulled By: yiwu-arbug
fbshipit-source-id: 79f7218
Summary:
Currently, we don't use options.compression_per_level[0] as the compression style for L0 compression type, unless it is None. This behavior
doesn't look like on purpose. This diff will make sure L0 compress using the style of options.compression_per_level[0].
Reviewed and accepted in: https://reviews.facebook.net/D65607
Closes https://github.com/facebook/rocksdb/pull/1435
Differential Revision: D4099368
Pulled By: siying
fbshipit-source-id: cfbbdcd
Summary: Siying suggested to keep old code for normal mode prev() for safety
Test Plan: make check -j64
Reviewers: yiwu, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65439
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.
Test Plan: make all check -j64
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65121
Summary:
Changes in the diff
API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API
Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers
Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob
Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)
Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: jkedgar, hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65061
Summary:
Previously the WAL files that were avoided during recovery would never
be considered for deletion. That was because alive_log_files_ was only
populated when log files are created. This diff further populates
alive_log_files_ with existing log files that aren't flushed during recovery,
such that FindObsoleteFiles() can find them later.
Depends on D64053.
Test Plan: new unit test, verifies it fails before this change and passes after
Reviewers: sdong, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D64059
Summary:
Issue scenario:
(1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2
(2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files)
(3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency
Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile()
Test Plan: unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D64947
Summary: I accidentally left out these changes from my commit of D64053 due to
messing up the merge conflict resolution.
Test Plan: ./db_wal_test
Reviewers:
Subscribers:
Tasks:
Blame Revision: D64053
Summary:
This reverts commit 9e4aa798c3,
which doesn't handle all cases (see inline comment).
I reimplemented the logic as suggested in the initial PR: https://github.com/facebook/rocksdb/pull/1313.
This approach has two benefits:
- All the parsing/filtering of full_scan_candidate_files is kept together in PurgeObsoleteFiles.
- We only need to check whether log file is recycled in one place where we've already determined it's a log file
Test Plan:
new unit test, verified fails before the original fix, still passes
now.
Reviewers: IslamAbdelRahman, yiwu, sdong
Reviewed By: yiwu, sdong
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D64053
Summary:
1) The previous solution for Prev() prefix support is not clean.
Since I add api SeekForPrev(), now the Prev() can be symmetric to Next().
and we do not need SeekToLast() to be called in Prev() any more.
Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when
there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward.
2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor.
add test cases for the above two cases.
There are some tests for the SeekToLast() in Prev(), I will clean them later.
Test Plan: make all check
Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63933
Summary:
When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.
```
RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628
RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
```
Test Plan: Two unit tests added.
Reviewers: sdong, yiwu, IslamAbdelRahman
Reviewed By: yiwu
Subscribers: hermanlee4, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D64599
Summary:
Fix the conflict bug between AddFile() and CompactRange() by
- Make sure that no AddFile calls are running when asking CompactionPicker to pick compaction for manual compaction
- If AddFile() run after we pick the compaction for the manual compaction it will be aware of it since we will add the manual compaction to running_compactions_ after picking it
This will solve these 2 scenarios
- If AddFile() is running, we will wait for it to finish before we pick a compaction for the manual compaction
- If we already picked a manual compaction and then AddFile() started ... we ensure that it never ingest a file in a level that will overlap with the manual compaction
Test Plan: unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D64449
Summary: We didn't recompute compaction score on SetOptions, and end up not having compaction if no flush happens afterward. The PR fixing it.
Test Plan: See unit test.
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64167
Summary:
Since AddFile unlock/lock the mutex inside LogAndApply() we need to ensure that during this period other compactions cannot run since such compactions are not aware of the file we are ingesting and could create a compaction that overlap wit this file
this diff add
- WaitForAddFile() call that will ensure that no AddFile() calls are being processed right now
- Call `WaitForAddFile()` in 3 locations
-- When doing manual Compaction
-- When starting automatic Compaction
-- When doing CompactFiles()
Test Plan: unit test
Reviewers: lightmark, yiwu, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D64383
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behave for 2PC for now but will fix later.
This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id.
Test Plan: Added unit test.
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64275
Summary:
Mitigate regression bug of options.max_successive_merges hit during DB Recovery
For https://reviews.facebook.net/D62625
Test Plan: make all check
Reviewers: horuff, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62655
Summary: WritableFile::SetPreallocationBlockSize() requires parameter as size_t, and options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. WritableFile::SetPreallocationBlockSize() should return size_t to avoid build break if size_t is not uint64_t.
Test Plan: Run existing tests.
Reviewers: andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64137
Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller.
Test Plan: Add a unit test.
Reviewers: andrewkr, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63957
Summary: One more small refactor before I split DBOptions into mutable and immutable parts.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64047
Summary:
if one or more CFs had no data in the WAL, the log number that's used
by FindObsoleteFiles() wasn't updated. We need to treat this case the same as
if the data for that WAL had been flushed.
Test Plan: new unit test
Reviewers: IslamAbdelRahman, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63963
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63945
If log recycling is enabled with the rocksdb (recycle_log_file_num=16)
db->Writebatch is erroring out with keynotfound after ~5-6 hours of run
(1M seq but can happen to any workload I guess).See my detailed bug
report here (https://github.com/facebook/rocksdb/issues/1303).
This commit is the fix for this, a check is been added not to delete
the log file if it is already there in the recycle list.
Test Plan:
Unit tested it and ran the similar profile. Not reproducing anymore.
Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode
Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case)
Reviewers: andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61419
Summary:
After 1b8a2e8fdd, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition.
This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed.
Test Plan: Add a new test and run it.
Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff
Reviewed By: kradhakrishnan
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62625
Summary:
Move the manual memtable flush for databases containing data that has
bypassed the WAL from DBImpl's destructor to CancleAllBackgroundWork().
CancelAllBackgroundWork() is a publicly exposed API which allows
async operations performed by background threads to be disabled on a
database. In effect, this places the database into a "shutdown" state
in advance of calling the database object's destructor. No compactions
or flushing of SST files can occur once a call to this API completes.
When writes are issued to a database with WriteOptions::disableWAL
set to true, DBImpl::has_unpersisted_data_ is set so that
memtables can be flushed when the database object is destroyed. If
CancelAllBackgroundWork() has been called prior to DBImpl's destructor,
this flush operation is not possible and is skipped, causing unnecessary
loss of data.
Since CancelAllBackgroundWork() is already invoked by DBImpl's destructor
in order to perform the thread join portion of its cleanup processing,
moving the manual memtable flush to CancelAllBackgroundWork() ensures
data is persisted regardless of client behavior.
Test Plan:
Write an amount of data that will not cause a memtable flush to a rocksdb
database with all writes marked with WriteOptions::disableWAL. Properly
"close" the database. Reopen database and verify that the data was
persisted.
Reviewers: IslamAbdelRahman, yiwu, yoshinorim, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62277
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.
Test Plan: I am adding unit tests.
Reviewers: andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61503
Summary:
My understanding is that the purpose of write stall triggers are to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes.
Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions are recalculate only when write stall options are changed.
Test Plan: See the new test. Removed two tests that are no longer valid.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61437
Summary:
Stale log files can be deleted out of order. This can happen for various reasons. One of the reason is that no data is ever inserted to a column family and we have an optimization to update its log number, but not all the old log files are cleaned up (the case shown in the unit tests added). It can also happen when we simply delete multiple log files out of order.
This causes data corruption because we simply increase seqID after processing the next row and we may end up with writing data with smaller seqID than what is already flushed to memtables.
In DB recovery, for the oldest files we are replaying, if there it contains no data for any column family, we ignore the sequence IDs in the file.
Test Plan: Add two unit tests that fail without the fix.
Reviewers: IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: hermanlee4, yoshinorim, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60891
Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost.
Test Plan: Add a new unit test which will fail without the diff.
Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: yiwu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60915
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60765
Summary:
When write stalls because of auto compaction is disabled, or stop write trigger is reached,
user may change these two options to unblock writes. Unfortunately we had issue where the write
thread will block the attempt to persist the options, thus creating a deadlock. This diff
fix the issue and add two test cases to detect such deadlock.
Test Plan:
Run unit tests.
Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60627
Summary:
DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter
We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list).
Test Plan:
Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure:
1. files key ranges are not overlapping with each other
2. each file key range dont overlap with the DB key range
3. make sure no snapshots are held
Reviewers: andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58587
Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again.
Test Plan: Run the unit test and see the logging.
Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60081
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925
Summary: Reported in T11889874. When registering the cleanup function we should copy the option so that we can still access it if ReadOptions is deleted.
Test Plan: Add a unit test to reproduce this bug.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60087
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.
Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59499
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level
Test Plan: unit tests
Reviewers: jkedgar, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D59637
Summary: filter_deltes is not a frequently used feature. Remove it.
Test Plan: Run all test suites.
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59427
Summary:
We dont report the bytes that we ingested from AddFile which make the write amplification numbers incorrect
Update InternalStats and add logging for AddFile()
Test Plan: Make sure the code compile and existing tests pass
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59763
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.
Test Plan: See unit test.
Reviewers: dhruba, horuff, sdong
Reviewed By: sdong
Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57813
Summary: Backup options file to private directory
Test Plan:
backupable_db_test.cc, BackupOptions
Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D59373
Summary:
MemTableList::current_ could be written by background flush thread and
simultaneously read in the user thread (NumNotFlushed() is used in
SwitchMemtable()). Use the lock to prevent this case. Found the error from tsan.
Related: D58833
Test Plan:
$ OPT=-g COMPILE_WITH_TSAN=1 make -j64 db_test
$ TEST_TMPDIR=/dev/shm/rocksdb ./db_test --gtest_filter=DBTest.RepeatedWritesToSameKey
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59139
* Create a callback for memtable becoming immutable
Create a callback for memtable becoming immutable
Create a callback for memtable becoming immutable
moved notification outside the lock
Move sealed notification to unlocked portion of SwitchMemtable
* fix lite build
Summary:
We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds:
- logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables,
- logging of number of remaining immutable memtables after a flush.
Test Plan: ran tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58833
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options
Test Plan: UT
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, lgalanis, dhruba
Differential Revision: https://reviews.facebook.net/D58653
kPointInTimeRecovery is indistinguishable from
kTolerateCorruptedTailRecords in recycle mode since we define
the "end" of the log as the first corrupt record we encounter.
kAbsoluteConsistency doesn't make sense because even a clean
shutdown leaves old junk at the end of the log file.
Signed-off-by: Sage Weil <sage@redhat.com>
Summary:
NotifyOnCompactionCompleted can unlock the mutex.
That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles().
Test Plan:
added unittest
existing unittest
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58065
Summary:
- Make sure we clean up recovered_transactions_ on DBImpl destructor
- delete leaked txns and env in TransactionTest
Test Plan: Run transaction_test under valgrind
Reviewers: sdong, andrewkr, yhchiang, horuff
Reviewed By: horuff
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58263
Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs.
Test Plan: db_log_iter_test
Reviewers: andrewkr
Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman
Differential Revision: https://reviews.facebook.net/D58053
Summary:
1. prepare()
2. crash
3. recover
4. commit()
5. crash
6. data is lost
This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0.
The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0.
Test Plan: Provided test.
Reviewers: sdong
Subscribers: andrewkr, hermanlee4, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57729
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]
The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.
Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.
We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))
So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?
Test Plan: provided test.
Reviewers: sdong
Subscribers: andrewkr, leveldb, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D57645
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.
modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
- log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.
- Recovery/writepath is now aware of prepared batches and commit and rollback markers.
Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.
Reviewers: IslamAbdelRahman, sdong
Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D56919
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.
I have also updated CompactionJobInfo to include the compression algorithm used in compaction
Test Plan:
added new unittest
existing unittests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: lightmark, andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D57669
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case.
Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57399