Summary:
We should validate this option, otherwise we may see
std::out_of_range thrown at: db/db_impl.cc:1124
1123 for (unsigned int i = 0; i <= end; i++) {
1124 std::string& to_delete = old_info_log_files.at(i);
1125 std::string full_path_to_delete =
1126 (immutable_db_options_.db_log_dir.empty()
Closes https://github.com/facebook/rocksdb/pull/1722
Differential Revision: D4379495
Pulled By: yiwu-arbug
fbshipit-source-id: e136552
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787
Differential Revision: D4441111
Pulled By: IslamAbdelRahman
fbshipit-source-id: c11f4c3
Summary:
Fix the bug when sync log fail, FlushJob::Run() will not be execute and
reference to cfd->current() will not be release.
Closes https://github.com/facebook/rocksdb/pull/1792
Differential Revision: D4441316
Pulled By: yiwu-arbug
fbshipit-source-id: 5523e28
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to releast log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*
To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768
Differential Revision: D4403265
Pulled By: reidHoruff
fbshipit-source-id: ce800ff
Summary:
DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper).
There's also some minor code cleanup along the way.
Closes https://github.com/facebook/rocksdb/pull/1639
Differential Revision: D4306571
Pulled By: yiwu-arbug
fbshipit-source-id: f050890
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
1. value 100 byte: 1.8% regular, 1.2% merge values
2. value 1k byte: 11.5% regular, 7.5% merge values
3. value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732
Differential Revision: D4374613
Pulled By: maysamyabandeh
fbshipit-source-id: a077f1a
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724
Differential Revision: D4368319
Pulled By: siying
fbshipit-source-id: cc2c746
Summary:
This PR update IngestExternalFile to return an error if we try to ingest a file into a dropped CF.
Right now if IngestExternalFile want to flush a memtable, and it's ingesting a file into a dropped CF, it will wait forever since flushing is not possible for the dropped CF
Closes https://github.com/facebook/rocksdb/pull/1657
Differential Revision: D4318657
Pulled By: IslamAbdelRahman
fbshipit-source-id: ed6ea2b
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595
Differential Revision: D4246569
Pulled By: tonek
fbshipit-source-id: d23f560
Summary:
If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for
the write request, then fail immediately with Status::Incomplete().
Closes https://github.com/facebook/rocksdb/pull/1527
Differential Revision: D4191405
Pulled By: maysamyabandeh
fbshipit-source-id: 7f3ce3f
Summary:
The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator.
Closes https://github.com/facebook/rocksdb/pull/1547
Differential Revision: D4207781
Pulled By: ajkr
fbshipit-source-id: 9d1c130
Summary:
Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction.
Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist.
Closes https://github.com/facebook/rocksdb/pull/1539
Differential Revision: D4203488
Pulled By: ajkr
fbshipit-source-id: 3b36279
Summary:
Remove the ticker count because:
* Having to reset the ticker count in WriteImpl is ineffiecent;
* It doesn't make sense to have it as a ticker count if multiple db
instance share a statistics object.
Closes https://github.com/facebook/rocksdb/pull/1531
Differential Revision: D4194442
Pulled By: yiwu-arbug
fbshipit-source-id: e2110a9
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477
Differential Revision: D4149836
Pulled By: maysamyabandeh
fbshipit-source-id: b3df19f
Summary:
Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL.
Closes https://github.com/facebook/rocksdb/pull/1486
Differential Revision: D4156064
Pulled By: reidHoruff
fbshipit-source-id: a6da8d9
Summary:
copied from: 5ebfd2623a
Opening existing RocksDB attempts recovery from log files, which uses
wrong sequence number to create the memtable. This is a regression
introduced in change a400336.
This change includes a test demonstrating the problem, without the fix
the test fails with "Operation failed. Try again.: Transaction could not
check for conflicts for operation at SequenceNumber 1 as the MemTable
only contains changes newer than SequenceNumber 2. Increasing the value
of the max_write_buffer_number_to_maintain option could reduce the
frequency of this error"
This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname
and me.
Closes https://github.com/facebook/rocksdb/pull/1458
Differential Revision: D4143791
Pulled By: reidHoruff
fbshipit-source-id: 5a25033
Summary:
A deadlock is possible if this happen
(1) Writer thread is stopped because it's waiting for compaction to finish
(2) Compaction is waiting for current IngestExternalFile() calls to finish
(3) IngestExternalFile() is waiting to be able to acquire the writer thread
(4) WriterThread is held by stopped writes that are waiting for compactions to finish
This patch fix the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread.
This patch include a unittest to reproduce the described scenario
Closes https://github.com/facebook/rocksdb/pull/1480
Differential Revision: D4151646
Pulled By: IslamAbdelRahman
fbshipit-source-id: 09b39db
Summary:
Note: reviewed in https://reviews.facebook.net/D65115
- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464
Differential Revision: D4131753
Pulled By: ajkr
fbshipit-source-id: be86559
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.
added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456
Differential Revision: D4111271
Pulled By: ajkr
fbshipit-source-id: 6e388d4
Summary:
Changed BuildTable() (used for flush) to (1) add range
tombstones to the aggregator, which is used by CompactionIterator to
determine which keys can be removed; and (2) add aggregator's range
tombstones to the table that is output for the flush.
Closes https://github.com/facebook/rocksdb/pull/1438
Differential Revision: D4100025
Pulled By: ajkr
fbshipit-source-id: cb01a70
Summary:
Currently the compression suppport and fast CRC support information is printed as info level. They should be in the same level as options, which is header level.
Also add ZSTD to this printing.
Closes https://github.com/facebook/rocksdb/pull/1448
Differential Revision: D4106608
Pulled By: yiwu-arbug
fbshipit-source-id: cb9a076
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439
Differential Revision: D4100943
Pulled By: yiwu-arbug
fbshipit-source-id: 79f7218
Summary:
Currently, we don't use options.compression_per_level[0] as the compression style for L0 compression type, unless it is None. This behavior
doesn't look like on purpose. This diff will make sure L0 compress using the style of options.compression_per_level[0].
Reviewed and accepted in: https://reviews.facebook.net/D65607
Closes https://github.com/facebook/rocksdb/pull/1435
Differential Revision: D4099368
Pulled By: siying
fbshipit-source-id: cfbbdcd
Summary: Siying suggested to keep old code for normal mode prev() for safety
Test Plan: make check -j64
Reviewers: yiwu, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65439
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.
Test Plan: make all check -j64
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65121
Summary:
Changes in the diff
API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API
Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers
Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob
Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)
Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: jkedgar, hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65061
Summary:
Previously the WAL files that were avoided during recovery would never
be considered for deletion. That was because alive_log_files_ was only
populated when log files are created. This diff further populates
alive_log_files_ with existing log files that aren't flushed during recovery,
such that FindObsoleteFiles() can find them later.
Depends on D64053.
Test Plan: new unit test, verifies it fails before this change and passes after
Reviewers: sdong, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D64059
Summary:
Issue scenario:
(1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2
(2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files)
(3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency
Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile()
Test Plan: unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D64947
Summary: I accidentally left out these changes from my commit of D64053 due to
messing up the merge conflict resolution.
Test Plan: ./db_wal_test
Reviewers:
Subscribers:
Tasks:
Blame Revision: D64053
Summary:
This reverts commit 9e4aa798c3,
which doesn't handle all cases (see inline comment).
I reimplemented the logic as suggested in the initial PR: https://github.com/facebook/rocksdb/pull/1313.
This approach has two benefits:
- All the parsing/filtering of full_scan_candidate_files is kept together in PurgeObsoleteFiles.
- We only need to check whether log file is recycled in one place where we've already determined it's a log file
Test Plan:
new unit test, verified fails before the original fix, still passes
now.
Reviewers: IslamAbdelRahman, yiwu, sdong
Reviewed By: yiwu, sdong
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D64053
Summary:
1) The previous solution for Prev() prefix support is not clean.
Since I add api SeekForPrev(), now the Prev() can be symmetric to Next().
and we do not need SeekToLast() to be called in Prev() any more.
Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when
there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward.
2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor.
add test cases for the above two cases.
There are some tests for the SeekToLast() in Prev(), I will clean them later.
Test Plan: make all check
Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63933
Summary:
When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.
```
RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628
RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
```
Test Plan: Two unit tests added.
Reviewers: sdong, yiwu, IslamAbdelRahman
Reviewed By: yiwu
Subscribers: hermanlee4, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D64599
Summary:
Fix the conflict bug between AddFile() and CompactRange() by
- Make sure that no AddFile calls are running when asking CompactionPicker to pick compaction for manual compaction
- If AddFile() run after we pick the compaction for the manual compaction it will be aware of it since we will add the manual compaction to running_compactions_ after picking it
This will solve these 2 scenarios
- If AddFile() is running, we will wait for it to finish before we pick a compaction for the manual compaction
- If we already picked a manual compaction and then AddFile() started ... we ensure that it never ingest a file in a level that will overlap with the manual compaction
Test Plan: unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D64449
Summary: We didn't recompute compaction score on SetOptions, and end up not having compaction if no flush happens afterward. The PR fixing it.
Test Plan: See unit test.
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64167
Summary:
Since AddFile unlock/lock the mutex inside LogAndApply() we need to ensure that during this period other compactions cannot run since such compactions are not aware of the file we are ingesting and could create a compaction that overlap wit this file
this diff add
- WaitForAddFile() call that will ensure that no AddFile() calls are being processed right now
- Call `WaitForAddFile()` in 3 locations
-- When doing manual Compaction
-- When starting automatic Compaction
-- When doing CompactFiles()
Test Plan: unit test
Reviewers: lightmark, yiwu, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D64383
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behave for 2PC for now but will fix later.
This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id.
Test Plan: Added unit test.
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64275
Summary:
Mitigate regression bug of options.max_successive_merges hit during DB Recovery
For https://reviews.facebook.net/D62625
Test Plan: make all check
Reviewers: horuff, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62655
Summary: WritableFile::SetPreallocationBlockSize() requires parameter as size_t, and options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. WritableFile::SetPreallocationBlockSize() should return size_t to avoid build break if size_t is not uint64_t.
Test Plan: Run existing tests.
Reviewers: andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64137
Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller.
Test Plan: Add a unit test.
Reviewers: andrewkr, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63957
Summary: One more small refactor before I split DBOptions into mutable and immutable parts.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64047
Summary:
if one or more CFs had no data in the WAL, the log number that's used
by FindObsoleteFiles() wasn't updated. We need to treat this case the same as
if the data for that WAL had been flushed.
Test Plan: new unit test
Reviewers: IslamAbdelRahman, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63963
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63945
If log recycling is enabled with the rocksdb (recycle_log_file_num=16)
db->Writebatch is erroring out with keynotfound after ~5-6 hours of run
(1M seq but can happen to any workload I guess).See my detailed bug
report here (https://github.com/facebook/rocksdb/issues/1303).
This commit is the fix for this, a check is been added not to delete
the log file if it is already there in the recycle list.
Test Plan:
Unit tested it and ran the similar profile. Not reproducing anymore.
Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode
Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case)
Reviewers: andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61419
Summary:
After 1b8a2e8fdd, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition.
This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed.
Test Plan: Add a new test and run it.
Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff
Reviewed By: kradhakrishnan
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62625
Summary:
Move the manual memtable flush for databases containing data that has
bypassed the WAL from DBImpl's destructor to CancleAllBackgroundWork().
CancelAllBackgroundWork() is a publicly exposed API which allows
async operations performed by background threads to be disabled on a
database. In effect, this places the database into a "shutdown" state
in advance of calling the database object's destructor. No compactions
or flushing of SST files can occur once a call to this API completes.
When writes are issued to a database with WriteOptions::disableWAL
set to true, DBImpl::has_unpersisted_data_ is set so that
memtables can be flushed when the database object is destroyed. If
CancelAllBackgroundWork() has been called prior to DBImpl's destructor,
this flush operation is not possible and is skipped, causing unnecessary
loss of data.
Since CancelAllBackgroundWork() is already invoked by DBImpl's destructor
in order to perform the thread join portion of its cleanup processing,
moving the manual memtable flush to CancelAllBackgroundWork() ensures
data is persisted regardless of client behavior.
Test Plan:
Write an amount of data that will not cause a memtable flush to a rocksdb
database with all writes marked with WriteOptions::disableWAL. Properly
"close" the database. Reopen database and verify that the data was
persisted.
Reviewers: IslamAbdelRahman, yiwu, yoshinorim, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62277
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.
Test Plan: I am adding unit tests.
Reviewers: andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61503
Summary:
My understanding is that the purpose of write stall triggers are to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes.
Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions are recalculate only when write stall options are changed.
Test Plan: See the new test. Removed two tests that are no longer valid.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61437
Summary:
Stale log files can be deleted out of order. This can happen for various reasons. One of the reason is that no data is ever inserted to a column family and we have an optimization to update its log number, but not all the old log files are cleaned up (the case shown in the unit tests added). It can also happen when we simply delete multiple log files out of order.
This causes data corruption because we simply increase seqID after processing the next row and we may end up with writing data with smaller seqID than what is already flushed to memtables.
In DB recovery, for the oldest files we are replaying, if there it contains no data for any column family, we ignore the sequence IDs in the file.
Test Plan: Add two unit tests that fail without the fix.
Reviewers: IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: hermanlee4, yoshinorim, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60891
Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost.
Test Plan: Add a new unit test which will fail without the diff.
Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: yiwu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60915
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60765
Summary:
When write stalls because of auto compaction is disabled, or stop write trigger is reached,
user may change these two options to unblock writes. Unfortunately we had issue where the write
thread will block the attempt to persist the options, thus creating a deadlock. This diff
fix the issue and add two test cases to detect such deadlock.
Test Plan:
Run unit tests.
Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60627
Summary:
DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter
We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list).
Test Plan:
Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure:
1. files key ranges are not overlapping with each other
2. each file key range dont overlap with the DB key range
3. make sure no snapshots are held
Reviewers: andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58587
Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again.
Test Plan: Run the unit test and see the logging.
Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60081
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925
Summary: Reported in T11889874. When registering the cleanup function we should copy the option so that we can still access it if ReadOptions is deleted.
Test Plan: Add a unit test to reproduce this bug.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60087
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.
Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59499
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level
Test Plan: unit tests
Reviewers: jkedgar, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D59637
Summary: filter_deltes is not a frequently used feature. Remove it.
Test Plan: Run all test suites.
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59427
Summary:
We dont report the bytes that we ingested from AddFile which make the write amplification numbers incorrect
Update InternalStats and add logging for AddFile()
Test Plan: Make sure the code compile and existing tests pass
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59763
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.
Test Plan: See unit test.
Reviewers: dhruba, horuff, sdong
Reviewed By: sdong
Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57813
Summary: Backup options file to private directory
Test Plan:
backupable_db_test.cc, BackupOptions
Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D59373
Summary:
MemTableList::current_ could be written by background flush thread and
simultaneously read in the user thread (NumNotFlushed() is used in
SwitchMemtable()). Use the lock to prevent this case. Found the error from tsan.
Related: D58833
Test Plan:
$ OPT=-g COMPILE_WITH_TSAN=1 make -j64 db_test
$ TEST_TMPDIR=/dev/shm/rocksdb ./db_test --gtest_filter=DBTest.RepeatedWritesToSameKey
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59139
* Create a callback for memtable becoming immutable
Create a callback for memtable becoming immutable
Create a callback for memtable becoming immutable
moved notification outside the lock
Move sealed notification to unlocked portion of SwitchMemtable
* fix lite build
Summary:
We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds:
- logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables,
- logging of number of remaining immutable memtables after a flush.
Test Plan: ran tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58833
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options
Test Plan: UT
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, lgalanis, dhruba
Differential Revision: https://reviews.facebook.net/D58653
kPointInTimeRecovery is indistinguishable from
kTolerateCorruptedTailRecords in recycle mode since we define
the "end" of the log as the first corrupt record we encounter.
kAbsoluteConsistency doesn't make sense because even a clean
shutdown leaves old junk at the end of the log file.
Signed-off-by: Sage Weil <sage@redhat.com>
Summary:
NotifyOnCompactionCompleted can unlock the mutex.
That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles().
Test Plan:
added unittest
existing unittest
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58065
Summary:
- Make sure we clean up recovered_transactions_ on DBImpl destructor
- delete leaked txns and env in TransactionTest
Test Plan: Run transaction_test under valgrind
Reviewers: sdong, andrewkr, yhchiang, horuff
Reviewed By: horuff
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58263
Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs.
Test Plan: db_log_iter_test
Reviewers: andrewkr
Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman
Differential Revision: https://reviews.facebook.net/D58053
Summary:
1. prepare()
2. crash
3. recover
4. commit()
5. crash
6. data is lost
This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0.
The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0.
Test Plan: Provided test.
Reviewers: sdong
Subscribers: andrewkr, hermanlee4, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57729
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]
The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.
Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.
We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))
So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?
Test Plan: provided test.
Reviewers: sdong
Subscribers: andrewkr, leveldb, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D57645
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.
modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
- log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.
- Recovery/writepath is now aware of prepared batches and commit and rollback markers.
Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.
Reviewers: IslamAbdelRahman, sdong
Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D56919
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.
I have also updated CompactionJobInfo to include the compression algorithm used in compaction
Test Plan:
added new unittest
existing unittests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: lightmark, andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D57669
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case.
Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57399
Summary: It looks like we mistakenly enable JEMALLOC even if it's not available on the machine, that's why travis is failing
Test Plan:
check on my devserver
check on my mac
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57345
Summary:
Introduced option to dump malloc statistics using new option flag.
Added new command line option to db_bench tool to enable this
funtionality.
Also extended build to support environments with/without jemalloc.
Test Plan:
1) Build rocksdb using `make` command. Launch the following command
`./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
--num=10000000` end verified that jemalloc dump is present in LOG file.
2) Build rocksdb using `DISABLE_JEMALLOC=1 make db_bench -j32` and ran
the same db_bench tool and found the following message in LOG file:
"Please compile with jemalloc to enable malloc dump".
3) Also built rocksdb using `make` command on MacOS to verify behavior
in non-FB environment.
Also to debug build configuration change temporary changed
AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
command line. For both 2) and 3) this flag was not present.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57321
Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files.
Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do.
Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57057
Summary:
In https://reviews.facebook.net/D56271, we fixed an issue where
we consider flush as compaction. However, that makes us mistakenly
count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.)
This patch removes the one incremented in db_impl.
Test Plan: db_test
Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57111
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.
Test Plan: Try db_bench and see the stats are printed out.
Reviewers: yhchiang
Reviewed By: yhchiang
Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56769
Summary: The code assumes that if use_mmap_reads is on then use_os_buffer is also on. This make sense as by using memory mapped files for reading you are expecting the OS to cache what it needs. Add code to make sure the user does not turn off use_os_buffer when they turn on use_mmap_reads
Test Plan: New test: DBTest.MMapAndBufferOptions
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56397
Summary:
Added the column family name to the properties block. This property
is omitted only if the property is unavailable, such as when RepairDB()
writes SST files.
In a next diff, I will change RepairDB to use this new property for
deciding to which column family an existing SST file belongs. If this
property is missing, it will add it to the "unknown" column family (same
as its existing behavior).
Test Plan:
New unit test:
$ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55605
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
record.size() should not be less than 12.
This "magic number" seems to be the WriteBatch header (8 byte sequence
and 4 byte count). Replaced all the places where "12" was used
by WriteBatchInternal::kHeader.
Summary:
Fix for https://github.com/facebook/mysql-5.6/issues/174
When there is no old files to purge, vector.at(i) function was crashing
if (old_info_log_file_count != 0 &&
old_info_log_file_count >= db_options_.keep_log_file_num) {
std::sort(old_info_log_files.begin(), old_info_log_files.end());
size_t end = old_info_log_file_count - db_options_.keep_log_file_num;
for (unsigned int i = 0; i <= end; i++) {
std::string& to_delete = old_info_log_files.at(i);
Added check to old_info_log_file_count be non zero.
Test Plan: run existing tests
Reviewers: gunnarku, vasilep, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: andrewkr, webscalesql-eng, dhruba
Differential Revision: https://reviews.facebook.net/D55245
Summary:
- Keep track of obsolete manifests in VersionSet
- Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles()
- Added test case that verifies a stale manifest is deleted by a non-full purge
Test Plan:
$ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
Reviewers: IslamAbdelRahman, yoshinorim, sdong
Reviewed By: sdong
Subscribers: andrewkr, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D55269
Summary:
This is a follow-up patch of https://reviews.facebook.net/D54891.
As the information about files being compacted will also be used
when making compaction decision, it is necessary to update the compaction
score when a compaction plan has been made but not yet execute.
This patch adds a missing call to update the compaction score in
CompactFiles().
Test Plan: compact_files_test
Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55227
Summary:
Currently, when rocksdb tries to run manual compaction to refit data into a level,
there's a ReFitLevel() process that requires no bg work is currently running.
When RocksDB plans to ReFitLevel(), it will do the following:
1. pause scheduling new bg work.
2. wait until all bg work finished
3. do the ReFitLevel()
4. unpause scheduling new bg work.
However, as it pause scheduling new bg work at step one and waiting for all bg work
finished in step 2, RocksDB will stop flushing until all bg work is done (which
could take a long time.)
This patch fix this issue by changing the way ReFitLevel() pause the background work:
1. pause scheduling compaction.
2. wait until all bg work finished.
3. pause scheduling flush
4. do ReFitLevel()
5. unpause both flush and compaction.
The major difference is that. We only pause scheduling compaction in step 1 and wait
for all bg work finished in step 2. This prevent flush being blocked for a long time.
Although there's a very rare case that ReFitLevel() might be in starvation in step 2,
but it's less likely the case as flush typically finish very fast.
Test Plan: existing test.
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55029
Summary:
The call to
```
CaptureCurrentFileNumberInPendingOutputs()
```
should be before
```
versions_->NewFileNumber()
```
Right now we are not actually protecting the file from being deleted
Test Plan: make check
Reviewers: sdong, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D54645
Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number.
Test Plan: Add two unit tests for it.
Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54921
Summary:
In some situations the DB will scan all existing files in the DB path and delete the ones that are Obsolete.
If this happen during adding an external sst file. this could cause the file to be deleted while we are adding it.
This diff fix this issue
Test Plan:
unit test to reproduce the bug
existing unit tests
Reviewers: sdong, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D54627
Summary: DBImpl::SyncWAL() releases db mutex before calling DBImpl::MarkLogsSynced(), while inside DBImpl::MarkLogsSynced() we assert there is none or one outstanding log file. However, a memtable switch can happen in between and causing two or outstanding logs there, failing the assert. The diff adds a unit test that repros the issue and fix the assert so that the unit test passes.
Test Plan: Run the new tests.
Reviewers: anthony, kolmike, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54621
Summary:
We started getting two kinds of crashes since we started using `DB::CompactFiles()`:
(1) `CompactFiles()` fails saying something like "/data/logdevice/4440/shard12/012302.sst: No such file or directory", and presumably makes DB read-only,
(2) DB fails to open saying "Corruption: Can't access /267000.sst: IO error: /data/logdevice/4440/shard1/267000.sst: No such file or directory".
AFAICT, both can be explained by background thread deleting compaction output as "obsolete" while it's being written, before it's committed to manifest. If it ends up committed to the manifest, we get (2); if compaction notices the disappearance and fails, we get (1). The internal tasks t10068021 and t10134177 have some details about the investigation that led to this.
Test Plan: `make -j check`; the new test fails to reopen the DB without the fix
Reviewers: yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, sdong
Differential Revision: https://reviews.facebook.net/D54561
Summary:
Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB.
When this limit is exceeded WriteImpl() will fail and return Status::Aborted()
Test Plan: unit testing
Reviewers: yhchiang, anthony, andrewkr, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D53763
Summary: There is an issue in DBImpl::WriteImpl where if an empty writebatch comes in and sync=true then the logs will be marked as being synced yet the sync never actually happens because there is no data in the writebatch. This causes the next incoming batch to hang while waiting for the logs to complete syncing. This fix syncs logs even if the writebatch is empty.
Test Plan: DoubleEmptyBatch unit test in transaction_test.
Reviewers: yoshinorim, hermanlee4, sdong, ngbronson, anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54057
Summary:
Add kSstFileTier to ReadTier, which allows Get and MultiGet to
read only directly from SST files and skip mem-tables.
kSstFileTier = 0x2 // data in SST files.
// Note that this ReadTier currently only supports
// Get and MultiGet and does not support iterators.
Test Plan: add new test in db_test.
Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: igor, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53511
Summary:
copy from task 8196669:
1) Optimistic transactions do not support batching writes from different threads.
2) Pessimistic transactions do not support batching writes if an expiration time is set.
In these 2 cases, we currently do not do any write batching in DBImpl::WriteImpl() because there is a WriteCallback that could decide at the last minute to abort the write. But we could support batching write operations with callbacks if we make sure to process the callbacks correctly.
To do this, we would first need to modify write_thread.cc to stop preventing writes with callbacks from being batched together. Then we would need to change DBImpl::WriteImpl() to call all WriteCallback's in a batch, only write the batches that succeed, and correctly set the state of each batch's WriteThread::Writer.
Test Plan: Added test WriteWithCallbackTest to write_callback_test.cc which creates multiple client threads and verifies that writes are batched and executed properly.
Reviewers: hermanlee4, anthony, ngbronson
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52863
Summary:
Before this diff, there were duplicated constants to refer to properties (user-
facing API had strings and InternalStats had an enum). I noticed these were
inconsistent in terms of which constants are provided, names of constants, and
documentation of constants. Overall it seemed annoying/error-prone to maintain
these duplicated constants.
So, this diff gets rid of InternalStats's constants and replaces them with a map
keyed on the user-facing constant. The value in that map contains a function
pointer to get the property value, so we don't need to do string matching while
holding db->mutex_. This approach has a side benefit of making many small
handler functions rather than a giant switch-statement.
Test Plan: db_properties_test passes, running "make commit-prereq -j32"
Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53253
Summary:
Concurrent memtable adds were incorrectly computing
the last sequence number for a write batch group when the
write batches were not solitary. This is the cause of
https://github.com/facebook/mysql-5.6/issues/155
Test Plan:
1. unit tests
2. new unit test
3. parallel db_bench stress tests with batch size of 10 and asserts enabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: IslamAbdelRahman, MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D53595
Summary:
If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions.
The watermarks are calculated based on slowdown thresholds.
Test Plan:
Add new test cases in column_family_test.
Adding more unit tests.
Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D53409
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size
Test Plan: unit tests
Reviewers: rven, anthony, yhchiang, sdong
Reviewed By: sdong
Subscribers: igor, lovro, march, dhruba
Differential Revision: https://reviews.facebook.net/D50469
Summary:
We can avoid the dependency by forward-declaring ColumnFamilyData and
then treating it as a black box. That means callers of ThreadStatusUtil need to
explicitly provide more options, even if they can be derived from the
ColumnFamilyData, since ThreadStatusUtil doesn't include the definition.
This is part of a series of diffs to eliminate circular dependencies between
directories (e.g., db/* files depending on util/* files and vice-versa).
Test Plan:
$ ./db_test --gtest_filter=DBTest.GetThreadStatus
$ make -j32 commit-prereq
Reviewers: sdong, yhchiang, IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53361
Summary:
I split the db-specific test points out into a separate file under db/
directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it
from compiling previously; see https://reviews.facebook.net/D36825.
Test Plan:
compilation works now, below command works, will also run "make xfunc".
$ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32
Reviewers: sdong, yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D53343
Summary:
SstFileWriter may create an sst file with no entries
Right now this will fail when being ingested using DB::AddFile() saying that the keys are corrupted
Test Plan: make check
Reviewers: yhchiang, rven, anthony, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D52815
Makefile adjust paths for solaris build
Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available
db_compaction_test.cc Initialise a variable to avoid a compilation error
db_impl.cc Include <alloca.h>
db_test.cc Include <alloca.h>
Environment.java recognise solaris envrionment
options_bulder.cc Make log unambiguous
geodb_impl.cc Make log and floor unambiguous
Summary:
While running the myrocks regression suite, I found that while
dropping a table soon after inserting rows into it resulted in an
assertion failure in CheckConsistencyForDeletes for not finding
a file which was recently added or moved. Marking the files to be
deleted as being compacted before calling LogAndApplyChange
fixed the assertion failures.
Test Plan: DBCompactionTest.DeleteFileRange
Reviewers: IslamAbdelRahman, anthony, yhchiang, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, yoshinorim, leveldb
Differential Revision: https://reviews.facebook.net/D52599
Summary: DBImpl::GetLatestSequenceForKey() can do memcpy's to load a value that will never be used. This can be optimized by changing all the Get() functions called to optionally not fetch the value (and only fetch the sequencenumber).
Test Plan: optimistic_transaction_test and transaction_test
Reviewers: anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D52227
Summary:
We need to clean the job context if we end up not deleting any
files because no files are in the range specified.
Test Plan: DBCompactionTest.DeleteFileRange
Reviewers: sdong, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52467
Summary:
myrocks seems to build rocksdb using
-Wmissing-field-initializers (and treats warnings as errors). This diff
adds that flag to the rocksdb build, and fixes the compilation failures
that result. I have not checked for any other differences in the build
flags for rocksdb build as part of myrocks.
Test Plan: make check
Reviewers: sdong, rven
Reviewed By: rven
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D52443
Summary:
This is an initial diff for providing the ability to delete
files which are completely within a given range of keys.
Test Plan: DBCompactionTest.DeleteRange
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52293
Summary: Fix some CLANG errors introduced in 7d87f02799
Test Plan: Build with both of CLANG and gcc
Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52329
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
Summary:
Add CompactionReason to CompactionJobInfo
This will allow users to understand why compaction started which will help options tuning
Test Plan:
added new tests
make check -j64
Reviewers: yhchiang, anthony, kradhakrishnan, sdong, rven
Reviewed By: rven
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51975
Summary:
This patch fixes https://github.com/facebook/mysql-5.6/issues/121
There is a recent change in rocksdb to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the Column family is stalled because L0 is full, and no writes can go through, it is possible the column family may never have a new compaction request get scheduled. EnableAutoCompaction() should probably schedule an new flush and compaction event when it resets disable_auto_compaction.
Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush,
SchedulePendingCompaction, as well as MaybeScheduleFlushOrcompaction on all the
column families to avoid the situation above.
This is still a first pass for feedback.
Could also just call SchedePendingFlush and SchedulePendingCompaction directly.
Test Plan:
Run on Asan build
cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000
Ensure that it no longer hangs during the test.
Reviewers: hermanlee4, yhchiang, anthony
Reviewed By: anthony
Subscribers: leveldb, yhchiang, dhruba
Differential Revision: https://reviews.facebook.net/D51747
Summary:
When there are waiting manual compactions, we need to signal
them after removing the current manual compaction from the deque.
Test Plan: ColumnFamilytTest.SameCFManualManualCommaction
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D52119
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted
ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted
Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.
Benchmark results (using https://phabricator.fb.com/P20083553)
```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G /home/tec/local/normal.4K.Snappy/db10077
// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G /home/tec/local/zero.8K.LZ4/db10077
// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
// --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
// --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"
// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 1.73s 576.97m
// BM_StringPiece 103.74% 1.67s 598.55m
// ============================================================================
// Match rate : 1000000 / 1000000
// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 611.99ms 1.63
// BM_StringPiece 203.76% 300.35ms 3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```
Test Plan: Unit tests
Reviewers: sdong, igor, anthony, yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, lovro, adsharma
Differential Revision: https://reviews.facebook.net/D48999
Summary:
List of changes:
1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size.
2) Remove unnecessary checks before calling delete operator.
3) Increase code correctness by using size_t type when getting vector's size.
4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage.
5) Fix various lint errors pointed out by 'arc lint'.
Test Plan:
Code review and build:
git diff
make clean
make -j 32 commit-prereq
arc lint
Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51849
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973